<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Review of Event-Based Indoor Positioning and Navigation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chenyang Shi</string-name>
          <email>shicy@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ningfang Song</string-name>
          <email>Songnf@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenzhuo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuzhen Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boyi Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanxiao Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Jin</string-name>
          <email>jinjing@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Instrumentation and Opto-electronics Engineering, Beihang University</institution>
          ,
          <addr-line>Beijing, 100191</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>81</volume>
      <issue>98</issue>
      <fpage>2402</fpage>
      <lpage>2412</lpage>
      <abstract>
        <p>Event cameras are neuromorphic vision sensors that work differently from frame-based cameras. Instead of outputting global images of the scene at a fixed frequency, event cameras generate pixel-wise output asynchronously under illumination changes. Event cameras have desirable features that make them suitable for indoor navigation and positioning: high dynamic range, high temporal resolution (and consequently less motion blur) and low power consumption. However, as conventional algorithms are no longer valid for event cameras, they call for new methods to exploit their potential. This paper thus surveys sensors and algorithms for event-based navigation and positioning. We investigate event cameras (also known as Dynamic Vision Sensors), including their working principle, development trends and an overview of recently available sensors. We also summarize event-based algorithms that have maximized the superiority of event sensors in terms of ego-motion estimation, tracking and depth estimation. In the end, we discuss the advantages, challenges, hardware requirements and future of event-camera applications in indoor navigation and positioning.</p>
      </abstract>
      <kwd-group>
        <kwd>Event camera</kwd>
        <kwd>event-based vision</kwd>
        <kwd>indoor positioning</kwd>
        <kwd>indoor navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Event cameras are bio-inspired vision sensors. They respond to relative light changes in the natural
world in an asynchronous and sparse way, completely subverting the global-exposure imaging mode of
standard cameras. The event stream they output is fundamentally different from frames, boasting high
temporal resolution, high dynamic range and low power consumption. Thus, in many scenarios, event cameras are
an alternative to traditional cameras. Recent studies have shown that event cameras outperform
standard cameras in challenging positioning and mapping scenarios. The event stream naturally reflects the
edges of scenes while maintaining a low data rate, providing a new option for indoor positioning and navigation
tasks that require high real-time performance. However, there are still many challenges and difficulties
to be solved in practice. Therefore, we conduct a detailed investigation and discussion on the application
of event cameras in positioning and navigation to further tap the potential of event cameras and provide
researchers with ideas to solve the current difficulties encountered in this field.</p>
      <p>
        Currently, the main applications of event cameras are object detection and recognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], feature
extraction and tracking [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], motion estimation [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ], pose estimation [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], depth estimation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ],
video interpolation [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], super-resolution [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], 3D reconstruction and mapping [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], etc. The
survey in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reviewed the main applications and the development of event cameras. Different
from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our review focuses on the application of event cameras in visual navigation and positioning,
where the rationale for applying event cameras will be illustrated.
      </p>
      <p>Outline: The rest of the paper is organized as follows. Section.2 introduces the principle of event
cameras. Section.3 reviews algorithms for event-based ego-motion estimation and discusses their
superiority in complex conditions. Section.4 reviews event-based Visual Odometry (VO) and Visual
Inertial Odometry (VIO) for pose estimation and tracking and discusses the development tendency of
these methods. Section.5 discusses methods for event-based mapping, including depth estimation and
3D reconstruction. Section.6 summarizes the datasets for evaluating the performance of these methods.
The paper ends with a discussion (Section.7) and a conclusion (Section.8).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Event Camera</title>
      <p>
        Inspired by biological vision, event cameras have a completely different working mechanism compared
with traditional cameras [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Event cameras, also known as Dynamic Vision Sensors (DVSs), no longer
measure the "absolute" brightness at a constant rate, but asynchronously measure the brightness change
per pixel [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Each pixel works independently. Once the brightness change of a pixel exceeds the
threshold, an event will be output at the pixel location without waiting for global exposure, which
guarantees the low-latency feature. Additionally, this working mechanism fundamentally removes the
constraint of frame rate, leading to faster response to brightness changes (temporal resolution up to 1 MHz) and higher
dynamic range (up to 120 dB), making it capable of imaging extremely fast motion in bright or dark
environments. Moreover, as event cameras only transmit brightness changes, no output is generated
without relative displacement or a change of light between the environment and the camera, which largely
eliminates redundant data and reduces the transmission bandwidth and power consumption.
      </p>
      <p>
        The output of a DVS is an event stream. An event is represented as a tuple e = (x, y, p, t), in which
t represents the time when the brightness change occurs, recorded with microsecond resolution and
high sensitivity; the coordinates (x, y) are the position of the pixel where the brightness change occurs;
polarity p indicates the direction of brightness change [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. If the brightness increase exceeds the
threshold, the polarity is +1 (ON Event). If the brightness decrease exceeds the threshold, the polarity is
–1 (OFF Event).
      </p>
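      <p>The triggering condition can be written compactly as follows (a standard formulation in the event-camera literature; the notation, including the contrast threshold C, is ours rather than taken from a specific cited work):</p>
      <disp-formula>
        <tex-math><![CDATA[
\Delta L(x, y, t) = \log I(x, y, t) - \log I(x, y, t - \Delta t), \qquad
e = (x, y, p, t) \ \text{is emitted when}\ p\,\Delta L(x, y, t) \ge C,\quad p \in \{+1, -1\},
        ]]></tex-math>
      </disp-formula>
      <p>where Δt is the time elapsed since the last event at the same pixel.</p>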
      <p>Specifically, DVSs refer to sensors whose pixels only contain circuits that trigger events, the
mechanism of which is shown in Figure. 1. DAVISs, however, refer to sensors whose pixels can also
carry out global exposure.</p>
      <p>When the change in log intensity reaches the ON threshold (upper boundary) or the OFF threshold
(lower boundary), the comparator outputs an ON or OFF event signal, which is connected with the global
exposure signal and the Address-Event-Representation (AER) handshake circuit through an OR gate. The
row request signal is then output through the handshake circuit. When the row request is answered, the
column request signal is sent, and the column response signal is returned through the decision tree. The
event is read out and the pixel coordinates are obtained through the address encoder.</p>
      <p>In conclusion, the superiorities of the DVS make it especially suitable for intelligent systems such as
Unmanned Aerial Vehicles (UAVs), aircraft, missiles, smart projectiles and high-speed robots carrying out
tasks such as target detection and tracking, motion estimation and autonomous navigation in indoor and
outdoor environments.</p>
      <p>
        With years of development, the dynamic vision sensor has made progress towards higher resolution,
smaller pixel size and higher readout speed [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. At present, the resolution of the mainstream DVS has
reached 1 million pixels, with multiple modes such as grayscale mode, dynamic mode and optical flow mode.
Table.1 compares the parameters of the latest dynamic vision sensors with a traditional
image sensor.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Ego-motion estimation</title>
      <p>Ego-motion estimation recovers, with high accuracy, the state of a vision sensor from its output images of
the scene. Considering the dimensionality, the 2D problem aims at solving 3-DOF (Degree
of Freedom) motion (2-DOF translation plus 1-DOF rotation, or 3-DOF pure rotation), and the 3D problem
tackles the estimation of arbitrary 6-DOF motion.</p>
      <p>Frame-based ego-motion estimation is mostly realized through either filter-based or optimization-based
algorithms. Filter-based algorithms were the earliest applied in positioning and navigation, among
which the most widely used is the Extended Kalman Filter (EKF). They are incremental methods, where
the current camera state is considered relevant only to the camera state one timestamp ahead. This
presumption makes them suitable for small amounts of data yet is rather idealized in real situations. In
contrast, optimization-based algorithms are batch methods that consider all state estimation results
within a preceding interval to estimate the current camera state. They incorporate more information and
have proved to be more robust and accurate.</p>
      <!-- Figure 1 (residue removed): (a) the difference between the imaging mechanisms of a DVS and a standard camera; (b) the event-triggering principle of a DVS. Table 1 (residue removed): parameters of recent dynamic vision sensors (DAVIS346 and DVXplorer by IniVation, DVS Gen4 [22] by Prophesee, a Samsung sensor, the EB sensor [21] and CeleX-V [23] by CelePixel) compared with a traditional image sensor. -->
      <p>Event-based ego-motion estimation is carried out following two event processing patterns: (1)
processing event-by-event and (2) processing on groups of events. Event-by-event-based methods enable
every event to asynchronously update the system state, preserving the inherent high temporal resolution
of event sensors. However, an individual event fails to depict the change of the whole scene and may
suffer from strong noise signals. Therefore, it is reasonable to update the camera state with forms of
event groups, such as event maps (EM), time surfaces (TSs), event frames, voxel grids and so on.</p>
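      <p>As a concrete illustration of two such groupings, the following sketch (our own illustrative code, not taken from any cited work; the grid size and decay constant are assumptions) accumulates a polarity-signed event frame and an exponential-decay time surface from a list of (x, y, p, t) tuples with timestamps in microseconds:</p>
      <preformat><![CDATA[
import numpy as np

def events_to_frame_and_time_surface(events, shape=(480, 640), tau=50e3):
    """Build an event frame (signed polarity accumulation per pixel) and a
    time surface (exponential decay of the most recent timestamp per pixel)."""
    H, W = shape
    frame = np.zeros((H, W))
    last_ts = np.zeros((H, W))
    for x, y, p, t in events:                 # events ordered by timestamp
        frame[int(y), int(x)] += p            # ON events add +1, OFF events add -1
        last_ts[int(y), int(x)] = t           # keep the latest timestamp per pixel
    t_ref = max(t for _, _, _, t in events)   # reference time for the time surface
    time_surface = np.exp(-(t_ref - last_ts) / tau)
    return frame, time_surface
      ]]></preformat>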
      <p>Under these patterns, the estimation problem is usually addressed within three kinds of frameworks:
filter-based, optimization-based and Artificial Neural Network (ANN)-based frameworks. An overview
of recent works on event-based ego-motion estimation can be seen in Table.2.</p>
      <sec id="sec-3-1">
        <title>3.1. Filter-based framework</title>
        <p>Probabilistic (Bayesian) filters, including Kalman filters, EKFs and particle filters (PFs), update the present
camera state from prior states.</p>
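        <p>In generic Bayes-filter form (a standard recursion, stated here for illustration rather than taken from any specific cited work), the posterior over the camera state given all measurements so far is obtained by propagating the previous posterior through the motion model and correcting it with the latest measurement, which in event-by-event processing can be a single event:</p>
        <disp-formula>
          <tex-math><![CDATA[
p(\mathbf{x}_k \mid \mathbf{z}_{1:k}) \;\propto\; p(\mathbf{z}_k \mid \mathbf{x}_k) \int p(\mathbf{x}_k \mid \mathbf{x}_{k-1})\, p(\mathbf{x}_{k-1} \mid \mathbf{z}_{1:k-1})\, d\mathbf{x}_{k-1}.
          ]]></tex-math>
        </disp-formula>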
        <p>Probabilistic filters have grown to be major pose estimation methods in event-by-event processing
scenarios, because they naturally fit the characteristics of events: (1) filters operate on the asynchronous data of
events, preserving high temporal resolution, and (2) filters remain applicable under the limited
computing resources available for event processing. [24] proposed the first 6-DOF high-speed camera tracking algorithm in
random natural scenes. A robust filter combining Bayesian estimation and posterior approximation of a
distribution in the exponential family was put forward, enabling event-by-event pose updates from an
existing photometric map of the scene. This work revealed 6-DOF high-speed tracking capabilities of
event-based methods and freed the tracking algorithm from limitations of scene texture.</p>
        <p>In recent years, the filter-based framework has also become workable for groups of events with the
contribution of event outlier rejection techniques. For instance, [25] presented an EKF that updated the
camera pose for event packets collected within small temporal windows of 100. This was made possible
by an event-to-line matching which validated or discarded events quickly before they were stacked for
estimation.</p>
        <p>To summarize, filter-based methods suit the asynchronous nature of events and are applicable to
both event-by-event and event-group algorithms. Broadly speaking, they appear to be used less than
other methods, especially in complex scenes, due to the considerable resources they consume to
compute and store camera states.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Optimization-based framework</title>
        <p>Optimization is another dominant means of ego-pose estimation, which is mostly carried out on event
groups. In practice, the optimization of the camera pose is realized through the optimization of a loss
function, which takes on different forms for different algorithms and optimization objectives [47].</p>
        <p>For example, [28] tracked an event camera under a maximum likelihood optimization framework from a
photometric 3D map. The optimization objective was the error between the measured intensity change
from event frames and the predicted intensity change calculated from the given photometric 3D map.
[30] presented an enhanced motion tracker that first used a TS-based method in all circumstances, and
then applied an EM-based method to optimize pose parameters when the optimization problem might
degenerate. [33] proposed a 2D translation velocity estimation algorithm which could be seen as the
back-end of a VIO system. The loss function was built on the so-called Continuous Event-Line
Constraint that describes the relationship between line projections from events and the ego-motion of the
event camera. The optimization objective was the geometric distance between the reprojected 3D line
and the events.</p>
        <p>Event-based optimization algorithms depart from conventional frame-based algorithms in that most
involve motion compensation to eliminate noise and motion blur from accumulated event groups. In
motion compensation algorithms, events are assumed to be triggered on the pixels that an edge
moves across. [35] put forward the first unifying framework of motion compensation on the assumption
that the ego-motion is uniform within a small time interval. It made a representative contribution, the
Contrast Maximization (CMax) framework, which produced motion-compensated edge-like event images
for 6-DOF camera pose estimation. It estimated the parameters of the motion that best fit a group of
events by warping events to a reference time and maximizing their alignment, producing a sharp image
of warped events (IWE). This fundamental framework was later refined by several works [36, 37, 38].
[39] was another milestone that proposed the Entropy Minimization (EMin) framework. It estimated
motion directly in 3D space rather than projecting events onto image planes as CMax [35] does. Therefore,
it can solve motion problems in arbitrary dimensions by optimizing a family of entropy loss functions for
the minimal dispersion. [26] addressed ego-motion estimation with a novel probabilistic approach that
modeled event alignment as a spatio-temporal Poisson point process. Camera rotation was estimated by
maximizing the joint probability of events, which achieved higher accuracy than the CMax [35], AEMin [39]
and EMin [39] models in most scenarios.</p>
        <!-- Figure 2 (residue removed): a typical ANN-based estimation pipeline, in which asynchronous events (x, y, p, t) are converted over an integration time into synchronous event frames, features and vectorized descriptors are extracted, and rotation is estimated. -->
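        <p>The core of the CMax idea can be sketched compactly for pure rotational motion (our own simplified sketch under a small-time-interval assumption, not the implementation of [35]; the intrinsics K and image size are assumed to be given):</p>
        <preformat><![CDATA[
import numpy as np

def contrast_of_warped_events(omega, events, t_ref, K, shape):
    """Warp events (rows of x, y, p, t) to a reference time under a constant
    angular velocity omega = (wx, wy, wz) and return the variance (contrast)
    of the resulting image of warped events (IWE)."""
    fx, fy, cx, cy = K
    xs, ys, ts = events[:, 0], events[:, 1], events[:, 3]   # polarity is ignored here
    X, Y = (xs - cx) / fx, (ys - cy) / fy                    # normalized image coordinates
    wx, wy, wz = omega
    # Rotational optical-flow field of a pinhole camera (standard motion-field model).
    u = X * Y * wx - (1.0 + X ** 2) * wy + Y * wz
    v = (1.0 + Y ** 2) * wx - X * Y * wy - X * wz
    dt = ts - t_ref
    xw = np.clip(np.round(xs - fx * u * dt), 0, shape[1] - 1).astype(int)
    yw = np.clip(np.round(ys - fy * v * dt), 0, shape[0] - 1).astype(int)
    iwe = np.zeros(shape)
    np.add.at(iwe, (yw, xw), 1.0)          # accumulate warped events
    return iwe.var()                        # sharper (better aligned) IWE gives higher contrast

# The best-fitting rotation maximizes the contrast, e.g. via
# scipy.optimize.minimize(lambda w: -contrast_of_warped_events(w, ev, t0, K, shape), x0=np.zeros(3)).
      ]]></preformat>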
        <p>In general, optimization-based methods are considered the most widely adopted for event-based
ego-motion estimation. Optimization of the camera pose is implemented by minimizing specific loss
functions with the help of optimizers. Future works in this field may follow similar paths as prior works:
refining objective functions, inventing motion compensation methods for events to better depict the
change of scene, and upgrading existing algorithms to higher dimensions of motion.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ANN-based framework</title>
        <p>As deep-learning technologies have flourished in recent years, they have been widely applied to ego-motion
estimation. [48] introduced the first deep learning framework to retrieve the 6-DOF camera pose from
a single frame. This groundbreaking work found that, compared to conventional key-point methods,
using a Convolutional Neural Network (CNN) to learn deep features appeared to be more robust in
challenging scenarios such as noisy or unclear images. This conclusion boosted the development of ANN
architectures for 6-DOF pose estimation in computer vision.</p>
        <p>Accordingly, multi-layer ANNs have grown to be another mainstream method for ego-motion estimation
from events; a typical structure is shown in Figure.2. They train their networks to optimize loss functions
that include the state parameters of the camera (also discussed in [47]). One of the representative works using
an event-by-event deep learning method was [40], which applied an event-based on-chip Spiking Neural
Network (SNN) to the estimation of the 2-DOF head pose of the iCub robot. ANN-based methods operating on
event groups are too abundant to list, and most of them were supervised. The works of [41] and [43] both
used event frames as the neural network input, differing in that [41] stacked events in dual channels of
opposite polarities while [43] accumulated events in one single channel. Unlike prior works that only
used a CNN or LSTM to obtain depth and geometry information, the network in [43] was composed of both
a CNN to learn deep features from the event frames and a stack of LSTMs to learn spatial dependencies in
the image feature space, outperforming the state of the art in pose estimation in general and challenging
circumstances with short inference time. In recent years, unsupervised ANNs with loss functions built
without restrictive conditions were also developed to solve event-based ego-motion estimation tasks.
The earliest works that adopted a self-supervised or unsupervised manner still relied on input resources other than
events, like greyscale images [49], or auxiliary assumptions, like the photoconsistency assumption [50].
The most recent works have realized unsupervised ANNs that only take events as input [44, 45].</p>
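        <p>A minimal sketch of such a network (our own illustration of the dual-channel event-frame idea, written in PyTorch; the layer sizes and the axis-angle output parameterization are assumptions, not the architecture of any cited work) regresses a 6-DOF pose from a two-channel ON/OFF event frame:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class EventPoseNet(nn.Module):
    """CNN that regresses a 6-DOF pose (3-D translation + 3-D axis-angle rotation)
    from a 2-channel event frame with ON and OFF polarities stacked separately."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 6)

    def forward(self, event_frame):             # event_frame: (batch, 2, H, W)
        features = self.encoder(event_frame).flatten(1)
        return self.head(features)              # (batch, 6) pose vector

# Example: pose = EventPoseNet()(torch.zeros(1, 2, 180, 240))
      ]]></preformat>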
        <p>To conclude, multi-layer ANNs with architectures like CNNs and SNNs have been shown to perform well
in event-based ego-motion estimation. Either raw events or event groups were fed into ANNs, which
then regressed the camera pose. The birth of unsupervised networks that take pure events as input
has further simplified the problem. It is expected that novel networks will be designed to fully exploit
the advantages of different architectures.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Event-based tracking</title>
      <p>Estimating the pose and trajectory of a rigid body robustly and accurately is the first step towards
positioning and mapping. Vision sensors output refined textures of the scenes for 6-DOF motion
estimation. To achieve that, VO and VIO frameworks were proposed. However, under the restrictions of the
global exposure mechanism of standard cameras, frame-based VO often suffers from motion blur, especially
under rotational, high-speed and highly agile motion. Adding an Inertial Measurement Unit (IMU) to
VO increases the robustness of the system. In a tightly or loosely coupled VIO framework, the triaxial angular
velocity and acceleration output by the IMU provide pose estimates when feature tracking fails, and
visual features correct the drift of the IMU. Unfortunately, when feature tracking fails for a long time, the
drift cannot be corrected. In conclusion, robust visual information output is the key component of a VO or
VIO system.</p>
      <p>Event cameras can output information continuously in high temporal resolution without any motion
blur. Event-based VO and VIO are thus explored to deal with the problem of pose and trajectory estimation
in challenging scenes.</p>
      <sec id="sec-4-1">
        <title>4.1. Event-based visual odometry</title>
        <p>Visual Odometry is a dominant approach to estimating the pose and trajectory using the visual features of
scenes. If the real depth of visual features in the scene is estimated at the same time, a global map can
then be built, namely Simultaneous Localization and Mapping (SLAM). Similar to frame-based VO, two
configuration schemes of event-based VO are generally considered, namely monocular and stereo VO.
The majority of research focuses on monocular event-based VO, because this configuration
is simpler than the stereo scheme and comparable in terms of accuracy.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Angular velocity and rotation estimation</title>
          <p>Angular velocity and rotation estimation are fundamental processes of VO. An angular velocity estimation
method was presented in [51], confirming that event cameras are capable of estimating the 3D rotational
motion of a rigid body. Currently, learning-based and optimization methods dominate this field. For the
learning-based methods, SNNs [52] have been introduced to this task and are comparable to ANN-based
methods. For the optimization methods, CMax [53] and Rodrigues' rotation formula [54] are introduced
as objective functions for optimization.</p>
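          <p>For reference, Rodrigues' rotation formula maps a unit rotation axis k and an angle θ to a rotation matrix (we quote the standard form here for completeness; it is not reproduced from [54]):</p>
          <disp-formula>
            <tex-math><![CDATA[
R = I + \sin\theta\,[\mathbf{k}]_\times + (1 - \cos\theta)\,[\mathbf{k}]_\times^{2},
            ]]></tex-math>
          </disp-formula>
          <p>where [k]× denotes the skew-symmetric cross-product matrix of k.</p>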
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Monocular and stereo visual odometry</title>
          <p>In this task, optimization and filter-based methods are the mainstream. Filter-based methods were the first to
be proposed. In [55], an event-based VO named EVO was presented, which is considered the first
SLAM system that depends only on an event camera. The system transformed event streams into event
frames and tracked the poses by optimizing the error between the event image and a semi-dense map using
the inverse compositional Lucas-Kanade (L-K) method. A semi-dense 3D map was constructed by
Event-based Multi-View Stereo (EMVS), a geometric 3D reconstruction method. Optimization-based methods
are currently the most dominant. The choice of the objective function for optimization is the core
difference among these methods; examples include reprojection error minimization [56, 57], spatiotemporal
registration [58] and CMax [59]. It is worth noting that [59] presented an event-based VO called
ETAM using continuous ray warping and volumetric contrast maximization. It extended CMax into 3D
estimation, in which the target of optimization was maximizing the variance of the volume of warped events,
yielding the sharpest warped event frame. It then built a VO consisting of single-frame optimization
as the front-end based on CMax and a global optimization using a B-spline curve model as the back-end. In
addition, there are methods [32, 60] that utilize Time Surface Maps (TSMs) to build maps and track poses
while performing depth estimation.</p>
          <p>In summary, precise pose estimation and tracking form the front-end of event-based VO, and
optimization forms the back-end. The key step of event-based tracking is motion compensation, and the majority of
the aforementioned event-based VO systems selected optimization methods to achieve it. CMax and nonlinear
optimization have become mainstream in recent years, because filter-based methods inefficiently occupy large
storage for saving the landmarks of a map. Specifically, event-based tracking computes
an image of warped events and sharpens the image by optimization. The sharpness of the warped image
reflects the accuracy of pose tracking. Thus, the objective function and the optimization tool are critical
research topics. However, current event-based VO and VIO still follow the typical frameworks designed
for frame-based VO and VIO, so the unique characteristics of event cameras are still not fully manifested in
current processing.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Event-based visual inertial odometry</title>
        <p>Visual Inertial Odometry is based on VO, adding an IMU as a component of the pose and trajectory estimation
system. VIO surpasses VO in terms of accuracy and robustness on most occasions. Generally, VIO
comprises two parts, namely the front-end and the back-end. The front-end decides the data format
of the visual information, such as event frames or time surfaces, and extracts visual features as the input of
the back-end. The back-end refers to the fusion of visual information and IMU measurements;
the main approaches are filter-based methods [61], probabilistic methods [62] and optimization methods
[63, 64, 65, 66]. Typically, [64] presented an approach for tightly-coupled VIO named UltimateSLAM,
combining events, images and IMU measurements. To synchronize the two vision sensors, this approach
accumulated events into event frames at the same timestamps as the standard frames and motion-compensated
the event frames. It tracked the features of event frames and standard frames using the FAST corner detector
[67] and the L-K tracker [68], respectively. If the features could be triangulated and belonged to key-frames,
it fused these features and IMU measurements with nonlinear optimization, yielding the
pose and trajectory estimates. This pipeline was demonstrated in real time on a light-weight quadrotor
system.</p>
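        <p>The feature-tracking part of such a front-end can be sketched with standard OpenCV primitives (our own illustrative sketch of the detect-and-track step only, not the UltimateSLAM implementation; the detector threshold and corner budget are assumptions):</p>
        <preformat><![CDATA[
import cv2
import numpy as np

def detect_and_track(prev_img, next_img, max_corners=100):
    """Detect FAST corners on one 8-bit frame (an event frame or a standard frame)
    and track them into the next frame with a pyramidal Lucas-Kanade tracker."""
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(prev_img, None)
    pts = np.float32([kp.pt for kp in keypoints[:max_corners]]).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_img, next_img, pts, None)
    ok = status.ravel() == 1
    return pts[ok], next_pts[ok]    # matched feature pairs for triangulation / optimization
      ]]></preformat>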
        <p>The original intention of adding an IMU was to enhance the robustness of frame-based VO, because the IMU
can still maintain data output when standard cameras suffer from motion blur. The addition of the IMU
has also improved the robustness of event-based VO. However, event cameras can work stably in rotating
and high-speed scenes. Therefore, in theory, event cameras can perform without the addition of an IMU.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Event-based mapping</title>
      <p>Mapping is the final goal of SLAM. In Section.4, the world coordinates of landmarks are obtained to build
a sparse spatial map. However, more information of the scene is needed for diversified applications, which
prompts the birth of semi-dense and dense map. Compared to sparse map, semi-dense or dense map
models more or all of what is captured by the camera instead of only landmarks. These are commonly
used in robot navigation, where routes and obstructions should all be reconstructed. They are also
applied where 3D reconstruction with full texture of the scene or a target object is necessary for realistic
and aesthetic purposes. Driven by practical purposes, this section thus focuses on semi-dense and dense
mapping, which is equivalent to estimating the depth of objects in the scene. Note that mapping in most
cases is preceded by ego-motion estimation, which means that the poses of cameras at all timestamps
are given information.</p>
      <p>Depth estimation in frame-based SLAM is solved in three mainstream approaches: (1) in the case of
adopting a monocular camera, calculating the motion of the camera and then triangulating the depth of
space points; (2) in the case of adopting a stereo camera, triangulating the depth of space points with
the optical parallax between two frames; (3) using a depth estimation setup, for example an RGB-D camera
and lidar, to directly obtain depth information. In comparison with the third approach, the former two
approaches involve significantly more computing resources and are more fragile, but they
are more robust in large-scale outdoor scenes.</p>
      <p>With event cameras emerging, event-based monocular and stereo depth estimation methods have arisen,
inheriting the former two frame-based approaches. Meanwhile, event-based depth estimation
using structured light has been developed and works for both monocular and stereo scenarios.</p>
      <sec id="sec-5-1">
        <title>5.1. Monocular depth estimation</title>
        <p>Table. 3 lists recent event-based monocular depth estimation methods, classified according to different
criteria on method and experiment. Depth estimation from a monocular event camera is a challenging
task because of the hardship of data association. Specifically, the temporal relationship between events cannot
be directly acquired. Therefore, early methods for event-based monocular depth estimation involved
additional information, such as an intensity image, in order to address the data association issue. Works in
recent years have simplified the work of mapping by eliminating those auxiliary conditions.</p>
        <!-- Table 3 (residue removed): recent event-based monocular depth estimation methods (Rebecq [69], Gallego [35], Haessig [73], Chaney [74], Zhu [44], Carrió [75], Baudron [76], Gehrig [77]), classified among other criteria by map density (semi-dense vs. dense). "Own data" refers to a dataset used that is not open source. -->
        <p>[69] did the pioneering work that reconstructed a semi-dense depth map from monocular event streams
without requiring event associations or intensity images. It generalized the space-sweep
algorithm, which estimates 3D structure from frame-based MVS [70] data without traditional data association
[71], to a moving event camera (EMVS). In this work, individual events were considered to back-project
corresponding rays that spanned spatial structures, and events from multiple views split the space up into
disparity space image (DSI) voxels [72]. A ray counter counting the rays that traversed each voxel was
formed to determine the ray density per voxel, and a semi-dense map was obtained by computing voxels
with a local maximum of ray density, which corresponds to a structural point of the scene. [35] solved
the same problem as [69], estimating the depth of 3D structures from multi-view events. It worked under
the optimization-based CMax framework mentioned in Section.3.2, where events are warped into
motion-corrected images, and the correct depth could be found where patches of warped events had the
highest variance.</p>
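        <p>A much-simplified sketch of this ray-counting idea (our own illustration, not the EMVS implementation of [69]; camera poses, intrinsics and the depth discretization are assumed to be given) back-projects each event at a set of candidate depths and votes in a DSI defined in a reference view:</p>
        <preformat><![CDATA[
import numpy as np

def vote_dsi(events, poses, K, ref_pose, depths, dsi_shape):
    """Accumulate ray-counting votes in a disparity space image (DSI).
    events: iterable of (x, y, p, t); poses: per-event (R, t) mapping camera -> world;
    ref_pose: (R_ref, t_ref) of the reference view; depths: candidate depth planes."""
    H, W, D = dsi_shape
    dsi = np.zeros(dsi_shape)
    K_inv = np.linalg.inv(K)
    R_ref, t_ref = ref_pose
    for (x, y, _p, _t), (R, t) in zip(events, poses):
        ray = K_inv @ np.array([x, y, 1.0])          # bearing vector in the event-camera frame
        for z in depths:                              # sample the back-projected ray
            p_world = R @ (ray * z) + t               # 3D hypothesis in world coordinates
            p_ref = R_ref.T @ (p_world - t_ref)       # same point in the reference camera frame
            if p_ref[2] <= 0:
                continue
            u, v, _w = K @ (p_ref / p_ref[2])         # project into the reference view
            ui, vi = int(round(u)), int(round(v))
            k = np.searchsorted(depths, p_ref[2])     # depth bin of the reference view
            if 0 <= ui < W and 0 <= vi < H and k < D:
                dsi[vi, ui, k] += 1.0                 # one vote per traversed voxel
    return dsi   # local maxima of ray density along depth give semi-dense structure
      ]]></preformat>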
        <p>Most works in recent years addressed the mapping problem in an ANN-based fashion. However, due
to the asynchronous nature of event streams, data association appeared to be a major hardship, especially
for deep-learning methods. Attempts were made to achieve better event alignment by applying variable
network architectures, renovating algorithms or adjusting event representation inputs. [73] transplanted
the model of depth-from-defocus estimation to event-based SLAM. This work presented a novel
SNN approach to the depth-from-defocus problem for depth map reconstruction, considering events to be
ideal spike inputs to the SNN. The core of this network was a focus detection network based on
Leaky Integrate-and-Fire neurons, composed of two input neurons for ON and OFF polarity events respectively.
[74] designed an ANN specifically for environments with a ground plane. It was trained to learn the ratio
between the height of a point above the ground plane and its depth in the event camera frame, after which
height and depth information could be decomposed easily given the ground plane calibration.</p>
        <p>Some other works discussed event representations for preserving the spatio-temporal information of
event streams. CNNs [44, 75, 77] have been introduced for this task. [44] constructed a CNN with an unsupervised
encoder-decoder architecture for depth prediction. It took discretized volumes of events as input to
preserve the temporal distribution of the events as well as to remove motion blur. Meanwhile, the Recurrent
Neural Network (RNN) has been introduced to handle the asynchronous data of events combined with frames.
[77] did this by applying an encoder-decoder architecture based on UNet, which maintained an internal state that
was updated asynchronously by event or frame input and could be decoded into a depth estimate
at any timestamp.</p>
        <p>Overall, depth is estimated via the projection and coordinate transformation of features in monocular
SLAM. Monocular depth estimation methods attempted to address the hardship of data association,
which is to recover the temporal association between events. Among these methods, the ones based
on deep learning have shown more robustness, because they can integrate several cues from the event
stream, and thus have drawn great attention from researchers. The rise in the ability of novel methods to
estimate depth from monocular events has also resulted in denser maps, which provide more detailed
information of the scene for more realistic 3D reconstruction and more accurate navigation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Stereo depth estimation</title>
        <p>It is feasible to use frame-based stereo systems to estimate depth, because the shutters of the two cameras
are triggered synchronously, so feature extraction and matching for the left and right images are directly
performed at the same timestamps. However, in a stereo system composed of two event cameras,
pixel matching between the left and right cameras is difficult. The principles of event co-occurrence and
the epipolar constraint are often used to estimate the depth: the two events triggered by the same edge in
3D space lie on corresponding epipolar lines of the left and right cameras. However, due to the existence
of latency and noise, it is difficult to achieve this at the pixel level. In summary, the key step
of depth estimation in an event-based stereo system is finding the corresponding events of both cameras.</p>
        <!-- Figure 3 caption (displaced residue): the projector casts light patterns onto the scene; an event camera extracts features along the illuminated patterns to generate event streams, which in some works are further aggregated into event frames to depict features more clearly, with green lines representing ON events and red lines representing OFF events. -->
        <p>The most significant theory, hardships and algorithms of event-based stereo depth estimation were
surveyed by [78]. For one thing, it introduced the supporting principle for stereo vision that the disparity
(the horizontal displacement) between the two views of a stereo camera is inversely proportional to the depth. For
another, it outlined the core problem of obtaining disparity, which is to match corresponding events
from the two views, along with the mismatching problem incurred by the high temporal resolution and high
sensitivity of event sensors.</p>
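        <p>This inverse relationship can be stated explicitly for a rectified stereo pair (standard stereo geometry, quoted for completeness; f is the focal length, b the baseline and d the disparity):</p>
        <disp-formula>
          <tex-math><![CDATA[
Z = \frac{f\, b}{d}.
          ]]></tex-math>
        </disp-formula>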
        <p>As was mentioned in [78], a correspondence exists between disparity information and depth
information in stereo problems. [79] thus realized event-based disparity estimation by introducing lifetime
estimation of single events, which can be used for map reconstruction. It raised the accuracy of disparity
estimation by generating sharp gradient images from lifetime matching between corresponding events
from the two sensors. [80] utilized the velocity of the event camera for generating disparity estimates. [81]
developed a disparity mapping network with the stereo framework of [82] as the baseline, keeping the
event embedding and stereo matching sub-networks of the previous study. In the meantime, it made
major architectural modifications to the image reconstruction and feature aggregation sub-networks, integrating
a cross-semantic attention mechanism and modulating event features with reconstructed
image features through a stacked dilated spatially-adaptive denormalization mechanism.</p>
        <p>
          In addition, window-based methods [83, 84], the uniqueness constraint [85] and optimization methods [
          <xref ref-type="bibr" rid="ref7">7, 32, 86, 87,
88</xref>
          ] are feasible for event matching. Furthermore, frame-based deep learning methods [31, 86, 89, 90, 91,
92, 93, 94, 95, 96, 97] were applied to address these problems. The above works took input from a pair
of event sensors. Distinctly, [98, 99] went down a different route, in which the so-called stereo setup
included a frame-based camera and an event-based camera. [98] estimated dense disparity from stereo
frames when they were available, predicted the disparity using odometry information, and tracked the
disparity asynchronously using the optical flow of events between frames.
        </p>
        <p>In summary, depth is estimated via stereo matching using the disparity between two sensors in stereo SLAM.
The accuracy and efficiency of event correspondence between the two cameras are the key criteria for evaluating
stereo mapping algorithms.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Depth Estimation Using Structured Light</title>
        <p>Structured light (SL) is considered the most reliable technique in depth estimation. When applied in
event-based SLAM, the hardware setup of an SL system mostly includes a Digital Light Processing (DLP)
lightcrafter module casting simple or encoded light patterns onto the illuminated scene with a mirror array
reflecting light back, and an event camera or a pair of event cameras receiving the light to generate images.
A common setup for event-based monocular SL can be seen in Figure.3. Its main purpose is to simplify
the extraction of features and facilitate data association between two views. In event-based systems, the
measurement of spatial points using SL is accomplished by calibrating the relative pose between
the lightcrafter and the event camera, followed by triangulation once events of corresponding points
are identified by data association.</p>
        <p>A universal calibration procedure for event-driven DLP-based monocular depth estimation systems
was first proposed by [100]. Its main contribution was a Temporal Matrices Mapping (TMM) calibration
algorithm that calibrates the event camera and the galvanometer of the DLP with two temporal matrices attained
through scanning a front-parallel plane and the corresponding scanning speed.</p>
        <p>As for triangulation, recent works on monocular depth estimation using SL largely focused on adopting
high-frequency light patterns to fit the high temporal resolution of event cameras, such as
frequency-tagged light patterns [101], blinking lights of a pseudo-random pattern [103] and periodic fringe patterns
[91]. [102] projected temporally modulated light of two wavelengths and triggered events by the bispectral
difference induced by the light absorbance difference of a certain medium. The merits of high temporal
resolution and high dynamic range of event cameras were fully exploited to obtain an unaffected bispectral
difference for depth calculation. [104] built a novel formulation comprising a laser point projector and
an event camera. It estimated dense depth by maximizing the spatio-temporal consistency between
data from the projector and the event camera, when interpreted as a stereo system. This work took
advantage of the focusing power of the laser point light source and the data redundancy suppression, high
temporal resolution and HDR of the event camera to produce more robust mapping in high-speed motion.
[105] adopted a similar hardware system to [104] but followed a more adaptive path in SL illumination,
where the density of projected laser light in a certain area depended on the intensity of scene activity in that
area, to reduce power consumption.</p>
        <p>SL can also be integrated with an event-based stereo setup to simplify stereo correspondence. A typical
work on event-based stereo depth estimation using SL was [106], in which a mirror-galvanometer-driven
laser served as the SL projector to generate blobs in space. These blobs triggered events that were
captured by two event cameras and served as the key points for triangulation.</p>
        <p>In general, the integration of SL has made depth features directly accessible to SLAM
systems. Hardware innovations have exploited the attractive properties of events, with diverse light
encoding patterns adapting to the high temporal resolution of event cameras, while laser point
light sources have been widely applied to exploit the HDR merit of event cameras.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Resources</title>
      <p>In this section we summarize the available resources (datasets and simulators) for event-based navigation and
positioning, as listed in Table.4. Most of these resources have been widely applied by researchers to
test the accuracy, robustness and computational efficiency of their event-based SLAM algorithms. The
results serve as benchmarks for the performance of new methods, which has played a significant role in
driving the techniques in this field forward.</p>
      <sec id="sec-6-1">
        <title>6.1. Resources for ego-motion estimation</title>
        <p>One of the features of ego-pose estimation datasets is that they provide existing information, in most cases a
reconstructed depth map. [24, 28] released datasets for event-based camera tracking from an existing
photometric depth map constructed by an RGB-D camera. Event streams were generated by a DVS and
the known photometric depth map was constructed from prior mapping by an RGB-D camera. The latter
further improved the accuracy of the 3D reconstructed map by attaching ElasticFusion poses from a motion
capture system. [107] released the first dataset specifically for ornithopter robot perception in indoor
and outdoor scenarios. This dataset was generated to prove the advantage of event cameras applied to
flapping-wing ornithopters.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Resources for tracking</title>
        <p>Datasets and simulators for VO, VIO and SLAM are too numerous to list. DAVISs, RGB-D cameras or stereo
cameras, external motion capture systems like OptiTrack, and odometry systems on hardware platforms
are commonly used for generating events, depth and ground-truth motion, respectively [108, 109]. One of
the earliest and most classic event-based SLAM resources was [27], which released an event-based
dataset and simulator for pose estimation, visual odometry and SLAM covering a variety of scenes. Later
work [34, 110] boosted the study of event-based positioning and navigation with datasets derived from
aggressive high-speed motions in changeable illumination scenes that were beyond the capabilities
of existing tracking algorithms. In light of real-world SLAM applications, [111] proposed to include
multi-sensor configurations for dealing with motion disturbances and illumination conditions together.</p>
        <p>Catering to the rise of deep learning methods in event-based vision, datasets devised to train and
test the performance of ANNs were released accordingly. [42] published the first annotated DAVIS
driving recordings. This dataset was specially built for end-to-end (E2E) CNN and CNN/RNN networks
in VO/SLAM. Vehicle speed, GPS position and driver steering, throttle, brake captured from the car’s
on-board diagnostics interface were given for computing ground truth. This work was expanded by
[112] in terms of road types, weather and daylight conditions. Following these works, [113] published
the largest event-based dataset with ground truth of independently moving entities. This dataset was
recorded specially for testing deep-learning-based SLAM algorithms targeted for cameras in anomalous
motion, which was made possible by including multiple labeled independently moving entities into the
dataset. [114] released the first event-based dataset which included accurate pixel-wise motion masks,
ego-motion and ground-truth depth for testing learning-based motion segmentation methods.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Resources for mapping</title>
        <p>Resources for mapping so far were mostly generated by synchronised stereo setups originally built for stereo
depth estimation. However, they can also be adapted to monocular depth estimation by only using events
and images from one of the cameras, left or right. [31] released the first and most widely used
event-based stereo depth dataset for driving, which was later improved by [115] into the first high-resolution,
large-scale stereo event dataset in driving scenarios. [94] published synthetic sequences of rotating
3D objects and real-world sequences of fast-rotating objects for testing the ability of algorithms
to operate on non-rigid, rapidly rotating objects.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Current and future application of event-based positioning and navigation</title>
        <p>In general, event cameras have the ability to estimate rotation, depth and pose in complex environments
(whether indoors or outdoors) with low power consumption and without interruption. Meanwhile, they are
very suitable for deployment in navigation and positioning scenarios that frequently involve complex
maneuvers, strict restrictions on power consumption and a high dependency on visual information.
Specifically, UAV autonomous navigation, high-speed object detection and obstacle avoidance are some
examples.</p>
        <p>Researchers will continue to work on event-based navigation and positioning algorithms that are
more efficient and easier to implement on hardware. Faster and more accurate motion compensation
approaches will hopefully be worked out to output high-quality poses for tracking. At the same time,
in parallel pipelines, the depth of the scene can be eficiently estimated, whether monocular or stereo,
and finally realize positioning and mapping in complex environments, improving robustness and speed
while minimizing power consumption.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Advantages of event camera in indoor positioning and navigation</title>
        <p>For the indoor environments concerned in this paper, event cameras can be incorporated as vision sensors in
positioning and navigation. The high dynamic range (HDR), high temporal resolution and low power
consumption of event cameras better cater to the complex characteristics of indoor environments, ensuring robust
performance of the system.</p>
        <sec id="sec-7-2-1">
          <title>7.2.1. High dynamic range</title>
          <p>Unlike outdoor fields where natural light offers consistent illumination bright enough for cameras to
capture scene features, indoor environments are often dynamic, with complex structures illuminated by
artificial lighting. Event cameras boast a high dynamic range that can reach 140 dB, compared to a common
60 dB for frame-based cameras. This property is especially required for navigation and positioning in
extreme working scenarios, for example in natural open fields over long durations, where illumination
conditions may vary largely within a long period of time. High dynamic range ensures that navigation is
consistent and robust to environmental alterations.</p>
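          <p>The quoted figures follow the usual logarithmic definition of dynamic range as the ratio between the largest and smallest measurable light intensities (a standard definition, stated here for clarity rather than taken from a specific cited sensor datasheet):</p>
          <disp-formula>
            <tex-math><![CDATA[
\mathrm{DR}_{\mathrm{dB}} = 20 \log_{10}\!\left(\frac{I_{\max}}{I_{\min}}\right),
            ]]></tex-math>
          </disp-formula>
          <p>so 120 dB corresponds to roughly six orders of magnitude of illumination.</p>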
        </sec>
        <sec id="sec-7-2-2">
          <title>7.2.2. High temporal resolution</title>
          <p>Indoor scenes are usually limited in space with many obstructions. Therefore, vehicles or robots are
frequently put under rapid maneuver control to avoid crashing into obstacles. In these circumstances, the
high temporal resolution of event cameras is needed for robust SLAM. Event cameras are capable
of outputting event streams at microsecond-level temporal resolution in the lab and sub-millisecond level in
the real world, enabling the navigation system to reconstruct obstacles rapidly and thus vehicles to react
quickly. This also results in less motion blur than in common frame-based cameras, so that events are
generated by actual features within the scene rather than noise caused by high-speed motion.</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>7.2.3. Low power consumption</title>
          <p>The limited scale of indoor environments places restrictions on the volume and power dissipation of
the hardware system for complicated vehicle motion and more durable navigation. In event sensors,
pixels only react to brightness changes that reach a previously defined threshold. While the system-level
power consumption of a traditional camera may be around 1-2 W, that of event cameras can reach lower
than 24 mW. The power-saving feature makes event cameras applicable to indoor onboard navigation and
positioning on compact equipment that may not be able to carry power packs with large batteries.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Challenges of event cameras in indoor positioning and navigation</title>
        <sec id="sec-7-3-1">
          <title>7.3.1. The lower bound of dynamic range</title>
          <p>Event cameras can sense strong light intensity changes, but are not sensitive enough to weak changes
(around 0.1 lux). In extremely dim scenes, minor changes in lighting can generate large numbers of events which,
in reality, are all noise. The real events are drowned in noise, and this phenomenon is very severe in
low-light scenes. The latest event camera, the Prophesee EVK4, can perceive a minimum light level of 0.08 lux
and has enhanced low-light capability, but the noise problem still cannot be solved. This brings great
challenges to the application of event cameras in indoor scenes that are often dimly lit. In 2021, DARPA
announced that it had begun research on event cameras in the infrared band to enhance the ability of
event-driven sensors to work in low-light conditions, but this still remains on paper.</p>
        </sec>
        <sec id="sec-7-3-2">
          <title>7.3.2. The noise from event stream</title>
          <p>Existing neuromorphic vision sensors suffer from three main types of output noise: background activity
(BA) noise, hot-pixel noise and flicker noise. In a static scene, most noise can be easily removed by
judging the temporal correlation and the flicker frequency within a sliding time window. However, when the
sensor performs complex motion, it is very difficult to remove hot-pixel noise and flicker noise. A
reflective plane is represented as a region, not a sparse point. Events generated by ambient light and reflections in
windows persist over longer time spans due to camera motion, much like events generated by dynamic objects
in the scene. Meanwhile, events triggered by static objects without the flickering effect under the
motion of the camera are temporally consistent. Using methods such as TS, it is easy to distinguish static
objects from dynamic content containing flicker noise, but it is difficult to distinguish flicker noise from real dynamic
objects. By accurately estimating the camera trajectory, optical flow estimation and pixel area matching,
the flicker noise from reflective objects can be judged to a certain extent. But this is still difficult in
practice, because it is hard to effectively extract and track their features. We are likely to regard a mirror
as a dynamic object and ignore it when building a map, which can easily cause collisions.</p>
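          <p>A minimal sketch of the temporal-correlation filtering idea mentioned above (our own illustrative heuristic for background-activity noise, not a specific published filter; the sensor size, neighbourhood radius and time constant are assumptions):</p>
          <preformat><![CDATA[
import numpy as np

def background_activity_filter(events, shape=(480, 640), radius=1, dt_max=5000):
    """Keep an event only if some pixel in its (2*radius+1)^2 neighbourhood fired
    within the last dt_max microseconds; isolated events are treated as BA noise.
    events: list of (x, y, p, t) tuples sorted by timestamp t (in microseconds)."""
    H, W = shape
    last_ts = np.full((H, W), -np.inf)
    kept = []
    for x, y, p, t in events:
        x, y = int(x), int(y)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        if (t - last_ts[y0:y1, x0:x1]).min() <= dt_max:   # a recent neighbour supports this event
            kept.append((x, y, p, t))
        last_ts[y, x] = t                                  # record the event either way
    return kept
      ]]></preformat>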
        </sec>
        <sec id="sec-7-3-3">
          <title>7.3.3. The configuration scheme of sensors</title>
          <p>The hardware configurations of existing event-based systems for navigation include a single event camera,
binocular event cameras, an event camera combined with another visual sensor, and multi-source integration
with an IMU. The above schemes have proven to be feasible. From a complementary perspective, the scheme
combining an event camera and a standard camera can take into account both high-speed and
low-speed scenes. On the one hand, in high-speed motion, the event camera as the main sensor provides an
event stream without motion blur. On the other hand, the standard camera as the main sensor provides fine
scene texture characteristics. This configuration compensates for the lack of information
when the event camera stays at low speed or stationary and for motion blur when the standard camera moves at
high speed. Judging from Section.4, a single event camera can complete feature extraction, tracking
and depth estimation. Although the addition of an IMU has proven to improve the robustness of the system,
we believe that a single event camera can be competent without an IMU. The task of indoor positioning
and navigation is actually very complicated, with many parallel pipelines. If detection, identification and
control (such as obstacle avoidance) are considered part of the navigation task, then the system
requires the addition of standard cameras and an IMU to meet the needs of diverse tasks.</p>
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Necessity of specific event-based hardware for indoor navigation and positioning</title>
        <sec id="sec-7-4-1">
          <title>7.4.1. Necessity of specific event camera for positioning and navigation</title>
          <p>In indoor tasks, sensors with low resolution are sufficient to obtain fine scene information because of the
limited depth of field. A reduction of sensor resolution results in a reduction of data volume and
accordingly of the load on the back-end data processing system, as well as improved computing
speed. Furthermore, the noise of sensors is much higher in a dim environment (more than half of the
data being noise) than in a bright scene. The reduction in sensor resolution also reduces the amount of
noise. For complex indoor environments, this improvement can greatly enhance the responsiveness and
maneuverability of the system.</p>
          <p>Existing sensors output raw event streams without any denoising, which leads to a high event rate in
complex scenes. The back-end system then has to perform denoising first with complex algorithms or dedicated
hardware, which lowers efficiency. At the same time, because the sensor and the computing hardware are
separate, the data transfer process has to be repeated, and this unnecessary step adds considerable latency.
Therefore, an ideal event-based sensor for positioning and navigation should have chip-level or sensor-level
denoising capabilities and output high-quality data at the sensor level, which can significantly reduce the event
rate while maintaining the sparsity of the event stream. To raise the intelligence level of the sensor further,
the output data should also undergo a degree of preprocessing so that the sensor emits features that can be
used directly for tracking, namely events that have already undergone feature extraction. After this
preprocessing, the data output by the sensor can be used directly by the back-end, with the significant
advantages of high efficiency, sparsity and low power consumption.</p>
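          <p>The following is a minimal sketch of the kind of sensor-level denoising argued for above, modelled on the
widely used background-activity idea: an event is kept only if a pixel in its neighbourhood fired recently,
otherwise it is treated as isolated noise. The 5 ms support window, the array shapes and the toy stream are
assumptions for illustration.</p>
          <preformat><![CDATA[
# Minimal sketch of a background-activity-style denoising filter that could,
# in principle, run at chip or sensor level. Parameters are assumptions.
import numpy as np

def background_activity_filter(events, width, height, support_window_s=5e-3):
    """events: iterable of (x, y, t, p), t in seconds, sorted by t.
    Returns the list of events that have recent spatial support."""
    last_activity = np.full((height, width), -np.inf)  # last event time per pixel
    kept = []
    for x, y, t, p in events:
        x0, x1 = max(0, x - 1), min(width, x + 2)
        y0, y1 = max(0, y - 1), min(height, y + 2)
        # Support = some pixel in the 3x3 neighbourhood fired within the window.
        if (t - last_activity[y0:y1, x0:x1]).min() <= support_window_s:
            kept.append((x, y, t, p))
        last_activity[y, x] = t
    return kept

if __name__ == "__main__":
    # Two correlated events on adjacent pixels plus one isolated noise event;
    # only the supported second event of the pair survives the filter.
    stream = [(5, 5, 0.000, 1), (6, 5, 0.001, 1), (40, 40, 0.002, -1)]
    print(background_activity_filter(stream, width=64, height=64))
]]></preformat>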
        </sec>
        <sec id="sec-7-4-2">
          <title>7.4.2. Necessity of specific neuromorphic processing hardware for positioning and navigation</title>
          <p>Signals triggered by event-based sensors are naturally suited to an event-based processing system,
namely SNN computing hardware. Existing computing hardware is generally a CPU, FPGA or GPU, none of
which is designed for events. To use such hardware to handle events, the events must first be converted into
data formats the hardware can consume, but these transformations usually sacrifice the sparse and asynchronous
properties of the events themselves. As a result, only conventional neural-network methods can be adopted,
together with their usually heavy computation, and such methods do not actually exploit the sensing advantages
of the event-driven approach. Consequently, there are still gaps in the performance indicators of practical
applications compared with ordinary visual sensors. So far, no broadly accepted and feasible SNN training
mechanism has emerged; such a mechanism would facilitate the deployment and implementation of SNNs on
hardware and allow the advantages of neuromorphic perception and computation to be truly exploited in visual
navigation and positioning. Ultimately, the neuromorphic sensor and the neuromorphic computing hardware
can be combined into a neuromorphic visual navigation and positioning system that is truly high-speed,
high-dynamic-range and low-power.</p>
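          <p>To make the loss of sparsity concrete, the toy sketch below (our own illustration with assumed shapes
and an assumed toy stream) performs the conversion the paragraph describes, accumulating an asynchronous
event stream into a dense frame so that conventional CPU/GPU pipelines can consume it; the per-event
timestamps and the sparse layout disappear in the process.</p>
          <preformat><![CDATA[
# Minimal sketch: densifying events into a frame for CPU/GPU consumption,
# which is exactly where the sparse, asynchronous structure is discarded.
import numpy as np

def events_to_frame(events, width, height):
    """Accumulate signed event polarities into a dense (height, width) frame."""
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, _t, p in events:
        frame[y, x] += p          # timestamps are collapsed away here
    return frame

if __name__ == "__main__":
    stream = [(3, 2, 0.0001, +1), (3, 2, 0.0004, +1), (10, 7, 0.0002, -1)]
    frame = events_to_frame(stream, width=16, height=16)
    print(f"{len(stream)} events -> {frame.size}-cell dense frame, "
          f"{np.count_nonzero(frame)} non-zero cells")
]]></preformat>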
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>Event cameras are a representative achievement of neuromorphic vision, boasting high temporal
resolution, high dynamic range and low latency compared with standard cameras. Their emergence makes
possible applications that traditional cameras cannot handle, bringing a revolution to visual applications,
especially to vision-based navigation and positioning, which is full of challenges and difficulties. In this paper,
we briefly introduce the principle of event cameras. We then overview the research on event-based vision in
navigation and positioning, including ego-motion estimation, event-based tracking, event-based mapping and
datasets for evaluation and analysis. Great challenges remain in existing event-based navigation and positioning
research, but challenges are opportunities. We analyze the advantages of event-based solutions and the possible
improvements and research directions, and make suggestions for neuromorphic hardware specialized for
navigation. Finally, we put forward prospects. We hope that this paper can give researchers inspiration, so that
neuromorphic vision can play a greater role in indoor navigation and positioning and achieve intelligent
perception and computation in complex conditions.</p>
    </sec>
  </body>
</article>