<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Odometry with Depth-Wise Separable Convolution and Quaternion Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <email>demagistris@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Comminiello</string-name>
          <email>danilo.comminiello@uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <email>cnapoli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janusz T. Starczewski</string-name>
          <email>janusz.starczewski@pcz.pl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control, and Management Engineering Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, 00185 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, Electronics and Telecommunications Sapienza University of Rome</institution>
          ,
          <addr-line>Via Eudossiana 18, 00184 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Intelligent Computer Systems Czestochowa University of Technology</institution>
          ,
          <addr-line>al. Armii Krajowej 36, 42-200 Czestochowa</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Monocular visual odometry is a fundamental problem in computer vision and has been extensively studied in the literature. The vast majority of visual odometry algorithms are based on a standard pipeline consisting of feature detection, feature matching, motion estimation and local optimization. Only recently have deep learning approaches shown cutting-edge performance, replacing the standard pipeline with an end-to-end solution. One of the main advantages of deep learning approaches over standard methods is the reduced inference time, which is an important requirement for applying visual odometry in real time. Less emphasis, however, has been placed on memory requirements and training efficiency. The memory footprint, in particular, is important for real-world applications such as robot navigation or autonomous driving, where devices have limited memory resources. In this paper we tackle both aspects by introducing novel architectures based on Depth-Wise Separable Convolutional Neural Networks and deep Quaternion Recurrent Convolutional Neural Networks. In particular, we obtain accuracy equal to or better than that of other state-of-the-art methods on the KITTI VO dataset, with a reduction in the number of parameters and a speed-up in inference time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Monocular Visual Odometry consists in estimating the trajectory of an agent from a sequence
of images acquired at consecutive time instants by a single camera mounted on the agent.
Most visual odometry (VO) systems follow a standard pipeline based entirely on geometry.
Some implementations achieve excellent results, and a number of them can also run
in real time [<xref ref-type="bibr" rid="ref1">1</xref>, 2]; however, there is always a trade-off between real-time performance and
consistency due to the final local optimization procedure. Moreover, in the case of monocular
VO, the global scale is unknown and must be computed from other measurements. Recently,
the first end-to-end deep learning pipeline for monocular VO (DeepVO) was introduced in [3].
The authors used an RCNN [4] to automatically extract and match features in consecutive
frames. In particular, DeepVO is trained to predict the position and orientation (and also the
global scale) directly from a sequence of RGB images. With this work we extend the aforementioned
model by introducing optimizations aimed at reducing the number of parameters and
increasing the inference speed. We achieve this with two important changes. First, we replace
the standard real-valued convolution with convolution in the quaternion domain, which
reduces the number of parameters without harming the expressive power. Second, we introduce
Depth-Wise Separable Convolution (DC), both in the real and in the quaternion domain. DC
separates the spatial correlation from the channel correlation, drastically reducing
the number of parameters and the number of operations. The rest of the paper is structured as
follows: section 2 introduces the problem of monocular visual odometry along with related
works; section 3 provides the mathematical background and describes in detail the quaternion
convolution (section 3.1) and the depth-wise separable convolution (section 3.2); section 4
presents the dataset; section 5 illustrates the proposed method with the implementation details;
section 6 discusses the results; and the conclusion is drawn in section 7.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Visual Odometry has a long history in the computer vision community [5, 6, 7, 8, 9]. However,
thanks to the precise and clean geometric formulation of the problem, the vast majority of
state-of-the-art monocular VO systems are based on the following standard pipeline [10, 11]:
Image Sequence: the input of the pipeline is an ordered sequence of images collected by the
camera at consecutive time instants. In order to match features, it is important that two consecutive
images have a sufficient scene overlap.</p>
      <p>Feature Detection: in this step salient keypoints are extracted from each image, where keypoints
are patterns that differ from their neighbours and can be easily identified under a different
pose and orientation. Common algorithms such as SIFT [12] or FAST [13] apply a
feature-response function over the entire image and detect features as local maxima.
Feature Description: each feature is converted into a compact representation that
can be matched against other descriptors.</p>
      <p>Feature Matching: the feature descriptors in consecutive images are matched according to a
similarity measure.</p>
      <p>Motion Estimation: the motion between consecutive frames is computed using the
correspondences between the feature descriptors in the two images.</p>
      <p>Bundle Adjustment: in this step the result is refined by optimizing the reprojection error over
the entire sequence of images. This is the most costly operation. A minimal code sketch of the
geometric part of this pipeline is given below.</p>
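      <p>As a concrete illustration of the pipeline above, the following is a minimal sketch using
OpenCV (the ORB detector, brute-force matcher and RANSAC threshold are illustrative choices of
ours, not those of any specific system cited here; bundle adjustment is omitted):</p>
      <preformat>
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate the relative camera motion between two consecutive frames."""
    orb = cv2.ORB_create(nfeatures=2000)           # feature detection
    kp1, des1 = orb.detectAndCompute(img1, None)   # + feature description
    kp2, des2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)            # feature matching

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Motion estimation from 2D-2D correspondences, with RANSAC for outliers.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # in monocular VO, t is known only up to scale
      </preformat>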
      <p>Even though applications based on this pipeline achieve excellent results, there is always a
trade-off between performance and consistency, and the right solution must be chosen carefully,
considering both the navigation environment and the requirements. Moreover, monocular VO
approaches based on geometry are not able to recover the global scale, which must be recovered
through external measurements. On the other hand, deep learning has rarely been used
to tackle VO problems, and some existing approaches require a pre-processed optical flow
as input [14]. DeepVO, introduced in [3], was the first deep learning model to estimate the
6-DoF poses directly from a sequence of RGB images. The model is composed of a
convolutional neural network (CNN) to extract local descriptors and a recurrent neural network
(RNN) to match the extracted features in consecutive frames. DeepVO will be used as a baseline
model and its performance will be compared to that of the same model with the addition of
quaternion convolution, depth-wise separable convolution, and both. Quaternion convolutional
networks (QCNN) for image processing were introduced in [15] in order to give colored images
a meaningful representation through the quaternion algebra. In particular, a pixel is represented
as a single quaternion rather than a vector in ℝ³. In this way it is possible to interpret pixel
multiplications (with the Hamilton product) as rotations in the color space. The effect of this
representation is investigated in [16], where a QCNN-based autoencoder, trained only on gray-scale
images, is able to perfectly reproduce colored images at test time. Depth-wise separable
convolution (DC) first appeared in [17] and was popularized by its extensive application in
the well-known Xception [18] and MobileNets [19] architectures. DC drastically reduces the
number of parameters and operations without harming the accuracy. More details are given
in section 3.2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>This section describes the mathematical foundation of the proposed approach. In particular, the
first part introduces quaternions and quaternion convolution, while the second part covers
the depth-wise separable convolution.</p>
      <sec id="sec-3-1">
        <title>3.1. Quaternions</title>
        <p>The quaternion algebra ℍ was introduced by Hamilton in 1843 as an extension of the complex
algebra. A quaternion has a real component and three imaginary components $i$, $j$, $k$, with the
property that $i^2 = j^2 = k^2 = ijk = -1$. Given two quaternions $q_1 = a_1 + b_1 i + c_1 j + d_1 k$
and $q_2 = a_2 + b_2 i + c_2 j + d_2 k$, they can be summed, multiplied by a scalar $\lambda$ and
multiplied by each other according to the following formulae:</p>
        <p>$q_1 + q_2 = (a_1 + a_2) + (b_1 + b_2) i + (c_1 + c_2) j + (d_1 + d_2) k$ (1)</p>
        <p>$\lambda q_1 = \lambda a_1 + \lambda b_1 i + \lambda c_1 j + \lambda d_1 k$ (2)</p>
        <p>$q_1 \otimes q_2 = (a_1 a_2 - b_1 b_2 - c_1 c_2 - d_1 d_2) + (a_1 b_2 + b_1 a_2 + c_1 d_2 - d_1 c_2) i + (a_1 c_2 - b_1 d_2 + c_1 a_2 + d_1 b_2) j + (a_1 d_2 + b_1 c_2 - c_1 b_2 + d_1 a_2) k$ (3)</p>
        <p>where $\otimes$ is the Hamilton product, which is the core of the Quaternion Convolution (QC). In
standard convolution (SC) each pixel is represented as a three-channel (RGB) feature vector,
while in QC a pixel is a single quaternion whose imaginary parts are its RGB components
and whose real part is the gray-scale intensity.</p>
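        <p>As a concrete illustration, the following is a minimal sketch (in plain NumPy, an
implementation choice of ours rather than part of the original systems) of the Hamilton
product in equation (3):</p>
        <preformat>
import numpy as np

# A quaternion is stored as an array [a, b, c, d] = a + b*i + c*j + d*k.
def hamilton_product(q1, q2):
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.array([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,   # real part
        a1*b2 + b1*a2 + c1*d2 - d1*c2,   # i component
        a1*c2 - b1*d2 + c1*a2 + d1*b2,   # j component
        a1*d2 + b1*c2 - c1*b2 + d1*a2,   # k component
    ])

# Sanity checks: i*j = k and i*i = -1.
i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
assert np.allclose(hamilton_product(i, j), [0., 0., 0., 1.])
assert np.allclose(hamilton_product(i, i), [-1., 0., 0., 0.])
        </preformat>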
        <p>A pixel at position $(x, y)$ is therefore the full quaternion
$q(x, y) = g(x, y) + R(x, y) i + G(x, y) j + B(x, y) k$, where $g$ is the gray-scale intensity
and $R$, $G$, $B$ are the color channels. Let $Q \in \mathbb{H}^{N \times M}$ be the image and
$W \in \mathbb{H}^{K \times K}$ the filter, both in the quaternion domain; then the QC can be defined as:</p>
        <p>$(W \circledast Q)[x, y] = \sum_{k=1}^{K} \sum_{k'=1}^{K} W_{k, k'} \otimes Q_{x+k, y+k'}$</p>
        <p>The peculiarity of this operation is that the information about the color space is preserved,
whereas in standard convolution the contributions from the RGB channels are summed. In
QC the color space is modeled in the quaternion domain, and each pixel, each weight of the
network and each element of the intermediate feature maps is represented in this domain. By
replacing SC with QC, the number of parameters and operations increases by a factor of 4.
However, it was shown in [15] that quaternion convolutional networks perform well even if the
number of kernels in each layer is reduced to match the parameter count of the real-valued
convolution, i.e., input and output channels are divided by $\sqrt{4} = 2$. Quaternion convolution
acts as a regularizer and reduces the degrees of freedom of the trainable parameters, as explained in
[16]. In order to exploit this property, we further reduced the number of parameters by dividing
both input and output channels by 4 (instead of 2). Thanks to this optimization, the number of
parameters of the convolutional network drops from 1.3M to 416K, as shown in table 1.</p>
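        <p>The following sketch (again NumPy, unoptimized, and an illustration of ours rather than
the production implementation) makes the definition above concrete; it reuses
hamilton_product from the previous snippet:</p>
        <preformat>
import numpy as np

# image: (H, W, 4) array, kernel: (K, K, 4) array; the last axis holds the
# [real, i, j, k] parts of one quaternion per pixel / per weight.
def quaternion_conv(image, kernel):
    H, W, _ = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1, 4))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for k in range(K):
                for kp in range(K):
                    # The Hamilton product keeps the color-space structure
                    # intact, instead of summing RGB contributions as SC does.
                    out[x, y] += hamilton_product(kernel[k, kp],
                                                  image[x + k, y + kp])
    return out
        </preformat>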
        <sec id="sec-3-1-1">
          <title>Model Name</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>DeepVO</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>QDeepVO</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>DeepVO DSC</title>
        </sec>
        <sec id="sec-3-1-5">
          <title>QDeepVO DSC</title>
        </sec>
        <sec id="sec-3-1-6">
          <title>CNN Parameters 14.6M 3.7M 1.6M</title>
          <p>416K</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Depth-Wise Separable Convolution</title>
        <p>Depth-Wise Separable Convolution (DC) splits the correlation of the spatial features and the
correlation of the feature channels into two separate steps, consequently reducing the number of
parameters. In standard convolution, each output channel is the result of the sum of the activations
of $N$ kernels, where $N$ is the number of input channels. Let $\mathrm{Conv}[x][j]$ be channel $j$ of
the convolution at position $x$, where $x \in \mathbb{Z}^2$. The standard convolution equation can be
written as:</p>
        <p>$\mathrm{Conv}[x][j] = \sum_{i=1}^{N} (K_i^j * F_i)[x] = \sum_{i=1}^{N} \sum_{k} K_i^j[k] \, F_i[x + k]$ (4)</p>
        <p>Hence, a convolution of a feature map of size $D \times D$ with $N$ input channels and $M$ filters
(output channels) of size $K \times K$ requires $N M K^2$ parameters and $N M K^2 D^2$ operations. In
DC, first the spatial convolution is computed independently for each input channel:</p>
        <p>$G_i[x] = (K_i * F_i)[x] = \sum_{k} K_i[k] \, F_i[x + k]$ (5)</p>
        <p>Of course, the number of output channels equals the number of input channels. This computation
requires $N K^2$ parameters and $N K^2 D^2$ operations. Then a $1 \times 1$ convolution correlates the
channels:</p>
        <p>$\mathrm{Conv}'[x][j] = \sum_{i=1}^{N} (G_i * K^j_{1 \times 1})[x]$ (6)</p>
        <p>This computation requires $N M$ parameters and $N M D^2$ operations. Hence, in DC the total
number of parameters is $N M + N K^2$ and the number of operations is $N K^2 D^2 + N M D^2$.
With simple algebraic manipulations, it can be shown that the reduction factor in both the number
of parameters and the number of operations is $M K^2 / (M + K^2)$.</p>
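        <p>The following PyTorch sketch (our illustration; the channel sizes are arbitrary) shows the
two-step factorization and verifies the $M K^2 / (M + K^2)$ reduction factor numerically:</p>
        <preformat>
import torch.nn as nn

N, M, K = 64, 128, 3  # input channels, output channels, kernel size

# Standard convolution: N*M*K^2 weights.
standard = nn.Conv2d(N, M, kernel_size=K, padding=1)

# Depth-wise separable convolution: groups=N runs one K x K filter per input
# channel (N*K^2 weights), then a 1x1 convolution mixes channels (N*M weights).
separable = nn.Sequential(
    nn.Conv2d(N, N, kernel_size=K, padding=1, groups=N),
    nn.Conv2d(N, M, kernel_size=1),
)

def n_weights(module):
    # Count only convolution weights, ignoring bias vectors.
    return sum(p.numel() for p in module.parameters() if p.dim() > 1)

# M*K^2 / (M + K^2) = 128*9 / (128 + 9), i.e. roughly an 8.4x reduction.
print(n_weights(standard) / n_weights(separable))
        </preformat>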
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
    </sec>
    <sec id="sec-5">
      <title>5. Method</title>
      <p>For training and testing our models we used the famous KITTI dataset [20] and in particular
the KITTI VO/SLAM benchmark, containing 22 sequences of RGB images, where the first 11
have the ground truth pose matrix associated to each image in the sequence. Figure 4 shows
two consecutive images in a sequence while figure 4 shows a ground truth trajectory computed
from the ground truth file.</p>
      <p>The three architectures presented in this section are derived from the DeepVO network. The
network is composed of 9 convolutional blocks, each with a 2D convolution, ReLU activation,
Batch Normalization and Dropout. After the convolutional blocks there are two stacked LSTM
layers that receive as input the sequence of feature maps extracted by the convolutional layers
(more details are reported in table 3). A schematic sketch of this backbone follows.</p>
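      <p>The sketch below (PyTorch, our illustration; the channel sizes, strides and global pooling
are indicative simplifications, and table 3 remains the authoritative specification) shows the
overall structure of the recurrent-convolutional backbone:</p>
      <preformat>
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(c_out),
        nn.Dropout(0.2),
    )

class DeepVOLike(nn.Module):
    def __init__(self, hidden=1000):
        super().__init__()
        channels = [6, 64, 128, 256, 256, 512, 512, 512, 512, 1024]
        strides = [2, 2, 2, 1, 2, 1, 2, 1, 2]
        # 9 convolutional blocks; the 6-channel input is a pair of stacked
        # consecutive RGB frames.
        self.cnn = nn.Sequential(*[
            conv_block(channels[i], channels[i + 1], strides[i])
            for i in range(9)
        ])
        # Two stacked LSTM layers match features across consecutive frames.
        self.rnn = nn.LSTM(input_size=1024, hidden_size=hidden,
                           num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)  # 3 translations + 3 Euler angles

    def forward(self, seq):  # seq: (batch, time, 6, height, width)
        b, t = seq.shape[:2]
        feats = self.cnn(seq.flatten(0, 1)).mean(dim=(2, 3))  # global pooling
        out, _ = self.rnn(feats.view(b, t, -1))
        return self.head(out)  # (batch, time, 6) poses
      </preformat>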
      <p>The input of the network is a sequence of raw RGB images and the output consists of the 3
components of the translation vector, expressed in meters, and the 3 Euler angles, expressed in
degrees, for each image in the sequence. The three variants of DeepVO introduced in this paper
are:
Quaternion DeepVO: with Quaternion convolution (as described in section 3.1) instead of
standard convolution.</p>
      <p>DeepVO DSC: with Depth-Wise Separable convolution instead of standard convolution.
Quaternion DeepVO DSC: with both Quaternion convolution and Depth-Wise Separable
convolution. The three networks have the same type and number of layers, but the implementation
of the convolution algorithm changes, as explained in section 3. The advantage is a considerable
reduction in the number of parameters: with Quaternion DeepVO, DeepVO DSC and Quaternion
DeepVO DSC we obtained a reduction of 10.9M, 13M and 14.2M parameters, respectively
(these results are summarized in table 2). For the quaternion implementation we opted for a
full quaternion representation, where the real part represents the gray-scale image and the
imaginary parts the RGB components, as opposed to a pure quaternion representation, in which
the real part is set to zero.</p>
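      <p>For completeness, the following is a hedged sketch of a pose-regression loss for this 6-DoF
output (the orientation weight kappa balances meters against degrees; the value 100 comes from
the original DeepVO paper [3] and is an assumption here, not a reported hyperparameter of our
models):</p>
      <preformat>
import torch.nn.functional as F

def pose_loss(pred, gt, kappa=100.0):
    # pred, gt: (batch, time, 6) = [tx, ty, tz, roll, pitch, yaw]
    t_loss = F.mse_loss(pred[..., :3], gt[..., :3])  # translation, meters
    r_loss = F.mse_loss(pred[..., 3:], gt[..., 3:])  # Euler angles, degrees
    return t_loss + kappa * r_loss
      </preformat>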
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>The three networks were trained for 200 epochs with learning rate $10^{-3}$ on 5 of the 11 labelled
sequences (in particular on sequences 00, 01, 02, 08, 09), while the remaining 6 labelled sequences
were used for testing. The average translational RMSE drift, in percentage, on lengths of 100 m
to 800 m, and the average rotational RMSE drift, expressed in deg/100 m, on lengths of 100 m to
800 m, are reported in table 2. The learning curves of the models are reported in figure 3, while
tables 4 and 5 show the true and predicted trajectories on the test sequences. The results show
that the best performance is obtained by our Quaternion DeepVO model, while the metrics of our
Quaternion DeepVO DSC model are comparable to those of the baseline, but with a significant
reduction in the number of parameters and operations.</p>
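      <p>For reference, the drift metrics follow the standard KITTI evaluation protocol; the sketch
below (our simplified illustration for the translational part, not the official KITTI devkit, which
remains the authoritative implementation) conveys the idea:</p>
      <preformat>
import numpy as np

def trajectory_lengths(positions):
    # Cumulative distance travelled along the ground-truth trajectory.
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(steps)])

def translational_drift(gt_pos, pred_pos, seg_len=100.0):
    # Average end-point error over all sub-trajectories of length seg_len,
    # expressed as a percentage of the distance travelled.
    dist = trajectory_lengths(gt_pos)
    errors = []
    for i in range(len(gt_pos)):
        j = np.searchsorted(dist, dist[i] + seg_len)
        if j >= len(gt_pos):
            break
        gt_delta = gt_pos[j] - gt_pos[i]
        pred_delta = pred_pos[j] - pred_pos[i]
        errors.append(np.linalg.norm(pred_delta - gt_delta) / seg_len)
    return 100.0 * np.mean(errors)
      </preformat>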
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper we presented a novel end-to-end deep learning approach for monocular visual
odometry. The proposed method is based on DeepVO, a state-of-the-art deep learning architecture
for VO, and introduces quaternion convolution and depth-wise separable convolution. We
obtained the best result in the quaternion domain, with 10.9M fewer parameters, and a comparable
result with the depth-wise separable convolution in the quaternion domain, with 14.2M fewer
parameters. However, the changes we introduced apply only to the convolutional part. We leave
as future work the investigation of other optimization strategies for the recurrent part; in
particular, considering the success of the quaternion convolution, it would be of interest to move
the LSTM part to the quaternion domain as well.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The initial work was carried out with the help of the students Alessandro Lambertini and Denise
Landini, and was supported by the Hermes-WIRED project within the Large Research Projects
grant framework 2020 funded by Sapienza University of Rome.</p>
      <sec id="sec-8-1">
        <title>DeepVO</title>
      </sec>
      <sec id="sec-8-2">
        <title>LSTM</title>
        <p>1000</p>
        <p>and automation, Ieee, 2009, pp. 4293–4299.
[2] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, F. Moreno-Noguer, Pl-slam: Real-time
monocular visual slam with points and lines, in: 2017 IEEE international conference on
robotics and automation (ICRA), IEEE, 2017, pp. 4503–4508.
[3] S. Wang, R. Clark, H. Wen, N. Trigoni, Deepvo: Towards end-to-end visual odometry with
deep recurrent convolutional neural networks, in: 2017 IEEE international conference on
robotics and automation (ICRA), IEEE, 2017, pp. 2043–2050.
[4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko,
T. Darrell, Long-term recurrent convolutional networks for visual recognition and
description, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
2015, pp. 2625–2634.
[5] M. Srinivasan, S. Zhang, M. Lehrer, T. Collett, Honeybee navigation en route to the goal:
visual flight control and odometry, The Journal of experimental biology 199 (1996) 237–244.
[6] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli,
Visionbased holistic scene understanding for context-aware human-robot interaction,
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics) 13196 LNAI (2022) 310 – 325. doi:10.1007/
978-3-031-08421-8_21.
[7] C. F. Olson, L. H. Matthies, M. Schoppers, M. W. Maimone, Stereo ego-motion
improvements for robust rover navigation, in: Proceedings 2001 ICRA. IEEE International
Conference on Robotics and Automation (Cat. No. 01CH37164), volume 2, IEEE, 2001, pp.
1099–1104.
[8] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral
analysis: A solution to user clustering using recency-frequency-monetary and vehicle
relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.
3390/info13110511.
[9] D. Nistér, O. Naroditsky, J. Bergen, Visual odometry, in: Proceedings of the 2004 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR
2004., volume 1, Ieee, 2004, pp. I–I.
[10] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind
theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021,
p. 71 – 76.
[11] F. Fraundorfer, D. Scaramuzza, Visual odometry: Part i: The first 30 years and fundamentals,</p>
        <p>IEEE Robotics and Automation Magazine 18 (2011) 80–92.
[12] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International
journal of computer vision 60 (2004) 91–110.
[13] E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: European
conference on computer vision, Springer, 2006, pp. 430–443.
[14] G. Costante, M. Mancini, P. Valigi, T. A. Ciarfuglia, Exploring representation learning with
cnns for frame-to-frame ego-motion estimation, IEEE robotics and automation letters 1
(2015) 18–25.
[15] X. Zhu, Y. Xu, H. Xu, C. Chen, Quaternion convolutional neural networks, in: Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp. 631–647.
[16] T. Parcollet, M. Morchid, G. Linarès, Quaternion convolutional neural networks for
heterogeneous image processing, in: ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 8514–8518.
[17] L. Sifre, S. Mallat, Rigid-motion scattering for texture classification, arXiv preprint
arXiv:1403.1687 (2014).
[18] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings
of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto,
H. Adam, Mobilenets: Eficient convolutional neural networks for mobile vision
applications, arXiv preprint arXiv:1704.04861 (2017).
[20] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The kitti dataset,
International Journal of Robotics Research (IJRR) (2013).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Scaramuzza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fraundorfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siegwart</surname>
          </string-name>
          ,
          <article-title>Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC</article-title>
          , in: 2009 IEEE International Conference on Robotics and Automation, IEEE, 2009, pp. 4293–4299.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>