<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Self-localization Estimation in Indoor Environments Using Cameras for Small Drones</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Genki Higashiuchi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoyasu Shimada</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroki Nishikawa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangbo Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroyuki Tomiyama</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Intelligent Robotics, Faculty of Information Engineering, Toyama Prefectural University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Graduate School of Science and Engineering, Ritsumeikan University</institution>
        </aff>
      </contrib-group>
      <fpage>93</fpage>
      <lpage>103</lpage>
      <abstract>
        <p>Recently, the demand for drones has increased dramatically. Drones are employed in a variety of fields, e.g., border security, search and rescue, surveying, and recreational activities, and have become indispensable tools in each of them. Self-localization is an essential function that enables UAVs to fly autonomously. While the Global Positioning System (GPS) is effective outdoors, it does not function reliably indoors because of signal blocking and irregular reflection. Small drones are valued for their agility in narrow spaces; however, weight and cost constraints limit the sensors they can carry. In this paper, we investigate self-localization for small drones using monocular camera images. Specifically, we employ Visual Odometry [1] to estimate the drone's position. By varying the number of image frames used per estimation, we quantitatively evaluate how this parameter affects self-localization accuracy. For the three pre-defined paths, the average error when using 25 images per estimation was 89.17 cm.</p>
      </abstract>
      <kwd-group>
        <kwd>Drone</kwd>
        <kwd>Tello</kwd>
        <kwd>Self-localization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There has been a surge in recent research activity on drones. Owing to their increased affordability
and convenience, drones are expected to find applications in many fields, e.g., border security,
search and rescue, surveying, and recreational activities [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Self-localization is essential for autonomous drone flight. GPS is the standard,
high-precision self-localization method in outdoor environments. However, GPS often fails indoors
because of signal blocking and irregular reflection. GPS alone is therefore insufficient for small
drones, which are well suited to indoor environments. At the same time, small drones can carry only
a limited set of sensors. To tackle this issue, we use a monocular camera for self-localization of
small drones, applying visual odometry to estimate the relative position of a moving drone from
images. To improve self-localization accuracy, we vary the number of image frames used per
estimation and analyze the resulting changes in performance. In the experiments, we use the Tello
drone shown in Figure 1. The Tello is an affordable, programmable drone suitable for educational
purposes. It weighs 80 grams, measures 98×92.5×41 mm, and carries a monocular camera with a field
of view of 82.6 degrees. The camera captures 720p video at 30 fps. The Tello has a flight time of
approximately 13 minutes.
      </p>
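      <p>The camera specification above is enough for a rough pinhole approximation of the intrinsic parameters that Visual Odometry needs. The sketch below is illustrative only: it assumes that 720p means 1280×720 pixels, square pixels, no lens distortion, and a principal point at the image center. Because it is not stated here whether the 82.6-degree field of view is horizontal or diagonal, it computes both cases. The function names are hypothetical, and in practice calibrated intrinsics (for example from a checkerboard calibration) would replace these estimates.</p>
      <preformat preformat-type="python">
```python
import math


def focal_length_px(fov_deg: float, extent_px: float) -> float:
    """Pinhole focal length in pixels: half the image extent spans half the FOV."""
    return (extent_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def intrinsic_matrix(width: int, height: int, fov_deg: float, diagonal: bool = False):
    """Approximate 3x3 intrinsic matrix K as nested lists.

    Assumes square pixels, no lens distortion and the principal point at the
    image center. If `diagonal` is True, the FOV is read as a diagonal FOV.
    """
    extent = math.hypot(width, height) if diagonal else width
    f = focal_length_px(fov_deg, extent)
    cx, cy = width / 2.0, height / 2.0
    return [[f, 0.0, cx],
            [0.0, f, cy],
            [0.0, 0.0, 1.0]]


# Hypothetical usage: 1280 x 720 frames with the 82.6 degree FOV quoted above.
K_h = intrinsic_matrix(1280, 720, 82.6)                  # FOV treated as horizontal
K_d = intrinsic_matrix(1280, 720, 82.6, diagonal=True)   # FOV treated as diagonal
print(round(K_h[0][0], 1), round(K_d[0][0], 1))
```
      </preformat>
      <p>The two interpretations differ by roughly 15 percent in focal length, which is why a proper calibration is preferable to datasheet values when metric accuracy matters.</p>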
      <p>The contributions of this paper are as follows:
1. By using monocular camera images, we enable drone position estimation even in
GPS-denied indoor environments. This extends the operational range of drones to
areas where GPS is unavailable.
2. Using only a monocular camera eliminates the need for expensive GPS receivers
and Inertial Measurement Units (IMUs). This yields a cost-effective
self-localization system, making the technology more accessible for various applications.</p>
      <p>The remainder of this paper is organized as follows. Chapter 2 reviews studies related to this
research. Chapter 3 describes self-localization using Visual Odometry. Chapter 4 presents the
experimental results. Chapter 5 concludes this paper and discusses future challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Study</title>
      <p>
        This chapter presents a review of related work on self-localization. Monocular SLAM was
initially tackled using filtering techniques [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ]. In this approach, each frame is processed by a filter that jointly estimates the positions
of map features and the camera’s pose (orientation and location). However, this method suffers
from drawbacks such as computational inefficiency when consecutive frames contain little new
information, and the accumulation of linearization errors. In contrast, keyframe-based approaches
estimate the map using only a subset of selected frames (keyframes). This allows the use of Bundle
Adjustment (BA), an optimization that is more computationally expensive but highly accurate.
Unlike filtering methods, keyframe-based approaches decouple map updates from the frame rate,
which provides more flexibility. Strasdat et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] demonstrated that keyframe-based techniques achieve
superior accuracy compared to filtering approaches at similar computational cost. PTAM,
developed by Klein and Murray [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is arguably one of the best-known keyframe-based
SLAM systems. It introduced the concept of separate parallel threads for camera tracking
and mapping, enabling real-time performance in small-scale augmented reality applications.
More recently, Mur-Artal et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed ORB-SLAM, a real-time, feature-based monocular
SLAM system. It uses ORB features, which combine FAST keypoint detection with rotation-invariant
BRIEF descriptors, to simultaneously estimate the camera’s ego-motion (self-motion) and a map of
the surrounding environment. Engel et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduced Direct Sparse
Odometry (DSO), which estimates the camera’s ego-motion directly from image intensities
(brightness gradients) rather than from extracted features. By jointly optimizing camera poses
and the depths of a sparse set of image points, this approach achieves high accuracy and performs
well in fast-moving and large-scale environments.
Building upon both feature-based and direct methods, Forster et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed Semi-Direct
Visual Odometry (SVO). This technique combines direct image alignment for motion estimation
with feature-based refinement, achieving real-time visual odometry. SVO is characterized
by high computational efficiency and real-time performance, making it suitable for various
indoor and outdoor environments.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Visual Odometry for Self-Localization</title>
      <p>This chapter describes the methodology and experimental setup for self-localization of the
Tello using Visual Odometry.</p>
      <sec id="sec-3-1">
        <title>3.1. Visual Odometry</title>
        <p>This section explains Visual Odometry, a technique that estimates the position and
movement trajectory of a moving robot or drone from image data captured by vision sensors
(typically cameras). It is particularly useful where GPS is unavailable, such as indoors or
between buildings. Self-position estimation is performed in the six steps illustrated in Figure 2:
1. Reading the camera’s intrinsic parameters: Load the intrinsic parameters of the camera
that captured the images.
2. Loading the images to be processed: Load the images captured during the preceding flight.
3. Detection and tracking of keypoints: Detect feature points in specific pairs of image
frames using the ORB (Oriented FAST and Rotated BRIEF) algorithm. ORB is a popular
feature detector and descriptor in computer vision, commonly used for tasks such as SLAM
(Simultaneous Localization and Mapping) and Visual Odometry, where camera pose
estimation is crucial. Within ORB, the BRIEF (Binary Robust Independent Elementary
Features) algorithm computes a binary descriptor for each detected point that numerically
represents its surrounding image patch.
4. Feature matching: Use the FLANN (Fast Library for Approximate Nearest Neighbors)
library to compare and match the detected feature points. FLANN is widely used for
efficient nearest-neighbor searches in computer vision and machine learning applications.
To improve matching accuracy, the distances to the two nearest neighbors are compared,
and a match is kept only when the nearest neighbor is clearly closer than the second-nearest.
5. Calculation of the essential matrix: Calculate the essential matrix from the matched
feature points. This matrix encodes the geometric relationship between the two images,
i.e., the relative rotation and translation of the camera between the two views.
6. Camera pose estimation: Decompose the essential matrix into a rotation matrix and a
translation vector; with a single camera, the translation is recovered only up to an unknown
scale factor. These elements can be combined into a transformation matrix that represents
the camera’s position and orientation in space, and accumulating these transformations over
successive frames yields the self-position and movement trajectory of the robot or drone.</p>
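        <p>The nearest-neighbor comparison in steps 3 and 4 is commonly realized as a ratio test on binary descriptors. The following dependency-free sketch assumes ORB-style binary descriptors stored as Python integers and compared by Hamming distance. A brute-force two-nearest-neighbor search stands in for FLANN, and the 0.75 threshold and 16-bit toy descriptors are illustrative assumptions, not settings from this study.</p>
        <preformat preformat-type="python">
```python
# Hypothetical helpers: brute-force 2-NN matching of binary descriptors with a
# ratio test. FLANN would replace the brute-force search in a real pipeline.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def ratio_test_matches(query: list[int], train: list[int], ratio: float = 0.75):
    """Return (query_index, train_index, distance) for accepted matches.

    A match is kept only if the nearest neighbor is clearly closer than the
    second-nearest one (d1 is below ratio * d2); ambiguous matches are dropped.
    """
    matches = []
    for qi, qd in enumerate(query):
        # Rank every candidate by Hamming distance (ties broken by index).
        ranked = sorted((hamming(qd, td), ti) for ti, td in enumerate(train))
        if len(ranked) >= 2:
            (d1, t1), (d2, _) = ranked[0], ranked[1]
            if ratio * d2 > d1:
                matches.append((qi, t1, d1))
    return matches


# Toy 16-bit "descriptors" (real ORB descriptors have 256 bits).
frame1 = [0b1010101010101010, 0b1111000011110000, 0b0000000011111111, 0b0000000000000000]
frame2 = [0b1010101010101011, 0b0000111100001111, 0b1111000011110001, 0b0000000011111110]
print(ratio_test_matches(frame1, frame2))  # [(0, 0, 1), (1, 2, 1), (2, 3, 1)]
```
        </preformat>
        <p>The last query descriptor is rejected because its two nearest candidates (distances 7 and 8) are almost equally close, which is the ambiguity the ratio test is meant to filter out. Using FLANN changes only how the two nearest neighbors are found, not this acceptance rule.</p>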
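        <p>The essential matrix in step 5 can be made concrete with a small numerical check. Assuming the convention that a 3D point maps between camera frames as X2 = R X1 + t, the essential matrix is E = [t]x R, where [t]x is the cross-product matrix of t, and the normalized image coordinates x1 and x2 of a correctly matched point satisfy the epipolar constraint x2^T E x1 = 0. The sketch below builds E from an assumed pose (not from Tello data) and evaluates the constraint for a synthetic point; the helper names are hypothetical.</p>
        <preformat preformat-type="python">
```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 nested list and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_y(angle_rad):
    """Rotation about the camera's y axis (a yaw for a forward-looking camera)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def essential_from_pose(R, t):
    """E = [t]x R for the convention X2 = R X1 + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, x1, x2):
    """x2^T E x1 for normalized homogeneous image points (zero for a perfect match)."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))


# Assumed relative pose: 10 degree rotation plus a mostly sideways translation.
R = rot_y(math.radians(10.0))
t = [0.3, 0.0, 0.05]
E = essential_from_pose(R, t)

# Synthetic 3D point in camera-1 coordinates, projected into both views.
X1 = [0.5, -0.2, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]
print(epipolar_residual(E, x1, x2))  # very close to zero (floating-point round-off)
```
        </preformat>
        <p>In an actual pipeline the direction is reversed: E is estimated from many matched points (for instance with the five-point algorithm inside RANSAC), and step 6 decomposes it, typically via SVD, into four candidate (R, t) pairs, of which the one that places triangulated points in front of both cameras is kept.</p>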
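        <p>Step 6 yields one relative rotation and translation per image pair, and a trajectory follows from chaining them as homogeneous transformation matrices. The minimal sketch below assumes each estimate maps points from the previous camera frame into the next one (X_next = R X_prev + t), so it is first inverted into a camera motion. Because a monocular camera recovers translation only up to scale, each unit translation is multiplied by a scale factor; a constant value is assumed here purely for illustration, since this section does not describe how metric scale is obtained. All names are hypothetical.</p>
        <preformat preformat-type="python">
```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def make_transform(R, t):
    """4x4 homogeneous transform [[R, t], [0, 0, 0, 1]] as nested lists."""
    return [list(R[0]) + [t[0]], list(R[1]) + [t[1]], list(R[2]) + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a @ b of two 4x4 nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(R, t):
    """Turn the point mapping X_next = R X_prev + t into the camera motion
    (rotation R^T, translation -R^T t) expressed in the previous camera frame."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return Rt, [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]


def chain_trajectory(relative_poses, scale=1.0):
    """Accumulate camera positions from per-pair (R, t) estimates.

    Each t is assumed to be unit length, as monocular VO returns it; `scale`
    is an assumed constant metric length per step.
    """
    pose = make_transform(IDENTITY, [0.0, 0.0, 0.0])
    positions = [(0.0, 0.0, 0.0)]
    for R, t in relative_poses:
        R_c, t_c = invert_rigid(R, [scale * v for v in t])
        pose = compose(pose, make_transform(R_c, t_c))
        positions.append((pose[0][3], pose[1][3], pose[2][3]))
    return positions


# Hypothetical input: two steps forward along the optical axis (z), then one
# step to the right (x), with no rotation, as on a path flown without turns.
forward = (IDENTITY, [0.0, 0.0, -1.0])  # scene points move towards the camera
right = (IDENTITY, [-1.0, 0.0, 0.0])    # scene points shift left in the image
track = chain_trajectory([forward, forward, right], scale=0.2)
print(track[-1])  # approximately (0.2, 0.0, 0.4) in the first camera's frame
```
        </preformat>
        <p>Positions are expressed in the first camera's coordinate frame. Because every pose is built on the previous one, an error in any single step persists in all later poses, so drift accumulates along the path.</p>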
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Self-Localization Using Visual Odometry with the Tello</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Experiment</title>
      <p>This chapter quantitatively assesses the performance of the self-localization algorithm while
varying the number of images used in a single estimation, in order to determine how the image
count affects accuracy.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation Methods</title>
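        <p>Self-localization is evaluated by the error between the estimated position and the true
(actual) position. This error is the distance between the two points shown in Figure 7 and is
calculated using Equation (1), where <inline-formula><tex-math>(x_e, y_e)</tex-math></inline-formula>
denotes the estimated position and <inline-formula><tex-math>(x_t, y_t)</tex-math></inline-formula>
the true position.</p>
        <p><disp-formula id="eq1"><label>(1)</label><tex-math>E = \sqrt{(x_e - x_t)^2 + (y_e - y_t)^2}</tex-math></disp-formula></p>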
        <p>In the experiment, the Tello flies the pre-planned Paths 1 to 3 depicted in Figures 8 to 10.
Along these paths, the Tello maintains a constant orientation and does not turn. The error is
calculated at the red checkpoints designated along each path, and the average error over these
checkpoints is computed and compared.</p>
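        <p>The evaluation procedure, i.e., Equation (1) at each checkpoint, averaging over the checkpoints of a path, and comparison across the number of images per estimation, can be sketched as follows. The coordinates and image counts below are made-up placeholders chosen so that the arithmetic is easy to follow; they are not measurements from this study, and the function names are hypothetical.</p>
        <preformat preformat-type="python">
```python
import math
from statistics import mean


def checkpoint_error(est, true):
    """Equation (1): planar Euclidean distance between estimated and true positions."""
    return math.hypot(est[0] - true[0], est[1] - true[1])


def average_error(estimates, ground_truth):
    """Mean checkpoint error along one path."""
    if len(estimates) != len(ground_truth):
        raise ValueError("each checkpoint needs exactly one estimate")
    return mean(checkpoint_error(e, g) for e, g in zip(estimates, ground_truth))


def best_setting(runs, ground_truth):
    """Return (images per estimation, mean error) with the lowest mean error."""
    scores = {n: average_error(est, ground_truth) for n, est in runs.items()}
    n_best = min(scores, key=scores.get)
    return n_best, scores[n_best]


# Placeholder checkpoints in cm for one hypothetical path (not data from this study).
truth = [(0.0, 100.0), (0.0, 200.0), (100.0, 200.0)]
runs = {
    5: [(3.0, 104.0), (6.0, 208.0), (109.0, 212.0)],   # errors 5, 10, 15
    25: [(0.0, 106.0), (6.0, 200.0), (100.0, 194.0)],  # errors 6, 6, 6
}
print(best_setting(runs, truth))  # (25, 6.0)
```
        </preformat>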
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Discussion</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In recent years, small drones have been widely used in various fields such as logistics, agriculture,
and disaster relief. Among them, the Tello has gained popularity among individual users
due to its affordable price and ease of operation. However, autonomous flight of the Tello requires
self-localization. In this study, we aimed to realize self-localization for the Tello. The Tello is
equipped with a monocular camera, and we performed self-localization with visual odometry using
the image frames captured by this camera. Visual odometry estimates the camera’s movement
by identifying corresponding features between consecutive images. In the experiment, we flew
the Tello along pre-set paths and verified the accuracy of self-localization. The results show
that using 5 images per estimation minimizes the average error for Path 1, whereas using
25 images per estimation minimizes the average error for Paths 2 and 3. These
results clarify the relationship between the number of images and accuracy/processing time
and contribute to the optimization of the self-localization model for the Tello. However, further
improvement in self-localization accuracy remains a challenge. To address it, we plan to
complement the monocular camera with other sensors, such as a gyroscope and a barometer.</p>
      <p>This work is partly supported by a research grant provided by the First Bank of Toyama, Ltd.
and partly commissioned by NEDO (Project Number JPNP22006).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Scaramuzza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fraundorfer</surname>
          </string-name>
          ,
          <article-title>Visual odometry [tutorial]</article-title>
          ,
          <source>IEEE robotics &amp; automation magazine</source>
          <volume>18</volume>
          (
          <year>2011</year>
          )
          <fpage>80</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Couturier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Akhloufi</surname>
          </string-name>
          ,
          <article-title>A review on absolute visual localization for UAV</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>135</volume>
          (
          <year>2021</year>
          )
          <fpage>103666</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Molton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Stasse</surname>
          </string-name>
          ,
          <article-title>MonoSLAM: Real-time single camera SLAM</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>29</volume>
          (
          <year>2007</year>
          )
          <fpage>1052</fpage>
          -
          <lpage>1067</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Civera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Montiel</surname>
          </string-name>
          ,
          <article-title>Inverse depth parametrization for monocular SLAM</article-title>
          ,
          <source>IEEE transactions on robotics</source>
          <volume>24</volume>
          (
          <year>2008</year>
          )
          <fpage>932</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiuso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Favaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soatto</surname>
          </string-name>
          ,
          <article-title>Structure from motion causally integrated over time</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>24</volume>
          (
          <year>2002</year>
          )
          <fpage>523</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Eade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Drummond</surname>
          </string-name>
          ,
          <article-title>Scalable monocular SLAM</article-title>
          ,
          <source>in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Strasdat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Montiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <article-title>Visual SLAM: why filter?</article-title>
          ,
          <source>Image and Vision Computing</source>
          <volume>30</volume>
          (
          <year>2012</year>
          )
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <article-title>Parallel tracking and mapping for small AR workspaces</article-title>
          ,
          <source>in: 2007 6th IEEE and ACM international symposium on mixed and augmented reality</source>
          , IEEE,
          <year>2007</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mur-Artal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. M.</given-names>
            <surname>Montiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Tardos</surname>
          </string-name>
          ,
          <article-title>ORB-SLAM: a versatile and accurate monocular SLAM system</article-title>
          ,
          <source>IEEE transactions on robotics</source>
          <volume>31</volume>
          (
          <year>2015</year>
          )
          <fpage>1147</fpage>
          -
          <lpage>1163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cremers</surname>
          </string-name>
          ,
          <article-title>Direct sparse odometry</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>611</fpage>
          -
          <lpage>625</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Forster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pizzoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scaramuzza</surname>
          </string-name>
          ,
          <article-title>SVO: Fast semi-direct monocular visual odometry</article-title>
          ,
          <source>in: 2014 IEEE international conference on robotics and automation (ICRA)</source>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>