Finding Strong Key Point Correspondences in Large-Scale Images and Depth Maps

Andrey Dashkevich1[0000−0002−9963−0998], Darya Vorontsova1[0000−0001−7868−0067], and Sergii Rosokha2[0000−0001−8239−6122]

1 National Technical University "Kharkiv Polytechnic Institute", Kharkiv 61002, Ukraine {dashkewich.a, dvorontso}@gmail.com
2 NPP "SPECPOZHTEHNIKA" Ltd, Kharkiv 61017, Ukraine

Abstract. Depth map fusion is an essential task in multi-view stereo vision and structure-from-motion methods, but matching depth maps of large-scale outdoor scenes acquired by unmanned aerial vehicles raises several issues. First, such depth maps can have low depth resolution due to the long distance from the scene to the camera; second, as a consequence of this distance and of camera defects, they can be noisy. In this work we propose an approach to depth map matching based on local shape and color similarity correspondences. The idea is to find similar regions by computing local key points in small patches of two images with the ORB feature detector. We then find corresponding regions by assuming spatial proximity of the movement vectors of key points; this is done by searching for the dominant vector direction and length in the feature space. We test our approach on video sequences acquired with a consumer multicopter.

Keywords: Depth Map Fusion · Multi-view Stereo Vision · Structure From Motion · Unmanned Aerial Vehicles · Key Points · ORB Feature Detector · Spatial Similarity · Nearest Neighbors Search.

1 Introduction

At present, applications of stereo vision methods are growing in number across production, robotics, virtual and augmented reality systems, geoinformation systems, computer games, and simulation. This creates a need for the development of efficient methods and algorithms for solving stereo vision problems.
The present paper describes the search for effective ways to solve the problem of matching large-scale depth maps obtained by unmanned aerial vehicles. Such depth maps can have a low depth resolution due to the large distance from the camera to the scene and, as a result, can be noisy.

There are several approaches to finding strong correspondences between images and depth maps, aimed at increasing the accuracy of 3D model reconstruction. Among them are various modifications of iterative closest points (ICP) [1], robust point matching [2], methods based on probability theory, correlation, and Gaussian mixture models [3], coherent point drift [4], and sorting of the correspondence space [5]. Owing to their simplicity of implementation and low computational complexity, modifications of the ICP algorithm have found wide application [6]. A detailed comparative review and ways to increase the accuracy of ICP matching algorithms can be found in [7]. In [8], the selection of sample points for increasing the accuracy of ICP algorithms was considered. Since ICP algorithms are greedy optimization techniques, the initial matching strongly influences the registration of the point sets. We therefore need methods that find a more precise initial approximation of the surface matching, to increase the probability of convergence to the global minimum. The main contribution of the presented work is an algorithm for finding initial correspondences between two surfaces, which decreases the time complexity of the ICP algorithms and increases their ability to converge to the global minimum.

2 Method Overview

Let F = {f1, f2, ..., fn} be a set of n consecutive frames and M = {m1, m2, ..., mn−1} a set of n − 1 depth maps acquired from pairs of adjacent images in F.
It is necessary to find pairs of points p, q ∈ R^3 such that for each point pj ∈ mi and the corresponding point qk ∈ m(i+1), i = 1..n−1, j, k = 1..|m|, where |·| is the number of pixels in a depth map:

qk = A pj + t,

where A is the 3 × 3 rotation matrix and t is the 3 × 1 translation vector. However, the strict correspondences between points of neighboring depth maps are unknown in advance. Thus, the task of finding the transformation A and t can be divided into two sub-tasks: finding the indices j, k of the corresponding pairs of points p, q on the depth maps, and registering the obtained point sets.

In our work we proceed from the assumption that adjacent frames differ by a small movement of points, consisting of a combination of translation and rotation; since the camera is significantly distant from the scene, we assume rectilinear motion. Under these assumptions, it can be expected that the movements of points in nearby sections of the images will have almost similar directions. The initial point correspondences are found in the original RGB image space using robust feature descriptors, such as the scale-invariant feature transform (SIFT) [9] or oriented FAST and rotated BRIEF (ORB) [10].

We propose the following method for detecting strong correspondences between points on large-scale depth maps:

1. We take the set of images F as input. Each image has size W × H.
2. We set the parameters for dividing images into local areas: the width w and the height h of a local image part.
3. At each iteration of the algorithm, we take the images fi and fi+1 from F and the corresponding depth map mi from M. Thus, we form the structure data = (fi, fi+1, mi).
4. We split fi and fi+1 into t rectangular sections (patches) pj, j = 1..t, Pi = {p1, p2, ..., pt}, where t = ⌊W/w⌋ · ⌊H/h⌋ and ⌊x⌋ is the integer part of x.
5.
The next step is to define the arrays of key points Ki and Ki+1 for each patch pj by the ORB algorithm, Ki = {kj}, j = 1..KN, where KN is the number of key points.
6. An initial mapping of points from Ki to Ki+1 is conducted. Any approximate nearest neighbors search method can be used for the mapping; we used the implementation from the FLANN library [11]. However, at this stage there are many false-positive pairs, which strongly influence the subsequent depth map matching process.
7. To remove incorrectly recognized point correspondences, we propose a novel approach based on dividing the parameter space of the key points into a regular grid and spatially indexing the resulting grid for nearest points search.
8. The obtained exact matches are transferred to the depth maps, which are then registered by the ICP method.

Let us describe our approach to determining the exact matches of key points. For each pair of key points we define the vector vj = kp(j+1) − kpj, where kpj denotes the coordinates of kj in the patch p. Each vector vj has an angle α and a length l and indicates the direction of key point movement between frames.

Assumption 1. In small areas of the image, correctly mapped points will give close values of α and l, while false-positive pairs of points will give arbitrary values.

Based on this assumption, it is possible to construct a classifier that finds closely spaced points in the α−l parameter space. To speed up the search for close elements in the α−l space, we discretize it by splitting it into a regular grid with A cells along the Oα axis and L cells along the Ol axis; thus the sizes of a grid cell are:

αsize = max α / A   (1)

lsize = max l / L   (2)

subject to α, l ≥ 0. Then we build an accumulator array Acc of size A × L and initialize each of its cells with zero.
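To make the construction above concrete, here is a minimal sketch in Python (an illustration with our own helper names, not the authors' implementation): given matched key point coordinates from two frames, it computes the movement vector of each pair, expresses it as an angle α (in degrees) and a length l, and derives the grid cell sizes of eqs. (1)-(2) together with an empty accumulator.

```python
import math

def movement_params(src_pts, dst_pts):
    """For each matched key point pair, compute the movement vector
    v_j = dst - src and express it as an angle alpha in degrees
    (folded into [0, 360)) and a length l."""
    params = []
    for (x0, y0), (x1, y1) in zip(src_pts, dst_pts):
        dx, dy = x1 - x0, y1 - y0
        alpha = math.degrees(math.atan2(dy, dx)) % 360.0
        length = math.hypot(dx, dy)
        params.append((alpha, length))
    return params

def cell_sizes(params, A=20, L=20):
    """Cell sizes of the regular A x L grid over the alpha-l space,
    following eqs. (1)-(2): alpha_size = max(alpha)/A, l_size = max(l)/L."""
    max_alpha = max(a for a, _ in params) or 1.0  # guard against all-zero angles
    max_len = max(l for _, l in params) or 1.0
    return max_alpha / A, max_len / L

def empty_accumulator(A=20, L=20):
    """Accumulator array Acc of size A x L, initialized with zeros."""
    return [[0] * L for _ in range(A)]
```

The angle is folded into [0, 360) so that the α ≥ 0 condition of the grid holds for leftward or downward motion as well.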
For each vj we discretize its α and l:

αjdiscr = ⌊αj / αsize⌋   (3)

ljdiscr = ⌊lj / lsize⌋   (4)

and then:

Acc[αjdiscr][ljdiscr] = Acc[αjdiscr][ljdiscr] + 1   (5)

A summary of our approach is provided in Fig. 1.

[Figure omitted: flow diagram — input images and depth maps → key point detection (ORB or SIFT descriptors) → initial matching with the FLANN library → movement vectors of key points → discretization of the angle-length space → hash-table building → rectification → rectified key point pairs.]

Fig. 1. Illustration of the proposed approach.

By Assumption 1, the cell with the maximal value corresponds to the correctly matched key point pairs, and we mark such pairs as "good" matches. To accelerate the computations we use a hash table H to store the indices of key points; the key of an element of H is calculated as follows:

hash = αjdiscr · L + ljdiscr   (6)

In this way we can find strong correspondences of key points and their consistent directions.

3 Experiments

We tested our algorithm on depth maps obtained from a video taken by a UAV at a depth that is large relative to the size of the scene. Image data and algorithm parameters:

• Frames of video with 8 Mpx resolution, so the size of all images and depth maps is (W × H) = (4096 × 2160) pixels.
• Subdivision parameters: w = W/8, h = H/8.
• Number of key points for the ORB detector: KN = 100.
• Number of discretized grid cells along each axis: A = L = 20.

The result of depth map rectification by our algorithm is demonstrated in Fig. 2.

Fig. 2. Key point pairs after FLANN matching (top) and after rectification (bottom).

4 Conclusions

In this paper we demonstrate an approach for estimating strong correspondences between key points in two sequential video frames. The proposed algorithm can eliminate false-positive matches from an array of key points. This is done by finding the major direction of key point movement in a parameter space.
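The major-direction search summarized above can be illustrated by a short sketch (hypothetical helper names; a minimal illustration, not the authors' implementation). Each (α, l) pair of a matched key point votes into the regular A × L grid of eqs. (3)-(5); the pair indices are stored in a hash table keyed by hash = αjdiscr · L + ljdiscr as in eq. (6), and only the pairs falling into the most populated cell are kept as "good" matches.

```python
from collections import defaultdict

def rectify_matches(params, A=20, L=20):
    """params: list of (alpha, l) movement parameters, one per matched
    key point pair, with alpha, l >= 0. Returns the indices of the pairs
    that land in the accumulator cell with the maximal number of votes."""
    max_alpha = max(a for a, _ in params) or 1.0
    max_len = max(l for _, l in params) or 1.0
    a_size, l_size = max_alpha / A, max_len / L      # eqs. (1)-(2)
    table = defaultdict(list)                        # hash -> pair indices
    for j, (alpha, length) in enumerate(params):
        a_d = min(int(alpha // a_size), A - 1)       # eq. (3); clamp max into last cell
        l_d = min(int(length // l_size), L - 1)      # eq. (4)
        table[a_d * L + l_d].append(j)               # eq. (6); list length plays
                                                     # the role of Acc, eq. (5)
    return max(table.values(), key=len)              # cell with maximal count
```

Storing the indices per hash key means the accumulator counts of eq. (5) come for free as list lengths, and the winning cell immediately yields the rectified key point pairs rather than just their number.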
The limitation of our approach is that it can delete true positive correspondences between key points; thus, future work is aimed at improving the quality of correct pair matching. Another disadvantage is that our algorithm can match key points only when the camera or the scene moves.

References

1. Bouaziz, S., Tagliasacchi, A., Pauly, M.: Sparse Iterative Closest Point. Computer Graphics Forum 32, 113–123 (2013). https://doi.org/10.1111/cgf.12178
2. Gold, S., Rangarajan, A., Lu, C.-P., Pappu, S., Mjolsness, E.: New algorithms for 2D and 3D point matching: pose estimation and correspondence. Pattern Recognition 31(8), 1019–1031 (1998).
3. Evangelidis, G.D., Horaud, R.: Joint Alignment of Multiple Point Sets with Batch and Incremental Expectation-Maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 1397–1410 (2018). https://doi.org/10.1109/TPAMI.2017.2717829
4. Myronenko, A., Song, X.: Point-Set Registration: Coherent Point Drift. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 2262–2275 (2010). https://doi.org/10.1109/TPAMI.2010.46
5. Assalih, H.: 3D reconstruction and motion estimation using forward looking sonar. Doctoral dissertation, Heriot-Watt University, 208 p. (2013).
6. Park, S.-Y., Subbarao, M.: An accurate and fast point-to-plane registration technique. Pattern Recognition Letters 24, 2967–2976 (2003). https://doi.org/10.1016/S0167-8655(03)00157-0
7. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. IEEE Comput. Soc, Quebec City, Que., Canada (2001).
8. Gelfand, N., Ikemoto, L., Rusinkiewicz, S., Levoy, M.: Geometrically stable sampling for the ICP algorithm. In: Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM 2003), Proceedings, pp. 260–267. IEEE, Banff, Alberta, Canada (2003).
9.
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE, Kerkyra, Greece (1999).
10. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE, Barcelona, Spain (2011).
11. Muja, M., Lowe, D.G.: Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In: International Conference on Computer Vision Theory and Applications (VISAPP'09), pp. 331–340. INSTICC Press (2009).