The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) Dense-Loop: A Loop Closure Detection Method for Visual SLAM using DenseNet Features Chao Yu ZuXin Liu Tsinghua University, Beijing, China 100084 Beihang University, Beijing, China, 100191 yc19@mails.tsinghua.edu.cn xinye@buaa.edu.cn Xin-Jun Liu*, Fei Qiao*, Yu Wang, Fugui Xie, Qi Wei,Yi Yang Tsinghua University, Beijing, China 100084 {xinjunliu, qiaofei, yu-wang, xiefg, weiqi, yangyy}@mail.tsinghua.edu.cn Abstract Loop closure detection (LCD) is an important part in SLAM for the autonomous mobile robot. A recent trend is to employ off-the-shelf networks’ features to address LCD problem, which outperform tradi- tional hand-crafted features. However, what kind of network is more suitable in LCD and how to use their CNN features have not been well- studied. In this paper, we compare many popular networks and intro- duce DenseNet in this field. The features extracted by DenseNet, which preserve both semantic information and structure details, outweigh other popular CNN features signicantly. Then a DenseNet feature- based framework (Dense-Loop) is proposed to address the LCD prob- lem. We use the Weighted Vector of Locally Aggregated Descriptor (WVLAD) method to encode the local descriptors as the final global de- scriptor, which could resist geometry structure and viewpoint changes. Furthermore, 4 max-pooling by channel and locality-sensitive hashing (LSH) are adopted to ensure the real-time search. Extensive experi- ments are conducted on public datasets using Precision-Recall Curve evaluation method. The results demonstrate Dense-Loop could achieve state-of-the-art performance. 1 Introduction In recent years, the combination of semantics and SLAM has become a research hotspot, and many related works have appeared, such as DS-SLAM[YLL+ 18], DA-RNN[XF17] and so on. Most of these SLAM systems utilize semantics in Visual Odometry (VO) and Mapping, while introducing semantic information into loop closure detection (LCD) is indispensable and requires further research. Visual place recognition is a basic part in re-localization and loop closure detection for mobile robots[LSN+ 16]. If the robot could determine whether an image of a place has been visited before, then this information could help the robot re-localize itself, or correct the error and drift accumulated in the simultaneous localization and mapping (SLAM) process[LM13, MAT17]. Copyright © 2019 by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 27 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) However, this problem is very challenging. On the one hand, the same place may have different appearances at different time due to the illumination or viewpoint changes. On the other hand, two different places may have the similar texture and appearance. A false positive recognition of a place may corrupt the global optimization process and cause severer unrecoverable localization and mapping failure[Cum08]. Many effective methods have been proposed to solve loop closure detection problem in robotics field. One of the most prevalent methods is visual bag-of-words (BoWs)[MAT17, Cum08], which treats descriptors of local features as visual words. This kind of method can achieve good performance on place recognition, and it is robust against viewpoint changes. However, the hand-crafted features can hardly deal with environment changes, such as the illumination changes and similar textured regions[Cum08, UMCM14, GSM18]. Recently, many researchers have found the features extracted from off-the-shelf convolutional neural networks (CNN) have better performance than hand-crafted features[KSH12] and began to investigate how to use CNN features in LCD[CLJM14, SSD+ 15, AGT+ 18, SSJ+ 15, BWZ+ 16]. Even so, the research in this field is preliminary and incomplete, partially because of the weak interpretability of neural networks. Before delving into the paper, we first see some frequently asked questions when people want to employ CNN in LCD. First of all, there are numerous outstanding neural network architectures, which one is more suitable for LCD and what is the reason? Secondly, CNN features vary from hand-crafted features in respect of the quantity and dimension. Is traditional loop closure detection framework (such as BoWs) suitable for CNN features? If not, do we have better solutions? In this paper, we try to explore these problems in depth and give corresponding explanations. The main contributions include: 1. We compare many off-the-shelf networks and find DenseNet outweighs other popular networks in loop closure detection, because this dense-connected network could preserve both semantic information and structure details of the input image. 2. A loop closure detection framework (Dense-Loop) using DenseNet features is proposed in this paper. Decoupling by feature-maps (DBF) and Weighted Vector of Locally Aggregated Descriptor (WVLAD) method is utilized to make full use of DenseNet features according to its own distinctions. 3. Extensive experimental results show Dense-Loop approach could achieve state-of-the-art performance on public datasets. In the rest of the paper, the structure is as follows. Section 2 briefly introduces some current accomplishments of loop closure detection. Section 3 presents the proposed framework in detail. Subsequently, extensive compar- ative experiments and evaluation are presented in Section 4. Finally, a brief conclusion and the future work are summarized in Section 5. 2 Related works We categorize current accomplishments on loop closure detection into three groups: traditional hand-crafted feature-based approaches, end-to-end training approaches, and approaches based on the CNN features extracted from off-the-shelf networks. Many well-designed local features are widely used in place recognition and loop closure detection tasks because their ability to resist scale changes or orientation changes. One of the most successful use is FAB-MAP, which employs SUFT[BETG08] and BoWs for place recognition and demonstrates robust performance against viewpoint changes[Cum08].[MAT17] integrate ORB[RRKB11] and BoWs in SLAM. This kind of method becomes the most popular framework to detect loop closure in real-time visual SLAM systems. However, these hand-crafted features only care about low-level information of the image and can hardly deal with environment changes, such as illumination changes. Furthermore, these statistics based methods’ performance depends heavily on the quality of the features and may be easily deceived by the textured dynamic objects in the environment. Considering the shortcomings of the hand-crafted features, a recent trend in loop closure detection is to train a CNN network in an end-to-end manner. NetVLAD[AGT+ 18] is a novel architecture which aims to minimize the distance of two image representations of the same place. The training images are categorized into many tuples, where each training query image has corresponding potential positive samples and definite negative samples.[LAGOPGJ17] adopt the similar triplet training scheme and could produce a 128 dimension descriptor vector for each image. However, all of these supervised learning approaches require a large amount of labeled datasets to train. It is also a bottleneck for others to use the network for their own needs. Another trend is to exploit the learned features of the off-the-shelf networks with pre-trained weights. [CLJM14] employ CNN features based on OverFeat for place recognition. The performance of feature-maps 28 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) DBF WVLAD 4 max-pooling 4 Max-pooling Weighting W DenseNet by channel relu5_blk DBF x x H=7 H=7 H=7 Image i C=1024 C =256 C=256 Feature map weight Channel W=7 W=7 C=256 W=7 weight Loop closure VLAD detection I Image i Encoding Calculate score LSH Add image i into the K=512 Image j Image k hash table Image i Image j LSH D=256 Figure 1: The pipeline of Dense-Loop of different layers is explored. [HZZ15] focus on using AlexNet to generate an image representation appropriate for visual loop closure detection in SLAM. They find CNN features outperform hand-crafted features when illu- mination changes signicantly.[SSD+ 15] deploy pre-trained AlexNet as CNN features and using locality-sensitive hashing and semantic search space partitioning optimization techniques to ensure real-time search. These kind of methods do not require specific end-to-end training and thus are more convenient. The feature could be extracted without interference to the pre-trained networks that designed for other tasks. However, since there are numerous outstanding network architectures in recent years, which one is better and how to make good use of its inner features have not been fully explored. In this paper, we will explore what kind of network is more suitable in LCD and how to use them to achieve better performance without specific supervised training. 3 Framework of Dense-Loop In the proposed framework, the output of ReLu layer in the last dense block of DenseNet is adopted as the initial features and decoupling by feature-maps (DBF) is utilized to decompose the global feature into local descriptors. Then, 4 max-pooling by channel is adopted to reduce the computational complexity. Finally, Weighted Vector of Locally Aggregated Descriptor (WVLAD) method is proposed to improve the ability of resisting scale or viewpoint changes. To accelerate the searching process, locality-sensitive hashing (LSH)[RPH05] is employed according to the characteristic of Dense-Loop descriptors. The pipeline of Dense-Loop is shown in Figure 1, where C, H, W represent the dimension of the channel, the height and weight of feature-maps. K is the number of cluster centers and D represents the dimension of one cluster center. 3.1 Image descriptors extraction In the traditional BoWs, a lot of disordered local descriptors with low dimensions are extracted and they are designed to resist scale or viewpoint changes. However, CNN features are ordered and 3-dimension. Therefore, the first thing is to exact good features from CNN and map them to 2-dimension. 3.1.1 DenseNet features DenseNet is a compact network and made up of dense blocks. All layers in one dense block are directly connected to ensure maximum information flow between feature-maps. The input of each layer is all the preceding layers’ output, and thus, the block’s final classifier could obtain all the information of the previous feature-maps. This kind of compact internal representation could reduce feature redundancy and help to solve vanishing-gradient problem. The architecture of a 5-layer block in DenseNet is shown in Figure 2(a). DenseNet adopted in Dense- Loop is made up of 5 dense blocks. The output of ReLu layer in the last dense block is used as the raw features of the input image, where 7 × 7 is the size of feature-maps and 1024 is the number of channels. The reason for choosing the ReLu layer is that it is cleaner and contains less noise. 29 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) The reason of using DenseNet is its reuse of feature-maps. The features of low layers contain more structural information and measure fine-grained similarity, which is similar to hand-crafted features. While the features of higher layers care more about semantic information and measure semantic similarity. A natural idea is to utilize the complementary of high-layer and low-layer features. The outputs of last few layers preserve all extracted features of preceding layers, which means, the low-level features and high-level features are merged together in an efficient way. It is helpful for more fine-grained features expression of an image. The superiority of DenseNet will be illustrated in the experiment section in detail. 3.1.2 DBF and 4 max-pooling by channel Here are two ways to map these features to 2-dimension, as shown in Figure 2(b). One is decomposing the global feature into 49 local descriptors with 1024 dimensions, called decoupling by feature-maps (DBF). Anther way is to decompose 1024 local descriptors with 49 dimensions, called decoupling by channel (DBC). The former plan is chosen because it is of physical meaning, and it has better performance than DBC. Each pixel in the feature- map is corresponding to a receptive field in the input image, and all the channels of the pixel could describe the distinctions of the corresponding receptive field. As for DBC, it’s more like using many global descriptors to describe an image. But image’s viewpoint change may cause a shift in the feature-maps and thus the ability to resist geometry structure or viewpoint changes will be weaken. In order to ensure the real-time search, a method called 4 max-pooling by channel is proposed to reduce the descriptors’ dimensions with minimal accuracy reduction. 1024-dimension descriptors are divided into 256 groups and the maximum value of each group is used as the final descriptor. Compared with PCA, which is widely used to reduce dimensions, 4 max-pooling by channel has less computational complexity but similar performance. More results can be found in the experimental part. 3.2 WVLAD method In the traditional BoWs, BoW encoding method is used to measure the similarity of two images. BoWs is a statistical method and usually needs a large number of visual words (e.g. 106 ) in the dictionary. A lot of local descriptors with low dimensions are more suitable in this situation, while the CNN descriptors, which are decoupled by feature-maps, often have small quantity but large dimensions. Besides, it is hard to train such a huge BoW dictionary. Instead, Weighted Vector of Locally Aggregated Descriptor (WVLAD) is proposed in this paper to encode the 49 × 256 local descriptors of an image. WVLAD could ignore the geometric structure of the image via clustering and care more about the dis- tinctions via weight. Therefore, it’s more resistant to viewpoint and scale changes than calculating euclidean distance of CNN features. It is an improved method of famous Vector of Locally Aggregated Descriptor (VLAD)[JDSP10] method and inspired by Cross-dimensional Weighting for Aggregated Deep Convolutional Features (CROW)[KMO16] method. Usually we want the descriptors care more about the distinctions of the image and reduce the importance of the plain areas (e.g. sky). It’s similar to the human perception system, which is conducive to improving resistance to environment changes. One way is to use region proposal methods and compute regions’ descriptors respectively. Another way is to adopt the self-adaptive weight methods to adjust the importance of the textured regions and ordinary areas. The first way is computational expensive. Considering the need for real time, the second way is integrated in Dense-Loop. Figure 3 shows the detailed process of calculating the feature-maps weight (FW) and the channel weight (CW). The strong response of convolution is usually corrsponding to the region of objects. FW can force features to focus on the textured regions and help solving scale changes. Let F ∈ R(C×H×W ) denotes the 3-dimension features of the inner layer. X ∈ R(H×W ) represents one feature-map. c, h, w is the location of the feature vector. F W ∈ R(H×W ) can be calculated by summing feature-maps of all channels. Then L2-norm and a power normalization with power 0.5 are utilized to get aggregated feature-maps weight. X S= Xc (1) c sX S0 = 2 Sh,w (2) h,w p FW = S/S 0 (3) 30 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) (a) A 5-layer dense block DBF 49 local descriptors 1024-dimension DBC H=7 1024 local C=1024 descriptors W=7 49-dimension (b) DBF and DBC Figure 2: (a) A 5-layer dense block with a growth rate of k = 4. The figure is reproduced from[HLvMW17]. (b) The description of DBF and DBC. CW ∈ R(1×C) is similar to the idea of inverse documentary frequency (IDF) in BoWs, that is, reducing the importance of high-frequent features. P Xh,w >0 1 Tc = (4) H ×W ( PC Tc CWc = log( c=1 Tc ), Tc > 0 (5) 0, Tc = 0 Then, we can calculate the weighted feature-maps Fweight ∈ R(C×H×W ) . And decompose it into weighted local descriptors L, which means 49 local features with 256 dimensions. Fc0 = Fc × F W (6) 0 Fweight = Fc,h,w × CWc (7) In order to improve the ability of resisting geometry structure or viewpoint changes, VLAD is used to encode weighted local descriptors as a global descriptor. Firstly K-means is used to cluster all the weighted local descriptors of the datasets and get the codebook {u1 , ..., uK }, where K is the number of cluster centers. Each local descriptor Li has its corresponding cluster center uj : N N (Li ) = argminj kLi − uj k, where NN represents nearest neighbor. VLAD is denoted as a set of vector V = [v1T , ..., vK T ], where each vi is associated with a cluster center ui and has the same size. Then V is calculated by the concatenation of the residual of each Li and N N (Li ): X vi = Lt − ui (8) Lt :N N (Lt )=i Finally, a power normalization with power 0.5 and L2-norm is utilized to normalize V . 31 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) Feature-maps L2-Norm weight(FW) + ĊĊ + Power normalization FW H=7 C=256 ĊĊ Channel W=7 weight(CW) Input features ĊĊ CW C=256 Figure 3: The detailed process of calculating FW and CW. 3.3 Locality-Sensitive Hashing An important feature of loop closure detection for robotic application (e.g. SLAM) is real-time. In the traditional BoWs, K-D tree is adopted as the nearest neighbor search. However, the spatial dimension of Dense-Loop descriptors is far more than the number of words in the codebook, K-D tree will be unsuitable in such case. Instead, locality-sensitive hashing (LSH) is employed to speed-up the search with minimal accuracy degradation. The detailed process is shown in Figure 1. The Hamming distance between the respective hashed bit vectors, which is a cheap operation, is used to evaluate the similarity. According to our test, using 1024 bits retains approximately 99% performance but much more quick than brute search. 4 Experimental Results and Explanations 4.1 Datasets and evaluation method City Center dataset[Cum08] and New College dataset[Cum08] are widely used in visual SLAM research and loop closure detection evaluation in particular. The former dataset has many dynamic objects like pedestrians and vehicles. Besides, the sunlight, wind and viewpoint change may cause the features like shadow unstable. The latter New College dataset has many dynamic elements and repeated elements, such as similar walls and bushes. Ground truth are given in two datasets. Figure 4 shows the ground truth and the results of Dense-Loop. (a) Ground truth (b) Results of Dense-Loop Figure 4: The ground truth and the results of Dense-Loop on New College Dataset. Pixel (i, j) represents the relationships of image i and image j. However, the provided ground truth can’t be used directly. It’s inconsistent with the goal of loop closure detection because we only need to identify one loop in the same place. Therefore, new definition of the true loop are made based on the original ground truth. The images in one dataset are divided into two groups, named left 32 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) and right, and so is the ground truth. If a loop is detected, we will stop searching loop in 10 images (according to GPS) to avoid getting the same loop. When we vary the threshold if a loop closure is accepted, the precision and recall value will change and the PR-Curve can be gained. 4.2 Experiments and evaluation Some comparative experiments are conducted to explore the validity of Dense-Loop. Dense-Loop could achieve state-of-the-art performance on public datasets, The reason can be summarized as two points. One is excellent features from DenseNet, which take high-level semantic information and fine-grained information into account. Another is WVLAD method, which could ignore the geometric structure of the image via clustering and care more about the distinctions via weight. 4.2.1 Why DenseNet In recent years, there are many prevalent and excellent convolutional networks showing up, such as ResNet50[HZRS16], VGG[SZ14], DPN[CLX+ 17], SENet[HSS17], ResNeXt[XGD+ 17], NasNet[ZVSL17], SqueezeNet[IMA+ 16], Xception[Cho17], Inceptionv3[SVI+ 15], Inceptionv4 and Inception-ResNet[SIV17]. To ver- ify the excellent features of DenseNet, extensive comparative experiments were conducted. Figure 5 exhibits the PR-Curves of different networks on New College dataset. Curves are named by the following formats: network name layer name. For example, DenseNet relu5 blk represents the features extracted from relu5 blk layer of DenseNet. All the networks are pre-trained on the ImageNet2012 dataset and euclidean distance is adopted as the similarity score. The layer with best performance in each network is chosen to draw in the figure and it is apparent that DenseNet outweighs other popular network architectures. 1.0 1.0 0.8 0.8 0.6 0.6 Precision Precision DenseNet121_relu5_blk 0.4 DenseNet121_relu5_blk 0.4 ResNet50_activation47 DPN107_conv5_bn_ac_act Xception_block14_sepconv2_act SENet154_layer4_2_se_relu VGG16_pool5 0.2 se_resnext101_64_features_7_2 0.2 Inceptionv3_mixed10 SqueezeNet_11_expand3x3_activation InceptionResnetv2_conv_7b_ac NasNetlarge_normal_add_5_18 Inceptionv4_concatenate_25 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Figure 5: The PR-Curves of different networks on New College dataset. Figure 6 shows the euclidean distance of images on New College dataset when employing DenseNet and Xception respectively. The high-level features of Xception, which care more about semantic information, have a poorer discrimination on images than those of DenseNet. A common method to combine various levels’ features is to concatenate them directly, but DenseNet aleady did this during the forward processing. The output of the last few layers integrate both low-level and high-level features naturally. 4.2.2 Why DBF and 4 max-pooling Figure 7(a) shows the PR-Curves of DBF and DBC on City Center dataset. In order to make a quick comparison, euclidean distance is adopted as the similarity score. It is obvious that DBF far outweighs DBC and similar results can be gained on New College dataset. Figure 7(b) illustrates the PR-Curves of different dimensionality reduction methods on City Center dataset. The label named relu5 blk means the original features without dimensionality reduction. The label named 4 max-pooling by channel represents applying 4 max-pooling to the feature’s channel dimension. The label named 256 PCA means reducing the channel dimension to 256 through PCA method. We can observe that utilizing 4 max-pooling by channel can maintain 99% accuracy and have alomost the same performance as PCA. Considering the processing time, 4 max-pooling by channel is adopted finally. 33 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) (a) DenseNet (b) Xception Figure 6: The euclidean distance of images on New College Dataset. 1.0 DBF 1.0 DBC 0.8 0.8 0.6 0.6 Precision Precision 0.4 0.4 0.2 0.2 4 max-pooling by channel 256 PCA relu5_blk 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Figure 7: The PR-Curves of DBF V.S. DBC and different dimensionality reduction methods on City Center dataset. 4.2.3 Why WVLAD In order to compare the performance with traditional methods, two hand-crafted features (ORB and SIFT) and two encoding methods (BoW and VLAD) are adopted. The VLAD codebooks have 512 cluster centers, just the same as Dense-Loop, while BoW codebooks have 10000 visual words. The results on two datasets are shown in Figure 8. It’s clear that WVLAD could achieve better performance than BoW and VLAD encoding method based on DenseNet. And we can notice Dense-Loop far outweighs hand-crafted features. Here are two typical examples. In Figure 9(a) and 9(b), high similarity score is obtained based on hand-crafted features because of similar textured regions on the trees and sky, while score of Dense-Loop is close to zero in this case. This is because Dense-Loop could utilize high-level semantic and global information to judge the similarity. In Figure 9(c) and 9(d), Dense-Loop can recognize the two images as the same place with high score but hand-crafted features can’t achieve that due to illumination changes. Besides, in this case, we can also find Dense-Loop can resist the viewpoint changes. As for WVLAD and VLAD, WVLAD can reduce channel redundancy by CW and focus on the distinguished and unique parts of the image by FW. Therefore, better performance can be obtained in some cases by solving the problem of scale and viewpoint changes. 5 Conclusion Loop closure detection is used to detect if the robot has passed through the same place. It’s crucial for the robot to establish a globally consistent map, especially for large and long-term scenes. A framework of loop closure detection based on CNN features is proposed in this paper. We find that features extracted from DenseNet outweigh hand-crafted features and other popular networks’ features. The reason is DenseNet can preserve both semantic information and structure details of the input image via dense connection. In order to improve the ability of resisting scale or viewpoint changes, decoupling by feature-maps (DBF) and Weighted Vector of Locally Aggregated Descriptor (WVLAD) method is utilized to make full use of DenseNet features according to its own distinctions. Locality-sensitive hashing (LSH) and 4 max-pooling by channel are adopted to ensure the 34 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) 1.0 1.0 0.8 0.8 0.6 0.6 Precision Precision ORB_BoW ORB_BoW 0.4 SIFT_BoW 0.4 SIFT_BoW DenseNet_BoW DenseNet_BoW ORB_VLAD ORB_VLAD 0.2 SIFT_VLAD 0.2 SIFT_VLAD DenseNet_VLAD DenseNet_VLAD Dense-Loop Dense-Loop 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall (a) City Center dataset (b) New College dataset Figure 8: The PR-Curves of DenseNet V.S. (ORB, SIFT) and Dense-Loop V.S. (BoW, VLAD) (a) (b) (c) (d) Figure 9: Picture (a) and (b) with ORB features come from different scenes, but they share similar textured regions (e.g. trees and sky). Picture (c) and (d) with ORB features come from the same place, but they have different apparences, such as illumination changes. real-time search for robotic application. Extensive experiments illustrate Dense-Loop approach could achieve state-of-the-art performance on public datasets. However, the impact of the training datasets on the network’s performance has not been investigated. In the future, we will conduct more extensive experiments to explore the generalization ability of Dense-Loop, which is important in real-world robot applications. Besides, we would consider to utilize semantic information of the network’s prediction results and establish a multi-level semantic knowledge base to speed up the search and improve the loop closure detection performance. Acknowledgement This work was supported in part by the National Natural Science Foundation of China under Grant 91648116 and 51425501. References [AGT+ 18] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine In- telligence, 40(6):1437–1451, June 2018. [BETG08] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer Vision and Image Understanding, 110(3):346–359, 2008. [BWZ+ 16] Dongdong Bai, Chaoqun Wang, Bo Zhang, Xiaodong Yi, and Yuhua Tang. Matching-range- constrained real-time loop closure detection with cnns features. Robotics and Biomimetics, 3(1):15, Sep 2016. [Cho17] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 00, pages 1800–1807, July 2017. 35 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) [CLJM14] Zetao Chen, Obadiah Lam, Adam Jacobson, and Michael Milford. Convolutional neural network- based place recognition. CoRR, abs/1411.1509, 2014. [CLX+ 17] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. CoRR, abs/1707.01629, 2017. [Cum08] M Cummins. Fab-map : Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008. [GSM18] S. Garg, N. Suenderhauf, and M. Milford. Don’t look back: Robustifying place categorization for viewpoint- and condition-invariant place recognition. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3645–3652, May 2018. [HLvMW17] G. Huang, Z. Liu, L. v. Maaten, and K. Q. Weinberger. Densely connected convolutional net- works. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, July 2017. [HSS17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017. [HZRS16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016. [HZZ15] Y. Hou, H. Zhang, and S. Zhou. Convolutional neural network-based image representation for visual loop closure detection. In 2015 IEEE International Conference on Information and Automation (ICInfA), pages 2238–2245, Aug 2015. [IMA+ 16] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016. [JDSP10] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 3304–3311, June 2010. [KMO16] Yannis Kalantidis, Clayton Mellina, and Simon Osindero. Cross-dimensional weighting for ag- gregated deep convolutional features. In Gang Hua and Hervé Jégou, editors, Computer Vision – ECCV 2016 Workshops, pages 685–701, Cham, 2016. Springer International Publishing. [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), NIPS’12, pages 1097–1105, USA, 2012. Curran Asso- ciates Inc. [LAGOPGJ17] Manuel Lopez-Antequera, Ruben Gomez-Ojeda, Nicolai Petkov, and Javier Gonzalez-Jimenez. Appearance-invariant place recognition by discriminatively training a convolutional neural net- work. Pattern Recognition Letters, 92:89–95, 2017. [LM13] Mathieu Labbe and Francois Michaud. Appearance-based loop closure detection for online large- scale and long-term operation. IEEE Transactions on Robotics, 29(3):734–745, June 2013. [LSN+ 16] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J. Leonard, David Cox, Peter Corke, and Michael J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2016. [MAT17] Raúl Mur-Artal and Juan D. Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017. [RPH05] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 622–629, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. 36 The 1st International Workshop on the Semantic Descriptor, Semantic Modelingand Mapping for Humanlike Perceptionand Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19) [RRKB11] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), ICCV ’11, pages 2564–2571, Washington, DC, USA, 2011. IEEE Computer Society. [SIV17] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Multi-scale orderless pooling of deep convolutional activation features. In Proceeding of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 4278–4284, 2017. [SSD+ 15] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford. On the performance of convnet features for place recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4297–4304, Sept 2015. [SSJ+ 15] Niko Sünderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. Place recognition with convnet landmarks: Viewpoint-robust, condition- robust, training-free. In Robotics: Science and Systems (RSS), Auditorium Antonianum, Rome, July 2015. [SVI+ 15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. [SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. [UMCM14] B Upcroft, C Mcmanus, W Churchill, and W Maddern. Lighting invariant urban street classifica- tion. In IEEE International Conference on Robotics and Automation (ICRA), pages 1712–1718, Hong Kong, China, 2014. IEEE. [XF17] Yu Xiang and Dieter Fox. DA-RNN: semantic mapping with data associated recurrent neural networks. CoRR, abs/1703.03098, 2017. [XGD+ 17] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 00, pages 5987–5995, July 2017. [YLL+ 18] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei. Ds-slam: A semantic visual slam towards dynamic environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1168–1174, Oct 2018. [ZVSL17] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable archi- tectures for scalable image recognition. CoRR, abs/1707.07012, 2017. 37