<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UAIC participation at Robot Vision @ 2012 - An updated vision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuela Boros</string-name>
          <email>emanuela.boros@info.uaic.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandru Lucian Ginsca</string-name>
          <email>lucian.ginsca@info.uaic.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Iftene</string-name>
          <email>adiftene@info.uaic.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexandru Ioan Cuza University, Faculty of Computer Science General Berthelot</institution>
          ,
          <addr-line>16, 700483, Iasi</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>In this paper we describe a system that participated in the fourth edition of the Robot Vision task at the ImageCLEF benchmarking activity, for which we approached the task of topological localization without using the temporal continuity of the sequences of images. We provide details of the state-of-the-art methods that were selected: Color Histograms, SIFT (Scale Invariant Feature Transform), ASIFT (Affine SIFT) and RGB-SIFT, and a Bag-of-Visual-Words strategy inspired by the text retrieval community. We focused on finding the optimal set of features, and a deepened analysis was carried out. We offer an analysis of the different features and similarity measures, and a performance evaluation of combinations of the proposed methods for topological localization. Also, we detail a genetic algorithm that was used for eliminating false positive results. In the end, we draw several conclusions targeting the advantages of using proper configurations of visual appearance descriptors, similarity measures and classifiers.</p>
      </abstract>
      <kwd-group>
        <kwd>Robot Topological Localization</kwd>
        <kwd>Global Features</kwd>
        <kwd>Invariant Local Features</kwd>
        <kwd>Visual Words</kwd>
        <kwd>SVMs</kwd>
        <kwd>Genetic Algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we present an approach to vision-based mobile robot localization that uses a single perspective camera within an office environment. The robot should be able to answer the question “where are you?” when presented with a test sequence representing a room category seen during training [
        <xref ref-type="bibr" rid="ref25 ref30 ref33">30, 33, 25</xref>
        ].
We analyze the problem without taking into consideration the temporal continuity of the sequences of images. We perform an exhaustive evaluation and introduce a new statistical comparison between quantization techniques over a large set of features, from which different system configurations are picked and tested.
      </p>
      <p>
        Traditionally, robot vision systems have relied heavily on different methods for robotic topological localization, such as topological map building, which makes good use of temporal continuity [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], panoramic vision creation [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ],
simultaneous localization and mapping [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], appearance-based place recognition for
topological localization [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], Monte-Carlo localization [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ].
      </p>
      <p>
        The problem of mobile topological localization has three main dimensions: the type of environment (indoor, outdoor, natural outdoor), the perception (sensing modality), and the localization model (probabilistic, basic). Numerous papers deal with indoor environments [
        <xref ref-type="bibr" rid="ref10 ref21 ref37 ref38">37, 38, 10, 21</xref>
        ] and a few deal with outdoor
environments, natural or urban [
        <xref ref-type="bibr" rid="ref13 ref36">36, 13</xref>
        ].
      </p>
      <p>
        Current work on robot localization in indoor environments has focused on introducing probabilistic models to improve local feature matching and on the integration of specific kernels. Experimental results for wide-baseline image matching suggest the need for local invariant descriptors of images. Invariant features have achieved relative success in object detection and image matching. There has also been research into the development of fully invariant features [
        <xref ref-type="bibr" rid="ref26 ref27 ref4">4, 26,
27</xref>
        ]. In his milestone paper [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], D. Lowe proposed the scale-invariant feature transform (SIFT), which is invariant to image scaling and rotation, as well as to illumination and viewpoint changes. Lately, a new method has been proposed, Affine-SIFT (ASIFT), which simulates all the views obtainable by varying the two camera-axis orientation parameters, namely the latitude and longitude angles [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>
        The Bag-of-Visual-Words [
        <xref ref-type="bibr" rid="ref12 ref8">8, 12</xref>
        ] model is a great addition to place recognition and was initially inspired by the bag-of-words models in text classification, where a document is represented by an unsorted set of the contained words. This data modeling technique was first introduced in the context of video retrieval [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. Due to its efficiency and effectiveness, it became very popular in the fields of image retrieval and classification [
        <xref ref-type="bibr" rid="ref20 ref43">20, 43</xref>
        ].
      </p>
      <p>
        The classification of images relies more on unsupervised than supervised learning techniques. Categorizing in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. In supervised learning scenarios, image categorization has been studied widely in the literature. Among supervised learning techniques, the most popular in this context are Bayesian classifiers [
        <xref ref-type="bibr" rid="ref12 ref18 ref19 ref8">8, 18, 12, 19</xref>
        ] and Support
Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref18 ref39 ref44 ref8">39, 8, 18, 44</xref>
        ]. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] also uses random forests. Indeed, state-of-the-art results are due to SVM classifiers: the method described in [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ], which combines a local matching of the features with specific kernels based on the Earth Mover's Distance [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] or the χ² distance [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], yielded the best results.
      </p>
      <p>
        Our approach represents an extension of our previous work [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], where each RGB image is processed to extract sets of SIFT keypoints, from which the descriptors are defined. Making use of global and local features, a quantization technique, SVMs, and a genetic algorithm that aims at eliminating the false positives, we approached the recognition task with different configurations; the one that obtained the best results was reviewed in the 2012 Robot Vision task of the ImageCLEF international campaign.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Image Analysis</title>
      <p>In this section, we describe the image features that have been used in this work in order to obtain a precise and effective model for the topological localization task. In order to obtain an image representation which captures the essential appearance of the location and is robust to occlusions and changes in image brightness, we compare two different image descriptors and their associated distance measures. In the first case, we use integrated color histograms, and in the second case each image is represented by a set of local scale-invariant features, quantized in bags of visual words.</p>
      <sec id="sec-2-1">
        <title>Global Features</title>
        <p>
          Many recognition systems based on images use global features that describe the entire image: an overall view of the image that is transformed into histograms of frequencies. Adopting the analysis of global features has brought great improvement in robot localization systems, as in [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] or in content-based image retrieval systems, as in the medical image analysis of [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. Such features are important because they produce very compact representations of images, where each image corresponds to a point in a high-dimensional feature space.
        </p>
        <p>In the following, we attempt to model image densities using two different color spaces, RGB and HSV.</p>
      </sec>
      <sec id="sec-2-2">
        <title>RGB (Red, Green, and Blue) Color Model</title>
        <p>The RGB color model is composed of the primary colors Red, Green, and Blue. They are considered the additive primaries, since the colors are added together to produce the desired color. White is produced when all three primary colors are at the maximum light intensity (255). The RGB space has the major deficiency of not being perceptually uniform, which is the motivation for adding HSV color histograms.</p>
        <p>HSV (Hue, Saturation, and Value) Color Model defines colors in terms of three constituent components: hue, saturation, and value (brightness). The hue and saturation components are intimately related to the way the human eye perceives color, because they capture the whole spectrum of colors. The value represents the intensity of a color, which is decoupled from the color information in the represented image. This color model is attractive because color image processing performed independently on the color channels does not introduce false colors (hues). However, it also has the inconvenience of the necessary nonlinearity in forward and reverse transformations with the RGB space.</p>
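        <p>As an aside illustrating the decoupling described above, a standard RGB-to-HSV conversion (sketched here with Python's built-in colorsys module; this is an illustration, not code from our system) maps a uniformly brightened color to the same hue and saturation:</p>

```python
import colorsys

# Two pixels with the same chromaticity but different brightness.
dark = (0.2, 0.4, 0.1)
bright = (0.4, 0.8, 0.2)  # every channel doubled

h1, s1, v1 = colorsys.rgb_to_hsv(*dark)
h2, s2, v2 = colorsys.rgb_to_hsv(*bright)

# Hue and saturation are unchanged; only the value (intensity) differs.
print(round(h1, 6) == round(h2, 6), round(s1, 6) == round(s2, 6))  # True True
print(v1, v2)  # 0.4 0.8
```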
        <p>A color histogram denotes the joint probabilities of the intensities of the three color channels and is computed by discretizing the colors within the image and counting the number of pixels of each color. Since the number of colors is finite, it is usually more convenient to transform the three-channel histogram into a single-variable histogram, therefore a quantization of the histograms is needed. The histogram dimension (the number of histogram bins) n is determined by the color representation scheme and the quantization level. Most color spaces represent a color as a three-dimensional vector with real values (e.g. RGB, HSV). We quantize the color space into k bins for the first axis, l bins for the second axis and m bins for the third axis. The histogram can then be represented as an n-dimensional vector, where n = k × l × m. Because retrieval performance saturates when the number of bins is increased beyond some value, a normalized color histogram difference can be a satisfactory measure of frame dissimilarity, even when colors are quantized into only 64 bins (4 Green × 4 Red × 4 Blue). In conclusion, we chose an 18 × 10 × 10 multidimensional HSV histogram and a 10 × 10 × 10 multidimensional RGB histogram, as the differences between colors of the office environment have a high level of similarity, with only slight changes in hues.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Local Features</title>
        <p>
          A different paradigm is to use local features, which are descriptors of local image neighborhoods computed at multiple interest points. Many local features have been developed in recent years for image analysis, with the outstanding SIFT as the most popular. In the literature there are several works studying the different features and their descriptors; for instance, [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] evaluates the performance of
local descriptors, and [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] shows a study on the performance of different features for object recognition.
        </p>
        <p>
          The three types of features used in our experiments are SIFT (Scale Invariant Feature Transform), ASIFT (Affine Scale Invariant Feature Transform) and RGB-SIFT (RGB Scale Invariant Feature Transform). These features were extracted using [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Also, the localization experiments using these features show
advantages and disadvantages of using one or another.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>SIFT (Scale Invariant Feature Transform) [23, 4, 24]</title>
        <p>
          SIFT features correspond to highly distinguishable image locations which can be detected efficiently and have been shown to be stable across wide variations of viewpoint and scale. The algorithm extracts features that are invariant to rotation and scaling, and partially invariant to changes in illumination and affine transformations. This feature has been explained in our previous work, being one of the key levels of our systems [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>ASIFT (Affine Scale Invariant Feature Transform)</title>
        <p>
          ASIFT, as described in [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], simulates with sufficient accuracy all distortions caused by a variation of the camera's optical-axis direction, and then applies the SIFT method. In other words, ASIFT simulates three parameters: the scale, the camera longitude angle and the latitude angle, and normalizes the other three (translation and rotation), which SIFT lacked.
        </p>
        <p>
          RGB-SIFT (RGB Scale Invariant Feature Transform) descriptors are computed for every RGB channel independently. Each channel is therefore normalized separately, which brings another important property to SIFT: invariance to light color changes. For a color image, the SIFT descriptors are computed independently for each RGB component and concatenated into a 384-dimensional local feature (RGB-SIFT) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
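        <p>The per-channel computation described for RGB-SIFT can be sketched as follows (illustrative only; sift_descriptor stands in for a real 128-dimensional SIFT extractor, which is assumed rather than implemented here):</p>

```python
def l2_normalize(vec):
    """Normalize a descriptor to unit length (zero vectors are left as-is)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else list(vec)

def rgb_sift(channels, sift_descriptor):
    """Compute a 384-d RGB-SIFT descriptor for one keypoint.

    `channels` is an (R, G, B) triple of single-channel images and
    `sift_descriptor` maps one channel to a 128-d SIFT descriptor.
    Each channel is described and normalized independently, then the
    three 128-d vectors are concatenated.
    """
    desc = []
    for channel in channels:
        desc.extend(l2_normalize(sift_descriptor(channel)))
    return desc

# Toy stand-in extractor: a constant 128-d vector per channel.
fake_sift = lambda channel: [float(channel)] * 128
d = rgb_sift((1, 2, 3), fake_sift)
print(len(d))  # 384
```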
      </sec>
      <sec id="sec-2-6">
        <title>Feature Matching</title>
        <p>
          In this subsection we introduce different dissimilarity measures to compare features; that is, a measure of dissimilarity between two features, and thus between the underlying images, is calculated. Many of the features presented are in fact histograms (color histograms, invariant feature histograms). As the comparison of distributions is a well-known problem, many comparison measures have been proposed and compared before [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ].
        </p>
        <p>In the following, dissimilarity measures to compare two histograms H and
K are proposed. Each of these histograms has n bins and Hi is the value of the
i-th bin of histogram H.</p>
        <p>
          – Minkowski-form Distance (the L1 distance is often used for computing dissimilarity between color images, and has also been tested in color histogram comparison [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]):
        </p>
        <p>DLr(H, K) = (Σi |Hi − Ki|^r)^(1/r)   (1)</p>
        <p>
          – Jensen-Shannon Divergence (also referred to as the Jeffrey Divergence [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], it is an empirical extension of the Kullback-Leibler Divergence; it is symmetric and numerically more stable):
        </p>
        <p>DJSD(H, K) = Σi (Hi log(2Hi / (Hi + Ki)) + Ki log(2Ki / (Ki + Hi)))   (2)</p>
        <p>
          – χ² Distance (measures how unlikely it is that one distribution was drawn from the population represented by the other [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]):
        </p>
        <p>Dχ²(H, K) = Σi (Hi − Ki)² / Hi   (3)</p>
        <p>
          – Bhattacharyya Distance [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (measures the similarity of two discrete or continuous probability distributions; for discrete probability distributions H and K over the same domain, it is defined as):
        </p>
        <p>DB(H, K) = −ln Σi √(Hi Ki)   (4)</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Classification</title>
      <p>Many of the features presented in Section 2 are in fact histograms (color histograms, invariant feature histograms, texture histograms, local feature histograms). As the comparison of distributions is a well-known problem, many comparison measures were proposed in Section 2.3. To analyze the different distance measures, we summarize a well-known choice for supervised classification.</p>
      <p>
        Support Vector Machines are state-of-the-art large-margin classifiers, which recently gained popularity within visual pattern and object recognition [
        <xref ref-type="bibr" rid="ref15 ref18 ref40 ref42 ref44 ref8">15, 8, 18, 44, 40, 42</xref>
        ]. Choosing the most appropriate kernel highly depends on the problem at hand, and fine-tuning its parameters can easily become a tedious task. For our experimental setup, we chose the linear kernel (which is trivial and will not be presented), the radial basis function kernel and the χ² kernel, presented below. The Gaussian kernel is an example of a radial basis function kernel; the χ² kernel comes from the χ² distribution.
      </p>
      <p>Kg(x, y) = exp(−‖x − y‖² / (2σ²))</p>
      <p>
        Recent advances in the image recognition field have shown that bag-of-visual-words [
        <xref ref-type="bibr" rid="ref12 ref8">8, 12</xref>
        ] approaches - a strategy that draws inspiration from the text retrieval community - are a good method for many image classification problems. BoVW representations have recently become popular for content-based image classification because of their simplicity and extremely good performance.
      </p>
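      <p>For illustration, the two kernels can be written as follows (a sketch; the exact χ² kernel parameterization is not spelled out above, so the symmetric χ² form and the γ parameter here are assumptions):</p>

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF (Gaussian) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def chi_square_kernel(x, y, gamma=1.0):
    """A common chi-square kernel: exp(-gamma * sum((xi-yi)^2 / (xi+yi)))."""
    d = sum((xi - yi) ** 2 / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0)
    return math.exp(-gamma * d)

h = [0.5, 0.3, 0.2]
print(gaussian_kernel(h, h), chi_square_kernel(h, h))  # 1.0 1.0
```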
      <p>Basically, to give an estimation of the distribution, we create histograms of the local features. The key idea of the bag-of-visual-words representation is to quantize each keypoint into one of the visual words, which are often derived by clustering; typically, k-means clustering is used. The size of the vocabulary k is a user-supplied parameter, and the visual words are the k cluster centers. The baseline of our tests is a bag of visual words with 100 visual words, i.e. 100-means clustering. The resulting k n-dimensional cluster centers cj represent the visual words.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>In this section, we explain the experimental setup, then we present and discuss the results. The different choices of distance measures and classification parameters are analyzed, also performing a comparison with previous work. Conclusions are drawn for the benefit of an accurate solution for topological localization, data modeling and classification.</p>
      <sec id="sec-4-1">
        <title>Datasets (Benchmark)</title>
        <p>The chosen dataset contains images from nine sections of an office, obtained from CLEF (the Conference on Multilingual and Multimodal Information Access Evaluation).</p>
        <p>
          Detailed information about the dataset can be found in the overviews
and ImageCLEF publications [
          <xref ref-type="bibr" rid="ref25 ref30 ref33">30, 33, 25</xref>
          ]. The dataset has already been split into three training sets of images, as shown in Table 1, each different from the others. The provided images are in the RGB color space. The sequences are acquired within the same building and floor, but there can be variations in the lighting conditions (sunny, cloudy, night) or in the acquisition procedure (clockwise and counter-clockwise).
        </p>
        <p>Table 1 lists the nine areas: Corridor, ElevatorArea, LoungeArea, PrinterRoom, ProfessorOffice, StudentOffice, TechnicalRoom, Toilet, VisioConference.</p>
        <p>Finally, a method for the elimination of unwanted results is applied: the retrieved classes for images (Corridor, LoungeArea, etc.) depend on a threshold, and those below this value are rejected, meaning that the system does not recognize the image. This becomes an optimization problem of finding the best value that will cut the unwanted results, considering that it is better to have no results than inconsistent results.</p>
        <p>
          We adapted the implementation of the genetic algorithm described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In order to capture the particularities of the distance measures that are correlated with the rooms on which they are used, we considered a different threshold for each room. As a justification for choosing multiple thresholds rather than a single one, let us consider the case in which we are trying to classify images taken from a room that is more distinguishable from the others. The values returned by the similarity measures when comparing these images to others taken from the same room are further apart from the values returned when comparing them with images taken from different rooms. In contrast, if we consider a room that is visually similar to others, these values will be closer on the real axis. This is why it is harder to correctly separate erroneous classifications from the good ones with a single threshold.
        </p>
        <p>For the genetic algorithm, the chromosomes are vectors of length 9, representing the thresholds for the 9 rooms. For the genetic operators, we used the binary representation of these vectors. The fitness function evaluates the quality of the thresholds and is the measure used to score runs in the Robot Vision task. As a selection strategy, we used rank selection, which sorts the chromosomes according to their value given by the fitness function. In the crossover process, we do not allow the parent chromosomes which are the input for the crossover to be the same individual, as this could lead to early convergence. To prevent this from happening, we first select one chromosome from the population and then run the selection process in a loop until a different chromosome is returned. We also used elitism in order to ensure the survival of the best chromosomes of each generation. In order to balance the diversity of the population, this method is accompanied by a slightly increased mutation probability.</p>
        <p>For these experiments, we used a population of 200 individuals, a mutation probability of 0.15, and a crossover probability of 0.7. The optimization process is stopped after 1000 generations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results Interpretation</title>
        <p>
          We are interested in observing the performance of the final configurations, to see which features/dissimilarity measures lead to good results and which do not. As it is well known that combinations of different methods lead to good results [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], an objective is to combine the briefly presented features. However, it is not obvious how to combine them. To analyze the characteristics of the features and which features have similar properties, we perform an evaluation on selected configurations, as shown in Table 2 (methods: RGB-Only, HSV-Only, RGB-HSV, Basic-BoVW-SIFT, Basic-BoVW-SIFT+HSV+RGB, Basic-BoVW-ASIFT+HSV+RGB, SVM-RBF-BoVW-SIFT+HSV+RGB, SVM-LINEAR-BoVW-SIFT+HSV+RGB, SVM-χ²-BoVW-SIFT+HSV+RGB). The evaluation was performed choosing Training 1 and 3 (Table 1) for training and Training 2 for testing.
        </p>
        <p>The first column gives a description of the used training method. The descriptions of the configurations are straightforward: for example, Basic-BoVW-SIFT+HSV+RGB means a configuration combining RGB and HSV color histograms with Basic-BoVW-SIFT, a bag of visual words formed with SIFT feature vectors. The distance measures were chosen as follows: Jeffrey Divergence for RGB histograms, Bhattacharyya for HSV histograms and Minkowski for SIFT feature vectors. The second column gives the recall values for the training data, and the third the precision. The F-measure is computed and represented in the fourth column of the table. The table also shows that feature selection alone is not sufficient to increase the recognition rate; more flexibility is needed here, and this fact led to different combinations.</p>
        <p>The results are improved by the addition of the SVM classification step. We also note that an SVM classification of SIFT features mapped on visual words reaches at most 52% accuracy, but these results are very reassuring in the context of a configuration that involves other feature descriptors. Thereby, the configuration that combines SIFT words, HSV and RGB histograms, and classification with an SVM with an RBF kernel yielded the most satisfying result.</p>
      </sec>
      <sec id="sec-4-4">
        <title>ImageCLEF 2012 Robot Vision Task</title>
        <p>The fourth edition of the Robot Vision challenge focused on the problem of multi-modal place classification. We had to classify functional areas on the basis of image sequences, captured by a perspective camera and a Kinect mounted on a mobile robot within an office environment with nine rooms. We ranked third out of seven registered groups.</p>
        <p>Final ranking (group, score): 1. CIII UTN FRC, Universidad Tecnologica Nacional, Ciudad Universitaria, Cordoba, Argentina, 2071.0; 2. NUDT, Department of Automatic Control, College of Mechatronics and Automation, National University of Defense Technology, China, 1817.0; 3. Faculty of Computer Science, Alexandru Ioan Cuza University (UAIC), Iasi, Romania, 1348.0; 4. USUroom409, Yekaterinburg, Russian Federation, 1225.0; 5. SKB Kontur Labs, Yekaterinburg, Russian Federation, 1028.0; 6. CBIRITU, Istanbul Technical University, Turkey, 551.0; 7. Intelligent Systems and Data Mining Group (SIMD), University of Castilla-La Mancha, Albacete, Spain, 462.0; 8. BuffaloVision, University at Buffalo, NY, United States, -70.0.</p>
        <p>Our approach to topological localization is currently applied to an office environment of nine sections: Corridor, ProfessorOffice, StudentOffice, LoungeArea, PrinterRoom, Toilet, VisioConference, ElevatorArea and TechnicalRoom. To address the problem of recognizing these sections separately, we approached the classification with specific thresholds for taking the final decision over the selected room. These thresholds create constraints that have to be loosened in order to obtain an accurate result when treating situations of great similarity between two different rooms. One of the main inconveniences that can appear in this case is that the rooms are highly connected, so difficult situations can arise as the robot moves around the office. For example, if the robot is in the Corridor and looks to its right, it sees the LoungeArea, but its position is still in the Corridor. This type of situation creates noise that cannot be neglected; therefore, a proper threshold needs to handle these results so that they correspond to a humanized interaction with the environment. The threshold on the final decision quality was chosen to avoid erroneous localizations, thus favoring a result that does not specify any room and giving fewer correct localizations but also fewer false assumptions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this work, we approached the task of topological localization without using the temporal continuity of the images, involving a broad variety of features for image recognition. The provided information about the environment is contained in images taken with a perspective color camera mounted on a robot platform, and it represents an office environment dataset offered by ImageCLEF.</p>
      <p>The main contribution of this work lies in quantifiable examinations of a wide variety of different configurations for a computer vision-based system, and in significant results. The experiments show that configurations built from different feature descriptors and distance measures depend on the proper combinations.</p>
      <p>Given that most of the works cited are from the last couple of years, topological localization is a new and active area of research, which is generating increasing interest and encouraging further development. An important contribution to this field is given in this paper, along with notable experimental results, but there is still room for improvement and further research.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>The research presented in this paper was funded by the Sector Operational Program for Human Resources Development through the project “Development of the innovation capacity and increasing of the research impact through postdoctoral programs” POSDRU/89/1.5/S/49944.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Iftene</surname>
          </string-name>
          . UAIC:
          <article-title>Participation in ImageCLEF 2009 RobotVision task</article-title>
          .
          <source>Proceedings of the CLEF 2009 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Iftene</surname>
          </string-name>
          .
          <article-title>Using SIFT method for global topological localization for indoor environments</article-title>
          .
          <source>Multilingual Information Access Evaluation II. Multimedia Experiments [Lecture Notes in Computer Science Volume 6242 Part II]</source>
          ,
          <volume>6242</volume>
          :
          <fpage>277</fpage>
          {
          <fpage>282</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Munoz</surname>
          </string-name>
          .
          <article-title>Image classification using random forests and ferns</article-title>
          .
          <source>ICCV</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Brown</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Invariant features from interest point groups</article-title>
          .
          <source>The 13th British Machine Vision Conference</source>
          , Cardiff University, UK, pages
          <fpage>253</fpage>
          -
          <lpage>262</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Burghouts</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Geusebroek</surname>
          </string-name>
          .
          <article-title>Performance evaluation of local color invariants</article-title>
          .
          <source>CVIU</source>
          ,
          <volume>113</volume>
          (
          <issue>1</issue>
          ):
          <fpage>48</fpage>
          -
          <lpage>62</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Feature extraction based on the Bhattacharyya distance</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>36</volume>
          :
          <fpage>1703</fpage>
          -
          <lpage>1709</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H.</given-names>
            <surname>Choset</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagatani</surname>
          </string-name>
          .
          <article-title>Topological simultaneous localization and mapping (SLAM): toward exact localization without explicit localization</article-title>
          .
          <source>IEEE Trans. Robot. Automat.</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Willamowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          .
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>ECCV International Workshop on Statistical Learning in Computer Vision</source>
          , Prague,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keysers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>Features for image retrieval: An experimental comparison</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G.</given-names>
            <surname>Dudek</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jugessur</surname>
          </string-name>
          .
          <article-title>Robust place recognition using local appearance based methods</article-title>
          .
          <source>IEEE Intl. Conf. on Robotics and Automation (ICRA)</source>
          , pages
          <fpage>1030</fpage>
          -
          <lpage>1035</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Iftene</surname>
          </string-name>
          .
          <article-title>Using a genetic algorithm for optimizing the similarity aggregation step in the process of ontology alignment</article-title>
          .
          <source>Proceedings of the 9th International Conference RoEduNet, IEEE</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>D.</given-names>
            <surname>Gokalp</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Aksoy</surname>
          </string-name>
          .
          <article-title>Scene classification using bag-of-regions representations</article-title>
          .
          <source>Proceedings of CVPR, pages 1-8</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Gonzalez-Barbosa</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          .
          <article-title>Rover localization in natural environments by indexing panoramic images</article-title>
          .
          <source>Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA)</source>
          , pages
          <fpage>1365</fpage>
          -
          <lpage>1370</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samangooei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dupplaw</surname>
          </string-name>
          .
          <article-title>OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images</article-title>
          .
          <source>ACM Multimedia</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ke</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          .
          <article-title>PCA-SIFT: A more distinctive representation for local image descriptors</article-title>
          .
          <source>Proceedings of the Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>511</fpage>
          -
          <lpage>517</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kittler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hatef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P. W.</given-names>
            <surname>Duin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          .
          <article-title>On combining classifiers</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>226</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Kurhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Satonka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Khanale</surname>
          </string-name>
          .
          <article-title>Color matching of images by using Minkowski-form distance</article-title>
          .
          <source>Global Journal of Computer Science and Technology, Global Journals Inc. (USA)</source>
          ,
          <volume>11</volume>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>D.</given-names>
            <surname>Larlus</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jurie</surname>
          </string-name>
          .
          <article-title>Latent mixture vocabularies for object categorization</article-title>
          .
          <source>BMVC</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>D.</given-names>
            <surname>Larlus</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jurie</surname>
          </string-name>
          .
          <article-title>Latent mixture vocabularies for object categorization and segmentation</article-title>
          .
          <source>Journal of Image &amp; Vision Computing</source>
          ,
          <volume>27</volume>
          (
          <issue>5</issue>
          ):
          <fpage>523</fpage>
          -
          <lpage>534</lpage>
          ,
          April
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>N.</given-names>
            <surname>Lazic</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Aarabi</surname>
          </string-name>
          .
          <article-title>Importance of feature locations in bag-of-words image classification</article-title>
          .
          <source>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <volume>1</volume>
          :
          <fpage>I641</fpage>
          -
          <lpage>I644</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ledwich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Reduced SIFT features for image retrieval and indoor localisation</article-title>
          .
          <source>Australasian Conf. on Robotics and Automation</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>H.</given-names>
            <surname>Lejsek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Asmundsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thor-Jonsson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Amsaleg</surname>
          </string-name>
          .
          <article-title>Scalability of local image descriptors: A comparative study</article-title>
          .
          <source>ACM Int. Conf. on Multimedia</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>Proceedings of the 7th International Conference on Computer Vision</source>
          , pages
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>W.</given-names>
            <surname>Lucetti</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Luchetti</surname>
          </string-name>
          .
          <article-title>Combination of classifiers for indoor room recognition, CGS participation at ImageCLEF 2010 Robot Vision task</article-title>
          .
          <source>Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>An affine invariant interest point detector</article-title>
          .
          <source>Proceedings of the 7th European Conference on Computer Vision</source>
          , pages
          <fpage>128</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Scale &amp; affine invariant interest point detectors</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Harzallah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          .
          <article-title>Learning object representations for visual object class recognition</article-title>
          .
          <source>Visual Recognition Challenge</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>J.</given-names>
            <surname>Morel</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>ASIFT: A new framework for fully affine invariant image comparison</article-title>
          .
          <source>SIAM Journal on Imaging Sciences</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>438</fpage>
          -
          <lpage>469</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>A.</given-names>
            <surname>Pronobis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Mozos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Jensfelt</surname>
          </string-name>
          .
          <article-title>Multi-modal semantic place classification</article-title>
          .
          <source>Int. J. Robot. Res.</source>
          ,
          <volume>29</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>298</fpage>
          -
          <lpage>320</lpage>
          ,
          February
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>J.</given-names>
            <surname>Puzicha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rubner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Buhmann</surname>
          </string-name>
          .
          <article-title>Empirical evaluation of dissimilarity measures for color and texture</article-title>
          .
          <source>Proc. International Conference on Computer Vision</source>
          , Vol.
          <volume>2</volume>
          ,
          pages
          <fpage>1165</fpage>
          -
          <lpage>1173</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rubner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          .
          <article-title>The earth mover's distance as a metric for image retrieval</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>99</fpage>
          -
          <lpage>121</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>O.</given-names>
            <surname>Saurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fraundorfer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollefeys</surname>
          </string-name>
          .
          <article-title>Visual localization using global visual features and vanishing points</article-title>
          .
          <source>Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Shyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Brodley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Kak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kosaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aisen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Broderick</surname>
          </string-name>
          .
          <article-title>Local versus global features for content-based image retrieval</article-title>
          .
          <source>Proc. IEEE Workshop of Content-Based Access of Image and Video Databases</source>
          , pages
          <fpage>30</fpage>
          -
          <lpage>34</lpage>
          ,
          June
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Video Google: A text retrieval approach to object matching in videos</article-title>
          .
          <source>Proceedings of the 9th International Conference on Computer Vision</source>
          , pages
          <fpage>1470</fpage>
          -
          <lpage>1478</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          .
          <article-title>Finding images of landmarks in video sequences</article-title>
          .
          <source>Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <given-names>S.</given-names>
            <surname>Thrun</surname>
          </string-name>
          .
          <article-title>Learning metric-topological maps for indoor mobile robot navigation</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>99</volume>
          :
          <fpage>21</fpage>
          -
          <lpage>71</lpage>
          ,
          February
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <given-names>I.</given-names>
            <surname>Ulrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Nourbakhsh</surname>
          </string-name>
          .
          <article-title>Appearance-based place recognition for topological localization</article-title>
          .
          <source>IEEE Intl. Conf. on Robotics and Automation</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Statistical learning theory</article-title>
          .
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <given-names>C.</given-names>
            <surname>Wallraven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Graf</surname>
          </string-name>
          .
          <article-title>Recognition with local features: the kernel recipe</article-title>
          .
          <source>Proc. ICCV</source>
          , pages
          <fpage>257</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <given-names>J.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Burgard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Burkhardt</surname>
          </string-name>
          .
          <article-title>Robust vision-based localization for mobile robots using an image retrieval system based on invariant features</article-title>
          .
          <source>Proc. of the IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Shashua</surname>
          </string-name>
          .
          <article-title>Kernel principal angles for classification machines with applications to image sequence interpretation</article-title>
          .
          <source>Proc. CVPR, I:635-640</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Ngo</surname>
          </string-name>
          .
          <article-title>Evaluating bag-of-visual-words representations in scene classification</article-title>
          .
          <source>ACM MIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marszalek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Local features and kernels for classification of texture and object categories: A comprehensive study</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>73</volume>
          (
          <issue>2</issue>
          ):
          <fpage>213</fpage>
          -
          <lpage>238</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>