Visual Language Modeling for Mobile Localization
LIG participation at RobotVision'09

Trong-Ton Pham1, Loïc Maisonnasse2, Philippe Mulhem1
1 Laboratoire Informatique de Grenoble (LIG)
2 Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS)
ttpham@imag.fr, loic.maisonnasse@insa-lyon.fr, mulhem@imag.fr

Abstract

This working note presents our novel approach for scene recognition (i.e. the localization of a mobile robot using visual information) in the RobotVision task [1], based on the language model [2]. Language models have been used successfully in information retrieval (specifically for textual retrieval). In a recent study [3], this model has also shown good performance in modeling visual information. It can therefore be used to address several problems in image understanding, such as scene recognition and image retrieval. We have developed a visual language framework to participate in the RobotVision'09 task this year. This framework consists of 3 principal components: a training step, a matching step and a post-processing step. Finally, we present the results of our approach on both the validation set and the test set released by the ImageCLEF organizers.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Algorithms, Theory

Keywords
Information retrieval, visual language model, late fusion

1 Introduction

This year is the first year of the RobotVision track [1] and of the LIG participation in this track. The main task is to determine the location of a mobile robot within a known environment based on visual information. The difficulty of this task is that the robot has to recognize the room under different illumination conditions and adapt as the environment changes (moving people or objects, new furniture added over time, etc.). This poses a problem for a visual recognition system, since the training data are usually acquired at a fixed point in time. Meanwhile, the system has to provide the location of the robot in real time and over different time spans (from 6 months to 20 months).

Over the years, several classical approaches in computer vision have been proposed for this problem. In [4], the authors suggested an appearance-based method using Support Vector Machines (SVM) to cope with illumination and pose changes. This method achieved satisfactory performance when the time interval between the training and testing phases was short. Another possible approach is to detect interest points (such as SIFT or Harris-Laplace points) and perform a topological matching of these points [5]. This approach is simple but quite effective for recognizing some types of objects (for example buildings, cars, motorbikes, etc.). However, this method depends heavily on the quality of the detected interest points.

To participate in this competition, we reuse our visual language approach presented in [3], with enhancements to cope with the specific conditions of this task. Our model has shown good robustness and adaptability to different kinds of image representations as well as different types of visual features. With the graph representation, we have introduced another layer of image understanding that is closer to the semantic layer. Moreover, the graph model integrates well with the foundation of the standard language model [2], as shown in [6]. The most likely class for each image is computed based on its likelihood value.
We also employ the Kullback-Leibler divergence, as proposed in classical language modeling approaches. In order to enhance the classification quality, we perform some post-processing of the ranked results based on their relevance values. The validation process has shown good performance of our system under different weather conditions over a time span of 6 months.

The rest of this paper is organized as follows: Section 2 presents our visual language approach for modeling a scene. Section 3 describes the validation process based on the training and validation sets. Section 4 reports the runs submitted to ImageCLEF for evaluation. The paper then concludes with a discussion and future directions of our work.

2 Our approach

We applied our visual language modeling framework to the competition this year. Our model has been shown to work well for scene recognition, as it takes advantage of the robust foundation of the standard language model in the IR field.

2.1 Image modeling

2.1.1 Image representation

We used 2 types of image representation in order to capture different visual information in the image content:

• Regular patch: images are divided into patches of regular size. In order to make the image representation robust to changes of the camera zoom, we apply a multi-scale partition of the images into 5x5 patches and 10x10 patches.

• Interest point: invariant keypoints are detected using Lowe's interest point detector. These keypoints are invariant to affine transformations and illumination. Local features are then extracted for each keypoint.

2.1.2 Feature extraction

From the validation set, we learned that color information performs quite badly when the illumination changes. Under the same lighting condition, the color histogram can give good results. However, with an abrupt change of lighting conditions (such as training on night images and testing on sunny images), the system fails to make satisfactory judgments. We therefore decided to represent the visual content only with features that are less sensitive to illumination. We extracted the following features in our experiments:

• HSV color histogram: we extract the color information from the HSV color space. Each patch is represented by a vector of 512 dimensions.

• Multi-scale Canny edge histogram: we use the Canny operator to detect the contours of objects, as presented in [7]. An 80-dimensional vector captures the magnitudes and gradients of the contours for each patch. We capture this information at 2 different scales of the image (10x10 patches and 5x5 patches).

• Color SIFT: SIFT features are extracted using D. Lowe's detector [8]. The region around each keypoint is described by a 128-dimensional vector for each of the R, G and B channels.

2.1.3 Visual vocabulary construction

Based on the analogy between image and text (i.e. visual word - word), for each feature we construct a visual vocabulary of 500 visual words using the k-means clustering algorithm. Each visual word is treated as a concept c. Each image is then represented with these concepts, which we use to build our language models in the next step. A sketch of this construction step is given below.
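The following sketch illustrates the vocabulary construction and the mapping of an image to a weighted concept set. It is a minimal example, assuming the local descriptors (patch or keypoint features) have already been extracted; the use of scikit-learn and the function names are our own choices for illustration, not the original implementation.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=500, seed=0):
    # Cluster the pooled local descriptors (one row per patch/keypoint)
    # from the training images into n_words visual words.
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    kmeans.fit(descriptors)
    return kmeans  # the cluster centres play the role of the visual vocabulary

def image_to_concepts(kmeans, image_descriptors):
    # Assign each local descriptor to its nearest visual word (a concept c)
    # and return the weighted concept set {c: w(c, d)} of the image.
    words = kmeans.predict(image_descriptors)
    concepts, counts = np.unique(words, return_counts=True)
    return dict(zip(concepts.tolist(), counts.tolist()))

One vocabulary is built per feature type (HSV histogram, edge histogram, color SIFT), so each image ends up with one weighted concept set per feature.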
2.2 Visual language modeling

In [3], we presented the image as a probabilistic graph which allows capturing the visual complexity of an image. Images are represented by a set of weighted concepts, connected through a set of directed associations. The concepts aim at characterizing the content of the image, whereas the associations express the spatial relations between concepts. Our assumption is that the concepts are represented by non-overlapping regions extracted from the images.

In this competition we used a reduced version of this model: we do not take into account the relationships between concepts. We thus assume that each document image d (and, equivalently, each query image q) is represented by a set of weighted concepts W_C. Each concept corresponds to a visual word used to represent the image, and its weight captures the number of occurrences of this concept in the image. Denoting C the set of concepts over the whole collection, W_C can be defined as a set of pairs (c, w(c, d)), where c is an element of C and w(c, d) is the number of times c occurs in the document image d.

2.2.1 Language model

We rely on a language model defined over concepts, as proposed in [6], which we refer to as the Conceptual Unigram Model. We assume that a query q or a document d is composed of a set W_C of weighted concepts, each concept being conditionally independent of the others. Contrary to [6], which computes a query likelihood, we compute the relevance status value rsv of a document image d for a query q using the Kullback-Leibler divergence between the document model M_d estimated over the document image d and the query model M_q estimated over the query image q. Relying on the concept independence hypothesis, this leads to:

  RSV_{kld}(q, d) = -D(M_q \| M_d)                                                                              (1)
                  = -\sum_{c_i \in C} P(c_i|M_q) \log \frac{P(c_i|M_q)}{P(c_i|M_d)}                             (2)
                  = \sum_{c_i \in C} P(c_i|M_q) \log P(c_i|M_d) - \sum_{c_i \in C} P(c_i|M_q) \log P(c_i|M_q)   (3)

where P(c_i|M_d) and P(c_i|M_q) are the probabilities of the concept c_i in the models estimated over the document d and the query q respectively. Since the last term of the decomposition corresponds to the query entropy and does not affect the document ranking, we only compute:

  RSV_{kld}(q, d) \propto \sum_{c_i \in C} P(c_i|M_q) \log P(c_i|M_d)                                           (4)

The quantity P(c_i|M_d) is estimated through maximum likelihood (as is standard in the language modeling approach to IR), using Jelinek-Mercer smoothing:

  P(c_i|M_d) = (1 - \lambda_u) \frac{F_d(c_i)}{F_d} + \lambda_u \frac{F_C(c_i)}{F_C}                            (5)

where F_d(c_i) is the sum of the weights of c_i in all graphs of the document image d, and F_d is the sum of all the concept weights in d. The functions F_C are defined similarly, but over the whole collection (i.e. over the union of all the images of all the documents of the collection). The parameter \lambda_u controls the Jelinek-Mercer smoothing. It plays a role similar to an IDF parameter, and helps to take reliable information into account when the information from a given document is scarce. The quantity P(c_i|M_q) is estimated through maximum likelihood without smoothing on the query:

  P(c_i|M_q) = \frac{F_q(c_i)}{F_q}                                                                             (6)

where F_q(c_i) is the sum of the weights of c_i in all graphs of the query image q, and F_q is the sum of all the concept weights in q. The final result for each query image is a ranked list of documents associated with their rsv values.
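A minimal sketch of the scoring function of equations (4)-(6) is given below, operating on the weighted concept sets produced earlier. The function name and the default smoothing value lam=0.5 are our own assumptions; the paper does not report the value of \lambda_u that was used.

import math

def rsv_kld(query_counts, doc_counts, coll_counts, lam=0.5):
    # Rank score of eq. (4): sum over query concepts of P(c|Mq) * log P(c|Md),
    # with P(c|Md) smoothed by Jelinek-Mercer (eq. 5) and P(c|Mq) the
    # unsmoothed maximum-likelihood estimate (eq. 6).
    fq = sum(query_counts.values())   # F_q
    fd = sum(doc_counts.values())     # F_d
    fc = sum(coll_counts.values())    # F_C, computed over the whole collection
    score = 0.0
    for c, wq in query_counts.items():
        p_q = wq / fq                                       # P(c|Mq)
        p_d = ((1 - lam) * doc_counts.get(c, 0) / fd
               + lam * coll_counts.get(c, 0) / fc)          # P(c|Md)
        if p_d > 0:   # concepts unseen in the whole collection contribute nothing
            score += p_q * math.log(p_d)
    return score

Scoring every training image against a test image with this function yields the ranked list used in the querying step below.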
2.2.2 Querying

Using this model, we query the training set with each test image using one type of concepts (i.e. the concepts obtained with one feature). For each test image we thus obtain a list, standard in IR, that contains all the training set images ranked according to the rsv defined in the previous part. This list can be represented as:

  IL_q = [(d, rsv(q, d))]                                                               (7)

where IL_q is a ranked list of images for the query q, d is one image of the training set, and rsv(q, d) is the rsv computed for this query and document image. Assuming a function that returns the room id of each training image, we can obtain the room id of any image in the ranked list. In our basic approach, we then associate the query image with the room id of the best-ranked image. Since an image can be represented with different features, and since each room has more than one image in the training set, we present in the following some post-processing steps that take advantage of these considerations.

2.3 Post-processing of the results

We perform some fine-tuning steps on these results in order to enhance the accuracy of our system, as presented in Figure 1.

[Figure 1: Post-processing steps of the results. (1) is the scheme for the obligatory track and (2) is the scheme for the optional track.]

• Linear fusion: we take advantage of the different features extracted from the images. We represent an image by a set of concept sets C_i, each C_i corresponding to a visual feature. Assuming that all the concept sets are independent of one another, we fuse the Kullback-Leibler divergences of the individual sets of concepts using a sum:

  RSV(Q, D) = \sum_i RSV_{kld}(q_i, d_i)                                                (8)

where Q = {q_i} and D = {d_i} are the sets of concept sets corresponding to the query image and to the document image respectively.

• Regrouping training images by their room: since using only the closest training image to determine the room of a query image is not reliable enough, we propose to group the results of the n best images for each room. We compute a ranked list of rooms RL instead of an image list:

  RL_q = [(r, RSV_r(q, r))]                                                             (9)

with

  RSV_r(q, r) = \sum_{d \in f_{n-best}(q, r)} RSV(q, d)                                 (10)

where r corresponds to a room and f_{n-best} is a function that selects the n images with the best RSV belonging to the room r.

• Filtering the unknown room: we measure the difference between the score of the 4th room and that of the 1st room in the room list RL. If the difference is large enough (> a threshold β) we keep this image; otherwise we remove it from the list (i.e. we consider it an unknown room). In our experiments, we fixed the value β = 0.003. A sketch of the regrouping and filtering steps is given after this list.

• Smoothing window: we exploit the continuity of a sequence of images by smoothing the results in the temporal direction. To do so, we slide a smoothing window over the classified image sequence. Here, we choose the window width w = 40 (i.e. 20 images before and after the classified image). As a result, the score of the smoothed image is the mean value of its neighboring images:

  RSV_{window}(Q_i, R) = \frac{1}{w} \sum_{j \in [i-w/2; i+w/2]} RSV(Q_j, R)            (11)

where w is the width of the smoothing window. In the real case, we could only use a semi-window which considers only the images before the current classified image. This leads to:

  RSV_{semi-window}(Q_i, R) = \frac{1}{w} \sum_{j \in [i-w; i]} RSV(Q_j, R)             (12)

where w is the width of the semi-window.
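Below is a minimal sketch of the regrouping (eq. 9-10) and filtering steps under the assumptions of this section. The helper names and the room_of mapping are hypothetical, and the fused RSV values of eq. (8) are taken as input.

from collections import defaultdict

def room_ranking(ranked_images, room_of, n_best=15):
    # Eq. (9)-(10): accumulate, for each room, the scores of its n best images.
    # ranked_images is the fused list [(d, RSV(q, d)), ...]; room_of maps a
    # training image id to its room id.
    per_room = defaultdict(list)
    for d, rsv in ranked_images:
        per_room[room_of[d]].append(rsv)
    scores = {r: sum(sorted(v, reverse=True)[:n_best]) for r, v in per_room.items()}
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def decide_room(room_list, beta=0.003):
    # Filtering step: answer only if the gap between the scores of the 1st
    # and 4th rooms is large enough, otherwise label the image 'unknown'.
    if len(room_list) >= 4 and room_list[0][1] - room_list[3][1] <= beta:
        return "unknown"
    return room_list[0][0]

The temporal smoothing of eq. (11)-(12) can then be applied to the per-room scores of consecutive query images.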
3 Validating process

The validation aims at evaluating the robustness of the algorithms to the visual variations that occur over time due to changing conditions and human activity. We trained our system on the night3 condition set and tested it against all the other conditions of the validation set. Our objective was to understand the behavior of our system under changing conditions and with different types of features. Moreover, the validation process helped us to fine-tune the model parameters that were later used for the official test.

We built 3 different language models corresponding to the 3 types of visual features, using night3 as the training set. The models Mc and Me correspond to the color histogram and the edge histogram extracted from the images with a 5x5 patch division. The model Ms corresponds to the color SIFT feature extracted from interest points. We measure the performance of the system using the accuracy rate. A summary of the results is reported in Table 1.

Table 1: Results obtained under different conditions with the 3 visual language models (Mc, Me, Ms)

  Train    Test      HSV (Mc)   Edge (Me)   SIFT color (Ms)
  night3   night2    84.24%     59.45%      79.20%
  night3   cloudy2   39.33%     58.62%      60.60%
  night3   sunny2    29.04%     52.37%      54.78%

We noticed that, under the same condition (e.g. night-night), the HSV color histogram model Mc outperformed all the other models. However, under different conditions, the results of this model dropped significantly (from 84% to 29%). This shows that the color information is very sensitive to changes in the illumination conditions. On the other hand, the edge model (Me) and the color SIFT model (Ms) are fairly robust to changing conditions. Under the worst condition (night-sunny), we still obtained a reasonably good recognition rate of 52% for Me and 55% for Ms. As a result, the edge histogram and the SIFT feature were chosen as the appropriate features for our recognition system. The results of the post-processing steps based on the ranked lists of Me and Ms are given in Table 2.

Table 2: Results of the post-processing steps based on the 2 models Me and Ms

  Train    Test     Fusion   Regrouping   Filtering       Smoothing
  night3   sunny2   62%      67% (n=15)   72% (β=0.003)   92% (k=20)

The fusion of these 2 models gives an overall improvement of 8%. The regrouping step, as expected, helped to promote some prominent rooms in the score list by aggregating each room's n-best scores. The filtering step eliminates some of the uncertain decisions based on the difference of their scores after the regrouping step. Finally, the smoothing step (which is optional) significantly increases the performance on a sequence of images, by 20 additional points.

4 Description of submitted runs

For the official test, we constructed 3 models based on the validation process. We eliminated the HSV histogram model because of its poor performance under changing lighting conditions, and because there was little chance that the test would be captured under the same condition as the training. We used the same visual vocabularies of 500 visual concepts generated from the night3 set. Each model provided a ranked result for the released test sequence. The post-processing steps were performed as in the validation process, with the same parameters. The visual language models built for the competition are:

• Me1: visual language model based on the edge histogram extracted from a 10x10 patch division

• Me2: visual language model based on the edge histogram extracted from a 5x5 patch division

• Ms: visual language model based on color SIFT local features

Our tests were performed on a quad-core 2.00 GHz computer with 8 GB of memory. The training took about 3 hours on the whole night3 set. Classification of the test sequence was executed in real time. Based on the 3 visual models constructed, we submitted 6 runs to the ImageCLEF evaluation.
• 01-LIG-Me1Me2Ms: linear fusion of the results coming from the 3 models (Score = 328)

• 02-LIG-Me1Me2Ms-Rk15: re-ranking of the results of 01-LIG-Me1Me2Ms with the regrouping of the top 15 scores for each room (Score = 415)

• 03-LIG-Me1Me2Ms-Rk15-Fil003: if the difference between the scores of the 1st and the 4th rooms in the ranked list is too small (i.e. below β = 0.003), we remove the image from the list; in these cases we refrain from making a decision and mark the image as an unknown room (Score = 456.5)

• 04-LIG-Me1Me2Ms-Rk2-Diff20: re-ranking of the results of 01-LIG-Me1Me2Ms with the regrouping of the top 2 scores for each room and a smoothing window (±20 images around each frame) to update the room id over the image sequence (Score = 706)

• 05-LIG-Me1Ms-Rk15: same as 02-LIG-Me1Me2Ms-Rk15 but with the fusion of only 2 types of image representation, Me1 and Ms (Score = 25)

• 06-LIG-Me1Ms-Rk2-Diff20: same as 04-LIG-Me1Me2Ms-Rk2-Diff20 but with the fusion of the 2 models Me1 and Ms (Score = 697)

Note: runs 04-LIG-Me1Me2Ms-Rk2-Diff20 and 06-LIG-Me1Ms-Rk2-Diff20 are invalid, as we used the images after the classified image for the smoothing window.

Our best run for the obligatory track, 03-LIG-Me1Me2Ms-Rk15-Fil003, is ranked 12th among the 21 runs submitted overall. Run 04-LIG-Me1Me2Ms-Rk2-Diff20 did not meet the criteria of the optional task, which only allows the use of the sequence before the classified image. Nevertheless, it improved on the best obligatory run by roughly 250 points. This means that we still have room to improve the performance of our system with a valid smoothing window.

5 Conclusion

In this paper, we have presented a novel approach for the localization of a mobile robot using visual language modeling. Theoretically, this model fits within the standard language modeling approach, which is well developed for IR. At the same time, it helps to capture the generality of the visual concepts associated with the regions of a single image or of a sequence of images. The validation process has demonstrated a good recognition rate of our system under different illumination conditions. We believe that a good extension of this model is possible in real scenarios of scene recognition (more precisely, for robot self-localization). With the addition of more visual features, the enhancement of the system's robustness and the choice of the right parameters, this could be the basis of a future recognition system.

Acknowledgment

This work was partly supported by the French National Agency of Research (ANR-06-MDCA-002) and by the Région Rhône-Alpes (LIMA project).

References

[1] B. Caputo, A. Pronobis, and P. Jensfelt. Overview of the CLEF 2009 Robot Vision track. In CLEF Working Notes 2009, Corfu, Greece, 2009.

[2] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, 1998.

[3] T. T. Pham, L. Maisonnasse, P. Mulhem, and E. Gaussier. Visual language model for scene recognition. In Proceedings of SinFra'2009, Singapore, 2009.

[4] A. Pronobis, O. Martínez Mozos, and B. Caputo. SVM-based discriminative accumulation scheme for place recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'08), 2008.

[5] David G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, 1999.

[6] L. Maisonnasse, E. Gaussier, and J.-P. Chevallet. Model fusion in conceptual language modeling. In 31st European Conference on Information Retrieval (ECIR'09), pages 240–251, 2009.
[7] Chee Sun Won, Dong Kwon Park, and Soo-Jun Park. Efficient use of MPEG-7 edge histogram descriptor. ETRI Journal, vol. 24, no. 1, 2002.

[8] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, pages 91–110, 2004.