Combining text/image in WikipediaMM task 2009

Christophe Moulin, Cécile Barat, Cédric Lemaître, Mathias Géry, Christophe Ducottet, Christine Largeron

Université de Lyon, F-69003, Lyon, France
Université de Saint-Étienne, F-42000, Saint-Étienne, France
CNRS UMR 5516, Laboratoire Hubert Curien
{christophe.moulin, cecile.barat, cedric.lemaitre, mathias.gery, ducottet, largeron}@univ-st-etienne.fr

Abstract. This paper reports the multimedia information retrieval experiments we carried out for the ImageCLEF 2009 track. In 2008, we proposed a multimedia document model defined as a vector of textual and visual terms weighted using a tf.idf approach [5]. For our second participation, our goal was to improve this previous model in the following ways: 1) use of additional information for the textual part (legend and image bounding text extracted from the original documents), 2) use of different image detectors and descriptors, 3) a new text/image combination approach. The results allow us to evaluate the benefits of these different improvements.

1 Introduction

ImageCLEFwiki is a multimedia collection in which each document is composed of text and one image. User needs are represented by queries ("topics"), which are also multimedia. A multimedia document model is therefore necessary to handle such a collection. In 2008, we proposed a first model that combines text and image information for multimedia retrieval [5]. This year, we improve this model by adding textual information and by using different detectors and descriptors for the visual information. Moreover, we use a linear combination to merge our textual and visual results. After presenting our model, we explain the submitted runs and the obtained results, and we finish by introducing our future work.

2 Visual and textual document model

The document model we defined for ImageCLEF 2008 lets us rank documents with respect to a query using different methods. Firstly, we explain the key features of our approach for ranking documents according to a query using only textual information. Secondly, we describe how we extend the method to handle the visual information. Finally, we present our method for combining textual and visual results.

2.1 Textual representation model

As in the vector space model introduced by Salton et al. [7], we represent a document d_i as a vector of weights w_{i,j}. Each w_{i,j} corresponds to the importance of the term t_j in the document d_i, computed by multiplying tf_{i,j} and idf_j, where tf_{i,j} is the term frequency that characterizes the frequency of the term t_j in the document d_i and idf_j is the inverse document frequency that quantifies the importance of the term t_j over the corpus of documents. w_{i,j} is high when the term t_j is frequent in the document d_i but rare in the others. We use the tf_{i,j} and idf_j defined in the Okapi formula by Robertson et al. [6]:

  tf_{i,j} = \frac{k_1 \, n_{i,j}}{n_{i,j} + k_2 \left(1 - b + b \frac{|d_i|}{d_{avg}}\right)}

where n_{i,j} is the number of occurrences of the term t_j in the document d_i, |d_i| the size of the document d_i, d_{avg} the average size of all documents in the corpus, and k_1, k_2 and b three constants;

  idf_j = \log \frac{|D| - |\{d_i \mid t_j \in d_i\}| + 0.5}{|\{d_i \mid t_j \in d_i\}| + 0.5}

where |D| is the size of the corpus and |\{d_i \mid t_j \in d_i\}| the number of documents in which the term t_j occurs at least once.

If we consider a query q_k as a short document, we can represent it as a vector of weights as well. A score between the query q_k and a document d_i is then computed as shown in Table 1. The main difference between score^1 and score^2 is the representation of the query: in the first score, the weight of a query term is defined by its tf only, while in the second score it is equal to its tf.idf. Note that for tf_{k,j}, b = 0 because |d_k| and d_{avg} are not defined for a query.

Table 1. Scoring equations and their default parameters [8].

  Scoring equation                                                                        tf_{i,j} parameters              tf_{k,j} parameters
  score^1(q_k, d_i) = \sum_{t_j \in q_k} tf_{i,j} \cdot idf_j \cdot tf_{k,j}              k_1 = 2.2, k_2 = 1.2, b = 0.75   k_1 = 8, k_2 = 7, b = 0
  score^2(q_k, d_i) = \sum_{t_j \in q_k} tf_{i,j} \cdot idf_j \cdot tf_{k,j} \cdot idf_j  k_1 = 1, k_2 = 1, b = 0.5        k_1 = 1, k_2 = 1, b = 0
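To make the weighting above concrete, the short Python sketch below computes the Okapi tf_{i,j}, idf_j and the score^2 of Table 1 on a toy tokenised corpus. It is only an illustration under our own function names and with the default parameters of the second row of Table 1; the submitted runs themselves are produced with the Lemur software [8].

```python
import math
from collections import Counter

def okapi_tf(n, doc_len, avg_len, k1=1.0, k2=1.0, b=0.5):
    """Okapi term frequency of Sect. 2.1: k1*n / (n + k2*(1 - b + b*|d|/d_avg))."""
    return (k1 * n) / (n + k2 * (1 - b + b * doc_len / avg_len))

def okapi_idf(df, n_docs):
    """Okapi inverse document frequency for a term occurring in df documents."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def score2(query, docs):
    """score^2 of Table 1: query terms weighted by tf.idf (b = 0 on the query side)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))        # document frequencies
    q_tf = Counter(query)
    scores = []
    for d in docs:
        d_tf = Counter(d)
        s = 0.0
        for t in set(query):
            if d_tf[t] == 0:
                continue                                 # absent term: no contribution
            idf = okapi_idf(df[t], n_docs)
            tf_d = okapi_tf(d_tf[t], len(d), avg_len)               # k1 = k2 = 1, b = 0.5
            tf_q = okapi_tf(q_tf[t], len(query), avg_len, b=0.0)    # b = 0 for queries
            s += tf_d * idf * tf_q * idf
        scores.append(s)
    return scores

# Toy example: three tokenised documents and a two-term query.
docs = [["desert", "dune", "sand"], ["mountain", "snow", "peak", "snow"], ["city", "street"]]
print(score2(["sand", "dune"], docs))    # highest score for the first document
```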
Different sources of text are available. The legend provided with images is often very short and sometimes useless, for example when it only deals with the copyright of the image or gives details about the user who uploaded it. In order to gain information, we use the original text extracted from the wikipedia documents in which the images appear: we consider a text fragment around the image, whose window size is tuned using the wikipediaMM 2008 collection as a training collection. We add this text to the legend of the image and index both the added text and the original legend. The indexing is performed with the Lemur software [8].

2.2 Visual representation model

In order to combine the visual information with the textual one, we also represent images as vectors of weights. Provided we are able to extract visual words from images, it is possible to use the tf.idf formula in the same way as in the textual model. It is therefore necessary to create a visual vocabulary V = {v_1, ..., v_j, ..., v_{|V|}} as in [2]. For that purpose, we use 3 different descriptors. The first one (meanstd) is the same as in [5]: each image is partitioned into 16x16 cells, and each cell is described by a six-dimensional vector corresponding to the mean and the standard deviation of R/(R+G+B), G/(R+G+B) and (R+G+B)/(3*255), where R, G and B are the red, green and blue components of the cell. The second (named sift_1) and third (named sift_2) descriptors are based on the well-known SIFT descriptor [3]: sift_1 first detects regions of interest using the MSER method as in [4], while sift_2 uses a regular partitioning as in the meanstd descriptor.

For each of our 3 descriptors, we apply a k-means algorithm [1] to obtain a vocabulary of 10,000 visual terms, each visual term representing a cluster of feature vectors. Each image can then be represented as a vector of visual terms: local features are first computed with one of the 3 descriptors, and each feature is then assigned to the closest visual term (according to the Euclidean distance) in the corresponding visual vocabulary. In the same way as for textual words, the weight of each visual term is computed using a tf.idf approach (an illustrative sketch of this quantization step is given at the end of Sect. 2.3).

2.3 Combination

In order to combine textual and visual results, we use two different methods. The first one is a simple intersection between the results obtained with the textual query and those obtained with the visual one. The second one is a linear combination of the textual and visual scores:

  score(q_k, d_i) = \alpha \, score_V(q_k, d_i) + (1 - \alpha) \, score_T(q_k, d_i)

The α parameter lets us give more or less weight to the visual information. We compute its optimal value using the queries of the ImageCLEFwiki 2008 track as a training set.
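As an illustration of the visual part of the model (Sect. 2.2), the sketch below builds a visual vocabulary with k-means and quantizes the local features of an image into visual term counts. It assumes NumPy and scikit-learn's KMeans, and the descriptor function is only our reading of the meanstd description above (a 16x16 grid of cells); it is not the exact code used for the runs.

```python
import numpy as np
from sklearn.cluster import KMeans

def meanstd_features(image, grid=16):
    """Local colour features: mean and standard deviation of R/(R+G+B), G/(R+G+B)
    and (R+G+B)/(3*255) over a regular grid of cells (our reading of Sect. 2.2)."""
    h, w, _ = image.shape
    rgb = image.astype(float)
    s = rgb.sum(axis=2) + 1e-9                  # avoid division by zero on black pixels
    channels = np.stack([rgb[..., 0] / s, rgb[..., 1] / s, s / (3 * 255)], axis=2)
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = channels[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
            feats.append(np.concatenate([cell.mean(axis=(0, 1)), cell.std(axis=(0, 1))]))
    return np.array(feats)                      # one 6-dimensional feature per cell

def build_vocabulary(corpus_features, n_words=10000):
    """Cluster the local features of the whole corpus into n_words visual terms."""
    return KMeans(n_clusters=n_words, n_init=1).fit(corpus_features)

def visual_term_counts(image, vocabulary):
    """Assign each local feature to its closest visual term (Euclidean distance) and
    return raw counts, which play the role of n_{i,j} in the tf.idf weighting."""
    words = vocabulary.predict(meanstd_features(image))
    return np.bincount(words, minlength=vocabulary.n_clusters)
```

The same pipeline applies to the sift_1 and sift_2 descriptors; only the local feature extraction step changes.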
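For the linear combination of Sect. 2.3, a minimal sketch of the fusion and of a grid search over α on a training query set could look as follows. Here evaluate is a hypothetical callback returning, for instance, the MAP obtained with a given α; in our experiments α is tuned on the ImageCLEFwiki 2008 queries.

```python
def combine(score_v, score_t, alpha):
    """Late fusion of the visual and textual scores of one document."""
    return alpha * score_v + (1 - alpha) * score_t

def tune_alpha(train_queries, evaluate, step=0.001):
    """Return the alpha in [0, 1] maximising the evaluation measure on training queries.
    `evaluate(queries, alpha)` is a hypothetical callback that ranks the documents with
    the combined score and returns, e.g., the mean average precision."""
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    return max(candidates, key=lambda a: evaluate(train_queries, a))
```

The α values retained for the submitted runs are small, between 0.012 and 0.084 (see Table 2).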
3 Experiments

Using the model described in the previous section, we present the runs submitted to ImageCLEFwiki and the results we obtained.

3.1 Submitted runs

  run id    run                            score     text                image     combination
  LaHC 1    LaHC TXT okapi                 score^1   legend              -         -
  LaHC 2    LaHC TXT tfidf                 score^2   legend              -         -
  LaHC 3    run inter TXT IMG Meanstd      score^2   legend              meanstd   intersection
  LaHC 4    run inter TXT IMG Sift         score^2   legend              sift_1    intersection
  LaHC 5    run TXTIMG Meanstd 0.015       score^2   legend              meanstd   α = 0.015
  LaHC 6    run TXTIMG Meanstd 0.025       score^2   legend              meanstd   α = 0.025
  LaHC 7    run TXTIMG Sift 0.012          score^2   legend              sift_1    α = 0.012
  LaHC 8    run inter TXT IMG Siftdense    score^2   legend              sift_2    intersection
  LaHC 9    run TXT 100 3 1 5              score^2   legend + 100 char   -         -
  LaHC 10   run TXT 50 3 1 5               score^2   legend + 50 char    -         -
  LaHC 11   run TXTIMG 100 3 1 5 meanstd   score^2   legend + 100 char   meanstd   α = 0.025
  LaHC 12   run TXTIMG 50 3 1 5 meanstd    score^2   legend + 50 char    meanstd   α = 0.025
  LaHC 13   run TXTIMG Siftdense 0.084     score^2   legend              sift_2    α = 0.084

  – meanstd: regular partitioning + color descriptor
  – sift_1: MSER detector + SIFT descriptor
  – sift_2: regular partitioning + SIFT descriptor
  – legend: text of the image document
  – n char: size (n characters) of the text window around the image in the original wikipedia documents

Table 2. Presentation of the runs.

All the runs are entirely automatic and are summarized in Table 2. We define a baseline, LaHC 1, that corresponds to a pure text model: it uses only textual terms for the query and for the scoring of documents. We compute score^1 for each document using the terms of its textual content; the image name and the bounding characters are not considered, and we use neither feedback nor query expansion. Since score^1 is applied, the query terms are weighted with their frequency tf only.

Using only the text, we perform 3 other runs. LaHC 2 is the same as the baseline except that the query terms are weighted by their tf.idf rather than their tf. LaHC 9 and LaHC 10 are two other text-only runs that make use of the bounding text around the image in the original wikipedia document: LaHC 9 adds 100 characters before and after the image, while LaHC 10 adds 50 characters.

All the other runs exploit both the textual and the visual information of documents. LaHC 3, LaHC 4 and LaHC 8 are obtained by intersecting the text-only results (LaHC 2) with the image-only results based on the meanstd, sift_1 and sift_2 descriptors respectively. The remaining runs are obtained from a linear combination of the textual and visual scores: LaHC 5, LaHC 6, LaHC 7 and LaHC 13 use the textual scores of LaHC 2 and the visual scores of one visual descriptor (meanstd, sift_1 or sift_2), while LaHC 11 and LaHC 12 combine the textual scores of LaHC 9 and LaHC 10 respectively with the visual scores of the meanstd descriptor.

4 Results

All the obtained results are summarized in Table 3. Considering all the submissions, our team ranks 2nd out of 8 participants^1. As we can see, the best results are obtained when we combine the image bounding text with the meanstd descriptor. We could have obtained better results by combining the image bounding text with the sift_2 descriptor: comparing the text-image runs LaHC 6, LaHC 7 and LaHC 13, we notice that the best visual descriptor is sift_2, followed by meanstd and sift_1. The last three runs, obtained with the intersection, give the worst results, but if we compare them in terms of precision they are the best ones: about one retrieved document out of six is relevant. As a rule, combining the textual information with the visual one always improves the results and returns more relevant documents, which is very encouraging.
  rank   run                            MAP      # retrieved   # relevant retrieved
  5      run TXTIMG 100 3 1 5 meanstd   0.2178   44993         1213
  6      run TXTIMG 50 3 1 5 meanstd    0.2148   44993         1218
  14     run TXTIMG Siftdense 0.084     0.1903   44993         1212
  15     run TXT 100 3 1 5              0.1890   38004         1205
  16     run TXT 50 3 1 5               0.1880   37041         1198
  20     run TXTIMG Meanstd 0.025       0.1845   44993         1208
  21     run TXTIMG Sift 0.012          0.1807   44995         1200
  24     run TXTIMG Meanstd 0.015       0.1792   44993         1213
  33     LaHC TXT tfidf                 0.1667   35611         1192
  44     LaHC TXT okapi                 0.1432   35611         1164
  52     run inter TXT IMG Siftdense    0.0365   619           142
  53     run inter TXT IMG Meanstd      0.0338   574           76
  54     run inter TXT IMG Sift         0.0321   637           120

Table 3. Presentation of the results.

1 http://imageclef.org/2009/wikiMM-results

5 Conclusion

In this article, we proposed improvements to the multimedia model we introduced in [5]. The first one was to use image bounding text extracted from the original documents, the second one was to use SIFT-based image descriptors for the visual part, and the third one was to add a new text/image combination approach. A series of thirteen runs was submitted using the ImageCLEFwiki 2009 collection. A first analysis of the results leads to the three following remarks: it is better to use the image bounding text than the legend only; the SIFT descriptor is better than our previous color descriptor, provided it is computed on a regular partitioning; and the text-image combination is a winning strategy, which can be implemented as a linear combination of textual and visual scores.

For future work, we aim to combine the textual information with more than one visual descriptor. Moreover, as the importance of the visual information depends on the query, we also plan to learn a different α parameter for each query.

6 Acknowledgements

This work was partly supported by the LIMA project^2 and the Web Intelligence project^3, two Rhône-Alpes region projects of the ISLE cluster^4.

References

1. Halil Bisgin. Parallel clustering algorithms with application to climatology. Technical report, Informatics Institute, Istanbul Technical University, Turkey, 2008.
2. Frédéric Jurie and Bill Triggs. Creating efficient codebooks for visual recognition. In International Conference on Computer Vision, 2005.
3. David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision (ICCV), Corfu, pages 1150–1157, 1999.
4. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.
5. Christophe Moulin, Cécile Barat, Mathias Géry, Christophe Ducottet, and Christine Largeron. UJM at ImageCLEFwiki 2008. In ImageCLEF 2008, 2008.
6. Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Aarron Gull, and Marianna Lau. Okapi at TREC-3. In Text REtrieval Conference, pages 21–30, 1994.
7. Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
8. C. Zhai. Notes on the Lemur TFIDF model. Technical report, Carnegie Mellon University, http://www.lemurproject.com, 2001.

2 LIMA project: http://liris.cnrs.fr/lima/
3 WI project: http://www.web-intelligence-rhone-alpes.org/
4 ISLE cluster: http://ksup-gu.grenet.fr/isle