Combining text/image in WikipediaMM task 2009

Christophe Moulin, Cécile Barat, Cédric Lemaître, Mathias Géry, Christophe Ducottet, Christine Largeron

Université de Lyon, F-69003, Lyon, France
Université de Saint-Étienne, F-42000, Saint-Étienne, France
CNRS UMR 5516, Laboratoire Hubert Curien
{christophe.moulin, cecile.barat, cedric.lemaitre, mathias.gery, ducottet, largeron}@univ-st-etienne.fr

Abstract. This paper reports the multimedia information retrieval experiments we carried out for the ImageCLEF 2009 track. In 2008, we proposed a multimedia document model defined as a vector of textual and visual terms weighted using a tf.idf approach [5]. For our second participation, our goal was to improve this previous model in the following ways: 1) use of additional information for the textual part (legend and image bounding text extracted from the original documents), 2) use of different image detectors and descriptors, 3) a new text/image combination approach. The results allow us to evaluate the benefits of these different improvements.

1 Introduction

ImageCLEFwiki is a multimedia collection in which each document is composed of text and one image. User needs are represented by queries ("topics"), which are also multimedia. A multimedia document model is therefore necessary to handle such a collection. In 2008, we proposed a first model that combines text and image information for multimedia retrieval [5]. This year, we improve this model by adding textual information and by using different detectors and descriptors for the visual information. Moreover, we use a linear combination to merge our textual and visual results. After presenting our model, we explain the submitted runs and the obtained results, and we finish by introducing our future work.

2 Visual and textual document model

The document model we defined for ImageCLEF 2008 lets us rank documents with respect to a query using different methods. Firstly, we explain the key features of our approach for ranking documents according to a query using only textual information. Secondly, we describe how we extend the method to handle the visual information. Finally, we present our method for combining textual and visual results.

2.1 Textual representation model

As in the vector space model introduced by Salton et al. [7], we represent a document d_i as a vector of weights w_{i,j}. Each w_{i,j} corresponds to the importance of the term t_j in the document d_i, computed by multiplying tf_{i,j} and idf_j, where tf_{i,j} is the term frequency that characterizes the frequency of the term t_j in the document d_i and idf_j is the inverse document frequency that quantifies the importance of the term t_j over the corpus of documents. w_{i,j} is high when the term t_j is frequent in the document d_i but rare in the others. We use the tf_{i,j} and idf_j defined in the Okapi formula by Robertson et al. [6]:

  tf_{i,j} = \frac{k_1 \, n_{i,j}}{n_{i,j} + k_2 \left(1 - b + b \frac{|d_i|}{d_{avg}}\right)}

where n_{i,j} is the number of occurrences of the term t_j in the document d_i, |d_i| the size of the document d_i, d_{avg} the average size of all documents in the corpus, and k_1, k_2 and b three constants;

  idf_j = \log \frac{|D| - |\{d_i \mid t_j \in d_i\}| + 0.5}{|\{d_i \mid t_j \in d_i\}| + 0.5}

where |D| is the size of the corpus and |\{d_i \mid t_j \in d_i\}| the number of documents in which the term t_j occurs at least once.

If we consider a query q_k as a short document, we can represent it as a vector of weights as well. A score between the query q_k and a document d_i is then computed as shown in Table 1. The main difference between score^1 and score^2 is the representation of the query: in the first score, the weight of a query term is defined by its tf only, while in the second score it is equal to its tf.idf. Note that for tf_{k,j}, b = 0 because |d_k| and d_{avg} are not defined for a query.

Table 1. Scoring equations and their default parameters [8].

  Scoring equation                                                                        tf_{i,j} parameters              tf_{k,j} parameters
  score^1(q_k, d_i) = \sum_{t_j \in q_k} tf_{i,j} \cdot idf_j \cdot tf_{k,j}              k_1 = 2.2, k_2 = 1.2, b = 0.75   k_1 = 8, k_2 = 7, b = 0
  score^2(q_k, d_i) = \sum_{t_j \in q_k} tf_{i,j} \cdot idf_j \cdot tf_{k,j} \cdot idf_j  k_1 = 1, k_2 = 1, b = 0.5        k_1 = 1, k_2 = 1, b = 0
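To make the weighting above concrete, the short Python sketch below computes the Okapi tf_{i,j}, idf_j and the score^2 of Table 1 on a toy tokenised corpus. It is only an illustration under our own function names and with the default parameters of the second row of Table 1; the submitted runs themselves are produced with the Lemur software [8].

```python
import math
from collections import Counter

def okapi_tf(n, doc_len, avg_len, k1=1.0, k2=1.0, b=0.5):
    """Okapi term frequency of Sect. 2.1: k1*n / (n + k2*(1 - b + b*|d|/d_avg))."""
    return (k1 * n) / (n + k2 * (1 - b + b * doc_len / avg_len))

def okapi_idf(df, n_docs):
    """Okapi inverse document frequency for a term occurring in df documents."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def score2(query, docs):
    """score^2 of Table 1: query terms weighted by tf.idf (b = 0 on the query side)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))        # document frequencies
    q_tf = Counter(query)
    scores = []
    for d in docs:
        d_tf = Counter(d)
        s = 0.0
        for t in set(query):
            if d_tf[t] == 0:
                continue                                 # absent term: no contribution
            idf = okapi_idf(df[t], n_docs)
            tf_d = okapi_tf(d_tf[t], len(d), avg_len)               # k1 = k2 = 1, b = 0.5
            tf_q = okapi_tf(q_tf[t], len(query), avg_len, b=0.0)    # b = 0 for queries
            s += tf_d * idf * tf_q * idf
        scores.append(s)
    return scores

# Toy example: three tokenised documents and a two-term query.
docs = [["desert", "dune", "sand"], ["mountain", "snow", "peak", "snow"], ["city", "street"]]
print(score2(["sand", "dune"], docs))    # highest score for the first document
```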
Different sources of text are available. The legend provided with images is often very short and sometimes useless, for example when it only deals with the copyright of the image or gives details about the user who uploaded it. In order to gain information, we use the original text extracted from the wikipedia documents in which the images appear: we consider a text fragment around the image, whose window size is tuned using the wikipediaMM 2008 collection as a training collection. We add this text to the legend of the image and index both the added text and the original legend. The indexing is performed with the Lemur software [8].

2.2 Visual representation model

In order to combine the visual information with the textual one, we also represent images as vectors of weights. Provided we are able to extract visual words from images, it is possible to use the tf.idf formula in the same way as in the textual model. It is therefore necessary to create a visual vocabulary V = {v_1, ..., v_j, ..., v_{|V|}} as in [2]. For that purpose, we use 3 different descriptors. The first one (meanstd) is the same as in [5]: each image is partitioned into 16x16 cells, and each cell is described by a six-dimensional vector corresponding to the mean and the standard deviation of R/(R+G+B), G/(R+G+B) and (R+G+B)/(3*255), where R, G and B are the red, green and blue components of the cell. The second (named sift_1) and third (named sift_2) descriptors are based on the well-known SIFT descriptor [3]: sift_1 first detects regions of interest using the MSER method as in [4], while sift_2 uses a regular partitioning as in the meanstd descriptor.

For each of our 3 descriptors, we apply a k-means algorithm [1] to obtain a vocabulary of 10,000 visual terms, each visual term representing a cluster of feature vectors. Each image can then be represented as a vector of visual terms: local features are first computed with one of the 3 descriptors, and each feature is then assigned to the closest visual term (according to the Euclidean distance) in the corresponding visual vocabulary. In the same way as for textual words, the weight of each visual term is computed using a tf.idf approach (an illustrative sketch of this quantization step is given at the end of Sect. 2.3).

2.3 Combination

In order to combine textual and visual results, we use two different methods. The first one is a simple intersection between the results obtained with the textual query and those obtained with the visual one. The second one is a linear combination of the textual and visual scores:

  score(q_k, d_i) = \alpha \, score_V(q_k, d_i) + (1 - \alpha) \, score_T(q_k, d_i)

The α parameter lets us give more or less weight to the visual information. We compute its optimal value using the queries of the ImageCLEFwiki 2008 track as a training set.
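As an illustration of the visual part of the model (Sect. 2.2), the sketch below builds a visual vocabulary with k-means and quantizes the local features of an image into visual term counts. It assumes NumPy and scikit-learn's KMeans, and the descriptor function is only our reading of the meanstd description above (a 16x16 grid of cells); it is not the exact code used for the runs.

```python
import numpy as np
from sklearn.cluster import KMeans

def meanstd_features(image, grid=16):
    """Local colour features: mean and standard deviation of R/(R+G+B), G/(R+G+B)
    and (R+G+B)/(3*255) over a regular grid of cells (our reading of Sect. 2.2)."""
    h, w, _ = image.shape
    rgb = image.astype(float)
    s = rgb.sum(axis=2) + 1e-9                  # avoid division by zero on black pixels
    channels = np.stack([rgb[..., 0] / s, rgb[..., 1] / s, s / (3 * 255)], axis=2)
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = channels[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
            feats.append(np.concatenate([cell.mean(axis=(0, 1)), cell.std(axis=(0, 1))]))
    return np.array(feats)                      # one 6-dimensional feature per cell

def build_vocabulary(corpus_features, n_words=10000):
    """Cluster the local features of the whole corpus into n_words visual terms."""
    return KMeans(n_clusters=n_words, n_init=1).fit(corpus_features)

def visual_term_counts(image, vocabulary):
    """Assign each local feature to its closest visual term (Euclidean distance) and
    return raw counts, which play the role of n_{i,j} in the tf.idf weighting."""
    words = vocabulary.predict(meanstd_features(image))
    return np.bincount(words, minlength=vocabulary.n_clusters)
```

The same pipeline applies to the sift_1 and sift_2 descriptors; only the local feature extraction step changes.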
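For the linear combination of Sect. 2.3, a minimal sketch of the fusion and of a grid search over α on a training query set could look as follows. Here evaluate is a hypothetical callback returning, for instance, the MAP obtained with a given α; in our experiments α is tuned on the ImageCLEFwiki 2008 queries.

```python
def combine(score_v, score_t, alpha):
    """Late fusion of the visual and textual scores of one document."""
    return alpha * score_v + (1 - alpha) * score_t

def tune_alpha(train_queries, evaluate, step=0.001):
    """Return the alpha in [0, 1] maximising the evaluation measure on training queries.
    `evaluate(queries, alpha)` is a hypothetical callback that ranks the documents with
    the combined score and returns, e.g., the mean average precision."""
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    return max(candidates, key=lambda a: evaluate(train_queries, a))
```

The α values retained for the submitted runs are small, between 0.012 and 0.084 (see Table 2).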
3 Experiments

Using the model described in the previous section, we present the runs submitted to ImageCLEFwiki and the results we obtained.

3.1 Submitted runs

  run id    run                            score     text                image     combination
  LaHC 1    LaHC TXT okapi                 score^1   legend              -         -
  LaHC 2    LaHC TXT tfidf                 score^2   legend              -         -
  LaHC 3    run inter TXT IMG Meanstd      score^2   legend              meanstd   intersection
  LaHC 4    run inter TXT IMG Sift         score^2   legend              sift_1    intersection
  LaHC 5    run TXTIMG Meanstd 0.015       score^2   legend              meanstd   α = 0.015
  LaHC 6    run TXTIMG Meanstd 0.025       score^2   legend              meanstd   α = 0.025
  LaHC 7    run TXTIMG Sift 0.012          score^2   legend              sift_1    α = 0.012
  LaHC 8    run inter TXT IMG Siftdense    score^2   legend              sift_2    intersection
  LaHC 9    run TXT 100 3 1 5              score^2   legend + 100 char   -         -
  LaHC 10   run TXT 50 3 1 5               score^2   legend + 50 char    -         -
  LaHC 11   run TXTIMG 100 3 1 5 meanstd   score^2   legend + 100 char   meanstd   α = 0.025
  LaHC 12   run TXTIMG 50 3 1 5 meanstd    score^2   legend + 50 char    meanstd   α = 0.025
  LaHC 13   run TXTIMG Siftdense 0.084     score^2   legend              sift_2    α = 0.084

  – meanstd: regular partitioning + color descriptor
  – sift_1: MSER detector + SIFT descriptor
  – sift_2: regular partitioning + SIFT descriptor
  – legend: text of the image document
  – n char: size (n characters) of the text window around the image in the original wikipedia documents

Table 2. Presentation of the runs.

All the runs are entirely automatic and are summarized in Table 2. We define a baseline, LaHC 1, that corresponds to a pure text model: it uses only textual terms for the query and for the scoring of documents. We compute score^1 for each document using the terms of its textual content; the image name and the bounding characters are not considered, and we use neither feedback nor query expansion. Since score^1 is applied, the query terms are weighted with their frequency tf only.

Using only the text, we perform 3 other runs. LaHC 2 is the same as the baseline except that the query terms are weighted by their tf.idf rather than their tf. LaHC 9 and LaHC 10 are two other text-only runs that make use of the bounding text around the image in the original wikipedia document: LaHC 9 adds 100 characters before and after the image, while LaHC 10 adds 50 characters.

All the other runs exploit both the textual and the visual information of documents. LaHC 3, LaHC 4 and LaHC 8 are obtained by intersecting the text-only results (LaHC 2) with the image-only results based on the meanstd, sift_1 and sift_2 descriptors respectively. The remaining runs are obtained from a linear combination of the textual and visual scores: LaHC 5, LaHC 6, LaHC 7 and LaHC 13 use the textual scores of LaHC 2 and the visual scores of one visual descriptor (meanstd, sift_1 or sift_2), while LaHC 11 and LaHC 12 combine the textual scores of LaHC 9 and LaHC 10 respectively with the visual scores of the meanstd descriptor.

4 Results

All the obtained results are summarized in Table 3. Considering all the submissions, our team ranks 2nd out of 8 participants^1. As we can see, the best results are obtained when we combine the image bounding text with the meanstd descriptor. We could have obtained better results by combining the image bounding text with the sift_2 descriptor: comparing the text-image runs LaHC 6, LaHC 7 and LaHC 13, we notice that the best visual descriptor is sift_2, followed by meanstd and sift_1. The last three runs, obtained with the intersection, give the worst results, but if we compare them in terms of precision they are the best ones: about one retrieved document out of six is relevant. As a rule, combining the textual information with the visual one always improves the results and returns more relevant documents, which is very encouraging.
  rank   run                            MAP      # retrieved   # relevant retrieved
  5      run TXTIMG 100 3 1 5 meanstd   0.2178   44993         1213
  6      run TXTIMG 50 3 1 5 meanstd    0.2148   44993         1218
  14     run TXTIMG Siftdense 0.084     0.1903   44993         1212
  15     run TXT 100 3 1 5              0.1890   38004         1205
  16     run TXT 50 3 1 5               0.1880   37041         1198
  20     run TXTIMG Meanstd 0.025       0.1845   44993         1208
  21     run TXTIMG Sift 0.012          0.1807   44995         1200
  24     run TXTIMG Meanstd 0.015       0.1792   44993         1213
  33     LaHC TXT tfidf                 0.1667   35611         1192
  44     LaHC TXT okapi                 0.1432   35611         1164
  52     run inter TXT IMG Siftdense    0.0365   619           142
  53     run inter TXT IMG Meanstd      0.0338   574           76
  54     run inter TXT IMG Sift         0.0321   637           120

Table 3. Presentation of the results.

1 http://imageclef.org/2009/wikiMM-results

5 Conclusion

In this article, we proposed improvements to the multimedia model we introduced in [5]. The first one was to use image bounding text extracted from the original documents, the second one was to use SIFT-based image descriptors for the visual part, and the third one was to add a new text/image combination approach. A series of thirteen runs was submitted using the ImageCLEFwiki 2009 collection. A first analysis of the results leads to the three following remarks: it is better to use the image bounding text than the legend only; the SIFT descriptor is better than our previous color descriptor, provided it is computed on a regular partitioning; and the text-image combination is a winning strategy, which can be implemented as a linear combination of textual and visual scores.

For future work, we aim to combine the textual information with more than one visual descriptor. Moreover, as the importance of the visual information depends on the query, we also plan to learn a different α parameter for each query.

6 Acknowledgements

This work was partly supported by the LIMA project^2 and the Web Intelligence project^3, two Rhône-Alpes region projects of the ISLE cluster^4.

References

1. Halil Bisgin. Parallel clustering algorithms with application to climatology. Technical report, Informatics Institute, Istanbul Technical University, Turkey, 2008.
2. Frédéric Jurie and Bill Triggs. Creating efficient codebooks for visual recognition. In International Conference on Computer Vision, 2005.
3. David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision (ICCV), Corfu, pages 1150–1157, 1999.
4. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.
5. Christophe Moulin, Cécile Barat, Mathias Géry, Christophe Ducottet, and Christine Largeron. UJM at ImageCLEFwiki 2008. In ImageCLEF 2008, 2008.
6. Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Aarron Gull, and Marianna Lau. Okapi at TREC-3. In Text REtrieval Conference, pages 21–30, 1994.
7. Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
8. C. Zhai. Notes on the Lemur TFIDF model. Technical report, Carnegie Mellon University, http://www.lemurproject.com, 2001.

2 LIMA project: http://liris.cnrs.fr/lima/
3 WI project: http://www.web-intelligence-rhone-alpes.org/
4 ISLE cluster: http://ksup-gu.grenet.fr/isle