=Paper=
{{Paper
|id=Vol-1177/CLEF2011wn-ImageCLEF-SuEt2011
|storemode=property
|title=Semantic Contexts and Fisher Vectors for the ImageCLEF 2011 Photo Annotation Task
|pdfUrl=https://ceur-ws.org/Vol-1177/CLEF2011wn-ImageCLEF-SuEt2011.pdf
|volume=Vol-1177
}}
==Semantic Contexts and Fisher Vectors for the ImageCLEF 2011 Photo Annotation Task==
Semantic Contexts and Fisher Vectors for the ImageCLEF 2011 Photo Annotation Task

Yu Su and Frédéric Jurie
GREYC, University of Caen, France
firstname.lastname@unicaen.fr

Abstract. This paper describes the participation of UNICAEN/GREYC in the ImageCLEF 2011 photo annotation task. The proposed approach uses only visual image features and binary concept annotations. The annotations are predicted by SVM classifiers trained separately for each concept. The classifiers take Bag-of-Words (BoW) histograms and Fisher Vector representations as inputs, both being combined at the decision level. Furthermore, contextual information is embedded into the BoW histograms to enhance their performance. The experimental results show that combining BoW histograms with Fisher Vectors brings a significant performance increase (e.g., 4% in Mean Average Precision). Moreover, our best run ranks in the top 3 for both the concept-level and image-level evaluations.

Keywords: Image classification, Photo annotation, Bag-of-Words model, Semantic context, Fisher Vectors

1 Introduction

The aim of the ImageCLEF 2011 photo annotation task is to automatically assign to each image a set of concepts taken from a list of 99 pre-defined visual concepts. The participants are given 8000 training images with the corresponding 99 binary labels, each of which corresponds to a visual concept, as well as a photo tagging ontology, EXIF data and Flickr user tags. In the test phase, the participants are requested to assign to each test image the labels of all the visual concepts describing the image. Performance is evaluated at the concept and image levels. For the former, Mean Average Precision (MAP) is computed for each concept. For the latter, the F-Measure (F-ex) and the Semantic R-Precision (SR-Precision) are computed for each image. For more details on this task, please refer to [7].

In our participation, we did not use the photo tagging ontology, the EXIF data or the Flickr user tags. Our results are based on visual image features only. Specifically, we extracted different types of local features (e.g. SIFT) from the images and adopted the Bag-of-Words (BoW) model to aggregate the local features into a global image descriptor. Our participation is mainly inspired by the work of Su and Jurie [11], which proposed to embed contextual information into the BoW model, and we also propose some improvements over [11]. In addition, Fisher Vectors (FV) have been reported to give good performance on both object recognition and image retrieval tasks [9]. We therefore also computed FV from the images and combined them with the context-embedded BoW histograms at the decision level, i.e., we train classifiers for the Fisher Vectors and the context-embedded BoW histograms separately and combine the classifiers by averaging their outputs. For photo annotation, this process is performed for each concept independently and the averaged classifier outputs are used as the confidences of concept occurrence.

The organization of this paper is as follows: in Section 2, we describe the local features used in our method. Then, we explain how the BoW model is combined with semantic contexts (Section 3) and Fisher Vectors (Section 4). The experimental evaluation is given in Section 5, followed by a conclusion in the last section.
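To make the decision-level fusion described above concrete, here is a minimal sketch in Python/NumPy; the function name, array shapes and the random scores are illustrative assumptions, not part of the submitted system.

```python
import numpy as np

def fuse_confidences(classifier_outputs):
    """Decision-level fusion for one concept: average the outputs of the
    classifiers trained on the different representations (e.g. context-specific
    BoW histograms, SPM channels, Fisher Vectors).

    classifier_outputs: list of 1-D arrays, each holding one score per test image.
    """
    return np.mean(np.stack(classifier_outputs, axis=0), axis=0)

# Illustrative usage: three representations, five test images.
rng = np.random.default_rng(0)
scores = [rng.random(5) for _ in range(3)]
confidences = fuse_confidences(scores)  # one fused confidence per image
```

The fused values are used directly as concept-occurrence confidences, one concept at a time.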
2 Visual Features

In our method, six kinds of visual features are extracted from each image; they are introduced below. Before feature computation, the images are scaled to be at most 300 × 300 pixels, with their original aspect ratios maintained. Except for the LAB features, which encode color information, color images are first converted to grayscale.

SIFT: SIFT descriptors [6] are computed for 5000 image patches with randomly selected positions and scales (scales from 16 to 64 pixels) and are vector quantized to 1024 k-means centers.

HOG: HOG descriptors [3] are densely extracted on a regular grid with a step of 8 pixels. On each node of the grid a 31-dimensional descriptor is computed, and 2 × 2 neighboring descriptors are concatenated to form a 124-dimensional descriptor. The HOG features are finally vector quantized to 256 k-means centers.

Textons: Texton descriptors [12] are generated by computing, for each pixel, the output of 36 Gabor filters with different scales and orientations, and are then quantized to 256 k-means centers.

SSIM: Self-similarity descriptors [10] are computed on a regular grid with a step of five pixels. Each descriptor is obtained by computing the correlation map of a 5 × 5 patch within a window of radius 40 pixels and quantizing it into 3 radial bins and 10 angular bins, giving a 30-dimensional descriptor vector. The self-similarity features are finally quantized to 256 k-means centers.

LAB: LAB descriptors [4] are computed for each pixel and then quantized to 128 k-means centers.

Canny: Canny edge descriptors [1] are computed for each pixel and then quantized into 8 orientation bins.

Finally, concatenating all the BoW histograms gives a 1928-dimensional feature vector which can describe an image or an image region.
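As an illustration of how the per-feature histograms combine into the 1928-dimensional descriptor (1024 + 256 + 256 + 256 + 128 + 8), here is a minimal sketch; the descriptor arrays and vocabulary centers are hypothetical placeholders for the quantities described above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bow_histogram(descriptors, centers):
    """Hard-assign each local descriptor to its nearest k-means center and
    return a normalized occurrence histogram."""
    assignments = cdist(descriptors, centers).argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(centers)).astype(float)
    return hist / max(len(descriptors), 1)

def image_descriptor(per_feature_descriptors, per_feature_centers):
    """Concatenate the BoW histograms of all feature types
    (1024 + 256 + 256 + 256 + 128 + 8 = 1928 dimensions in our setting)."""
    hists = [bow_histogram(d, c)
             for d, c in zip(per_feature_descriptors, per_feature_centers)]
    return np.concatenate(hists)
```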
3 Image Representation by Embedding Semantic Contexts into the BoW Model

In this section, we first review how the semantic contexts are defined and embedded into the BoW model, as introduced in [11]. Then we introduce our improvements over this method.

[Fig. 1: Grouped semantic contexts and some illustrative training images [11]. The contexts are organized into global scenes (35), local scenes (16), colors (8), shapes (7), materials (14) and objects (30); the values in brackets are the numbers of semantic contexts within the corresponding groups.]

3.1 Semantic Context

In [11], 110 semantic contexts are defined by hand with the intention of providing abundant semantic information for image description (see Fig. 1). Two types of contexts are distinguished: global contexts, comprising global scenes, and local contexts, comprising local scenes, colors, shapes, materials and objects.

For each semantic context, we learn an SVM classifier with a linear kernel (hereafter called a context classifier). For the global contexts, the classifiers are learned on whole images described by BoW histograms. For the local contexts, the classifiers are learned on randomly sampled image regions, again described by BoW histograms. The training images are automatically downloaded from Google image search using the name of the context as the query. After manual annotation, about 400 relevant images are kept for each context. They are used as positive images for the corresponding context, while images from the other contexts are used as negatives. In the test phase, images (for global contexts) or regions (for local contexts) are fed to the context classifiers, and a sigmoid function is used to transform the raw decision values into probabilities (refer to [2]).

3.2 Embedding Semantic Contexts into the BoW Model

Assume that a set of local features f_i, i = 1, ..., N is extracted from an image I, where N is the number of local features, and that the BoW model consists of V visual words v_j, j = 1, ..., V. The traditional BoW feature for v_j measures the occurrence probability of v_j in image I, namely p(v_j|I). In practice, p(v_j|I) is usually computed as

  p(v_j \mid I) = \frac{1}{N} \sum_{i=1}^{N} \delta(f_i, v_j),   (1)

where

  \delta(f_i, v_j) = \begin{cases} 1 & \text{if } j = \arg\min_{j'=1,\dots,V} d(f_i, v_{j'}) \\ 0 & \text{otherwise} \end{cases}   (2)

and d is a distance function (e.g., the L2 norm).

Marginalizing p(v_j|I) over the local contexts gives

  p(v_j \mid I) = \sum_{k=1}^{C} p(v_j \mid c_k, I)\, p(c_k \mid I),   (3)

where c_k is the k-th context, C is the number of local contexts (75 in our case), p(v_j|c_k, I) is the context-specific occurrence probability of v_j in image I, and p(c_k|I) is the occurrence probability of context c_k in image I. The second factor in Eq. 3, which gives the distribution of the different contexts in image I, can also provide rich information for describing the image, as shown by [13]. For example, knowing that an image is composed of one third sky, one third sea and one third beach conveys a lot about its content. Images are thus eventually represented by multiple context-specific BoW histograms, i.e., p(v_j|c_k, I), and a vector of context occurrence probabilities, i.e., p(c_k|I).

In [11], p(v_j|c_k, I) is constructed by modeling the probabilistic distribution of context c_k over image I, which is estimated by dividing image I into a set of regions I_p and predicting the occurrence probability of c_k for each region (using the context classifiers). Denoting by I_p(f_i) the set of image regions that cover the local feature f_i, we define

  p(v_j \mid c_k, I) = \frac{1}{N} \sum_{i=1}^{N} \delta(f_i, v_j)\, p(c_k \mid I_p(f_i)),   (4)

where p(c_k|I_p(f_i)) can be considered as the weight of local feature f_i. In practice, p(c_k|I_p(f_i)) is computed by averaging the outputs of the context classifier (for c_k) on I_p(f_i). As for p(c_k|I), it is computed by averaging the outputs of the context classifier (for c_k) on all the image regions in I_p, a process similar to the computation of p(c_k|I_p(f_i)) above.

In addition, we also represent image I by the occurrence probabilities of the global contexts. These probabilities are computed by running the corresponding context classifiers on the whole image. Finally, an image is represented by concatenating the occurrence probabilities of both local and global contexts, i.e., (p(c_1|I), ..., p(c_C|I), p(c_{C+1}|I), ..., p(c_{C'}|I)), where C' is the total number of contexts (110 in our case) and C is the number of local contexts (75 in our case). We call this image descriptor the semantic features.
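A minimal sketch of the context weighting of Eq. 4 and of the semantic features; the input arrays (visual-word assignments, per-feature region probabilities, per-region classifier outputs) are hypothetical placeholders standing in for the quantities defined above.

```python
import numpy as np

def context_specific_bow(word_ids, feature_region_probs, vocab_size):
    """Context-specific BoW histogram (Eq. 4).

    word_ids:             (N,) visual-word index of each local feature f_i
    feature_region_probs: (N,) p(c_k | I_p(f_i)), the average context-classifier
                          output over the regions covering f_i
    vocab_size:           V, the number of visual words
    """
    hist = np.zeros(vocab_size)
    np.add.at(hist, word_ids, feature_region_probs)  # weighted word counts
    return hist / len(word_ids)

def local_semantic_features(region_probs):
    """p(c_k | I) for the local contexts: average the context-classifier
    outputs over all sampled regions. region_probs has shape (num_regions, C)."""
    return region_probs.mean(axis=0)
```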
3.3 Our Improvements over [11]

The previous subsection reviewed the construction of context-specific BoW histograms as introduced in [11]. For our participation in the ImageCLEF 2011 photo annotation task, we propose two improvements over this method. First, we learn a specific vocabulary for each semantic context rather than using a uniform vocabulary for all contexts as in [11]. Second, instead of selecting a single context for each visual word as in [11], we train a classifier for each context-specific BoW histogram and then combine all the classifiers. The implementation of these two improvements is detailed below.

In the traditional vocabulary learning process, local features extracted from a set of images are randomly (or uniformly) sampled and then vector quantized to obtain the visual words. In contrast, when learning a context-specific vocabulary, our sampling of local features is guided by the distribution of that context over the images: more local features are sampled in image regions with higher context occurrence probabilities (the brighter image regions in Fig. 2). In practice, this is implemented by assigning each local feature f_i the probability p(c_k|I_p(f_i)) (defined in Section 3.2) and sampling local features according to these probabilities:

  s(f_i) = \begin{cases} 1 & \text{if } p(c_k \mid I_p(f_i)) \ge r_i \\ 0 & \text{otherwise} \end{cases}   (5)

where s(f_i) indicates whether the local feature f_i is selected, and the r_i are random numbers drawn uniformly between 0 and 1.

After sampling local features for each context, k-means is used to build the multiple context-specific vocabularies. An image is then represented by multiple context-specific BoW histograms, each constructed as in Section 3.2 (see Eq. 4). Concatenating all the context-specific BoW histograms would lead to a very high-dimensional feature vector (in our case 1928 × 75 = 144,600 dimensions). Thus, we train a classifier for each context-specific BoW histogram and combine the classifiers by averaging their outputs.

[Fig. 2: Combination of the BoW model and semantic contexts. For an image, multiple saliency maps are generated by both the context classifiers and the SPM channels, from which multiple BoW histograms are constructed by weighting local features according to the saliency maps. A classifier is then learned for each histogram. In addition, the occurrence probabilities of the semantic contexts (also referred to as semantic features) are predicted for the image, and a classifier is learned on them. Finally, all the classifiers are combined by averaging their outputs.]

Recall that the way we embed contextual information into the BoW model is based on weighting local features (see Eq. 4). This is similar to the well-known spatial pyramid matching (SPM) [5], which divides an image into grids and builds a histogram for each grid. That process can also be seen as a weighting of local features: for a given grid, the weights of the local features inside it are set to 1 and the weights of the other local features are set to 0. Although less flexible than the context-based weights, the binary weights of SPM are more stable, which is also desirable. Thus, we also train classifiers for the BoW histograms of the SPM channels. In our method, a three-level pyramid, 1 × 1, 2 × 2, 3 × 1 (8 channels in total), is used. It is worth pointing out that, unlike traditional SPM, we learn a specific vocabulary for each SPM grid based on the local features within that grid.
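A minimal sketch of the probability-guided sampling of Eq. 5 followed by k-means vocabulary learning; the array shapes, parameter defaults and the use of scikit-learn's KMeans are illustrative assumptions, not the exact implementation used here.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_features_for_context(descriptors, context_probs, rng=None):
    """Keep local feature f_i with probability p(c_k | I_p(f_i)) (Eq. 5).

    descriptors:   (N, D) local descriptors pooled from the training images
    context_probs: (N,)   p(c_k | I_p(f_i)) for one context c_k
    """
    rng = rng or np.random.default_rng()
    r = rng.random(len(descriptors))   # r_i ~ U(0, 1)
    selected = context_probs >= r      # s(f_i) = 1 if p(c_k | I_p(f_i)) >= r_i
    return descriptors[selected]

def learn_context_vocabulary(descriptors, context_probs, num_words=256):
    """Build one context-specific vocabulary by clustering the sampled features."""
    sampled = sample_features_for_context(descriptors, context_probs)
    return KMeans(n_clusters=num_words, n_init=4).fit(sampled).cluster_centers_
```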
Finally, we train a classifier for the semantic features and combine it with the classifiers for the context-specific BoW histograms and the SPM channels by averaging their outputs. For both the BoW histograms and the semantic features, the classifiers are SVMs with a chi-square kernel. The whole process is illustrated in Fig. 2.

4 Image Representation by Fisher Vectors

Like the BoW model, the Fisher kernel framework [8] aggregates local features into a global image descriptor, called a Fisher Vector (FV). FVs can be considered an extension of BoW histograms: rather than only counting the occurrences of each visual word as in the BoW model, they encode how the parameters of a generative model of the local features should be changed to better fit the image. In our participation, we adopted the improved FV introduced in [9], which has been shown to outperform BoW histograms on some large-scale image retrieval tasks.

For each image, we compute an FV for every kind of local feature except Canny, for which no actual visual words exist. As in [9], a three-level pyramid, 1 × 1, 2 × 2, 3 × 1 (8 channels in total), is used to enhance the performance of the Fisher Vectors. For the SIFT and HOG descriptors, PCA is used to reduce the descriptor dimensionality to 64. For the SIFT descriptors, a 64-centroid Gaussian mixture model (GMM) is estimated, so the resulting Fisher Vector has 64 × 64 × 8 × 2 = 65,536 dimensions. For the HOG, Texton, SSIM and LAB descriptors, 16-centroid GMMs are learned, so the corresponding Fisher Vectors have 16,384, 9,216, 7,680 and 768 dimensions respectively. We then train a classifier (SVM with a linear kernel) for each Fisher Vector and combine all the classifiers by averaging their outputs. The whole process is illustrated in Fig. 3. Note that the semantic contexts are not used in this representation.

[Fig. 3: Image representation and classification based on the Fisher Vectors of multiple types of local features: SIFT (65,536-D), HOG (16,384-D), Texton (9,216-D), SSIM (7,680-D) and LAB (768-D), each fed to a linear SVM. After training a classifier for each Fisher Vector, the classifiers are combined by averaging their outputs.]
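The sketch below is a simplified reconstruction of the improved Fisher Vector of [9] under stated assumptions (a single spatial channel, a diagonal-covariance GMM, gradients with respect to means and variances only, followed by power and L2 normalization); the use of scikit-learn's GaussianMixture and the example sizes are assumptions, not the implementation actually used for the runs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Improved Fisher Vector for one image: mean and variance gradients of a
    diagonal-covariance GMM, with power and L2 normalization."""
    X = np.atleast_2d(descriptors)                 # (N, D) PCA-reduced descriptors
    N, _ = X.shape
    q = gmm.predict_proba(X)                       # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)

    # Gradients w.r.t. means and variances, one (K, D) block each.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]        # (N, K, D)
    g_mu = (q[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])                 # 2 * K * D
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalization

# Illustrative usage: a 64-component GMM on 64-D descriptors gives a
# 2 * 64 * 64 = 8192-D vector per spatial channel.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(rng.normal(size=(5000, 64)))
fv = fisher_vector(rng.normal(size=(1000, 64)), gmm)
```

With a spatial pyramid, one such vector is computed per channel and the channels are concatenated.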
5 ImageCLEF Evaluation

In our participation, we submitted four runs to the photo annotation task. In this section, we describe these runs and compare their performance with the other visual-only runs.

5.1 Description of Our Runs

Run 1: MultiFeat Chi2SVM. In this run, images are described by the context-specific BoW histograms, each of which corresponds to a semantic context, as well as by the semantic features. As illustrated in Fig. 2, we trained separate classifiers (SVMs with a chi-square kernel) for the context-specific BoW histograms and for the semantic features, and then combined them by averaging their outputs.

Run 2: BoW+FisherKernel. In this run, we combined all the classifiers of Run 1 with the classifiers for the Fisher Vectors of the different features (see Fig. 3). The combination is performed by averaging the outputs of all classifiers.

Run 3: SVMOutput. In this run, the confidences of all 99 concepts obtained from Run 2 are used as a new image descriptor. A classifier (SVM with a chi-square kernel) is learned on this descriptor and used to produce the concept confidences. By doing so, we hope to benefit from the correlations between concepts.

Run 4: BoW+FisherKernel+SVMOuput. In this run, we averaged the confidences obtained from Runs 1, 2 and 3.

We used the LIBSVM implementation [2] to learn the SVM classifiers. The value of the SVM parameter C and the normalization factor γ of the chi-square kernel are determined by fivefold cross-validation. As for the image regions used for learning the local context classifiers and generating the saliency maps, we sampled 100 regions per image with random positions and scales (with scales from 20% to 40% of the image size). For the concept-level evaluation, the classifier outputs are used directly as confidences. For the image-level evaluation, the real-valued confidences are binarized with a threshold determined by fivefold cross-validation for each run.
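A minimal sketch of how a chi-square-kernel SVM with C and γ selected by five-fold cross-validation could be set up with scikit-learn's precomputed-kernel interface; the kernel definition exp(-γ·χ²) and the parameter grids are illustrative assumptions, not the exact LIBSVM settings used in the runs.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def chi2_kernel_matrix(A, B, gamma):
    """exp(-gamma * chi2) kernel between row-wise histograms A (n, d) and B (m, d)."""
    d = A[:, None, :] - B[None, :, :]
    s = A[:, None, :] + B[None, :, :] + 1e-12
    chi2 = 0.5 * (d ** 2 / s).sum(axis=-1)
    return np.exp(-gamma * chi2)

def train_concept_classifier(X, y, Cs=(0.1, 1, 10), gammas=(0.5, 1.0, 2.0)):
    """Pick (gamma, C) by five-fold cross-validation, then retrain on all data."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best, best_score = None, -np.inf
    for g in gammas:
        K = chi2_kernel_matrix(X, X, g)
        search = GridSearchCV(SVC(kernel='precomputed'), {'C': list(Cs)}, cv=cv)
        search.fit(K, y)
        if search.best_score_ > best_score:
            best_score, best = search.best_score_, (g, search.best_params_['C'])
    g, C = best
    clf = SVC(kernel='precomputed', C=C).fit(chi2_kernel_matrix(X, X, g), y)
    return clf, g  # keep gamma to build the test kernel against the training histograms
```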
5.2 Results of Our Runs

The performance (MAP, F-ex and SR-Precision) of our runs is listed in Table 1. It can be seen that the performance of the context-specific BoW histograms is significantly enhanced by combining them with the Fisher Vectors. It is worth pointing out that, according to our experiments on the training data, the performance of the Fisher Vectors alone is comparable to that of the context-specific BoW histograms. Another conclusion drawn from Table 1 is that using the classifier outputs as new features does not bring any improvement; more powerful methods are therefore needed to exploit the correlations between concepts.

Runs                         MAP   F-ex  SR-Precision
MultiFeat Chi2SVM            34.2  56.0  69.4
BoW+FisherKernel             38.2  60.0  72.7
SVMOutput                    34.5  49.1  65.1
BoW+FisherKernel+SVMOuput    38.2  59.2  72.5

Table 1. MAP, F-ex and SR-Precision of our runs. For each measure, values in bold indicate the best performance among the four runs.

For a more detailed view, Fig. 4 gives the MAP of each of the 99 concepts obtained with Run 2. For some concepts, the MAP is very low (e.g. less than 10%), either because the concept is hard to predict (e.g. abstract) or because the number of training samples is very small (e.g. only 12 images are annotated with skateboard).

[Fig. 4: The MAPs of all 99 concepts obtained from Run 2.]

Finally, we compare our best run (Run 2: BoW+FisherKernel) with the best visual-only runs of several competitors. As can be seen from Table 2, no run gave the best result for both the concept-level and image-level evaluations. For the concept-level evaluation (MAP as performance measure), TUBFI scores performed best, while for the image-level evaluation (F-ex and SR-Precision as performance measures), ISIS runpa-UvA-coreA performed best. Our best run ranks second for both MAP and F-ex and third for SR-Precision.

Runs                       MAP   F-ex  SR-Precision
BPACAD bpacad avg cns      36.7  56.8  72.9
ISIS runpa-UvA-coreA       37.5  61.2  73.4
LIRIS 4visual model 4      35.5  53.9  72.5
TUBFI scores               38.8  55.2  62.1
Our best run               38.2  59.2  72.5

Table 2. MAP, F-ex and SR-Precision of our best run and of the best visual-only runs of several competitors. For each measure, values in bold indicate the best performance. Our best run ranks second for both MAP and F-ex and third for SR-Precision.

6 Conclusion

In our participation in the ImageCLEF photo annotation task, multiple visual features were used to represent the images. We embedded contextual information into the traditional Bag-of-Words model and further combined it with Fisher Vectors, which have been shown to perform well on image classification and retrieval tasks. The evaluation results show that the performance of the Bag-of-Words model can be significantly enhanced by combining it with semantic contexts and Fisher Vectors. Our best run achieved 38.2%, 59.2% and 72.5% for MAP, F-ex and SR-Precision respectively, while the best results among visual-only runs were 38.8%, 61.2% and 73.4% respectively.

Acknowledgement

This work was partly realized under the Quaero Programme, funded by OSEO, the French State agency for innovation.

References

1. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986)
2. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
4. Hunter, R.: Photoelectric color difference meter. JOSA 48(12), 985–993 (1958)
5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)
6. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Nowak, S., Nagel, K., Liebetrau, J.: The CLEF 2011 photo annotation and concept-based retrieval tasks. In: CLEF 2011 Working Notes (2011)
8. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR (2007)
9. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: ECCV (2010)
10. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR (2007)
11. Su, Y., Jurie, F.: Visual word disambiguation by semantic contexts. In: ICCV (2011)
12. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. International Journal of Computer Vision 62(1), 61–81 (2005)
13. Vogel, J., Schiele, B.: Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision 72(2), 133–157 (2007)