=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-ImageCLEF-StanekEt2010
|storemode=property
|title=The Wroclaw University of Technology Participation at ImageCLEF 2010 Photo Annotation Track
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-StanekEt2010.pdf
|volume=Vol-1176
}}
==The Wroclaw University of Technology Participation at ImageCLEF 2010 Photo Annotation Track==
The Wroclaw University of Technology Participation at ImageCLEF 2010 Photo Annotation Track

Michal Stanek, Oskar Maier, and Halina Kwasnicka
Wrocław University of Technology, Institute of Informatics
michal.stanek@pwr.wroc.pl, oskar.maier@student.pwr.wroc.pl, halina.kwasnicka@pwr.wroc.pl

Abstract. In this paper we present three methods for image auto-annotation used by the Wroclaw University of Technology group at the ImageCLEF 2010 Photo Annotation track. All of our experiments focus on the robustness of global color and texture image features in connection with different similarity measures. To annotate images we use two versions of the PATSI algorithm, which searches for the most similar images and transfers annotations from them to the target image by applying a transfer function. We use both the simple version of the algorithm, working on a single similarity matrix, and multi-PATSI, which uses many similarity measures to obtain the final annotations. As a third approach to image auto-annotation we use Penalized Discriminant Analysis to train a multi-class classifier in a One-vs-All manner. During the training and optimization of all annotators we use the F-measure as the evaluation measure, trying to achieve its highest value on the training set. The obtained results indicate that our approach achieved a high quality measure only for a small group of terms and that it is also necessary to take local image characteristics into account.

This work is partially financed from the Ministry of Science and Higher Education Republic of Poland resources in the years 2008–2010 as a Poland–Singapore joint research project 65/N-SINGAPORE/2007/0.

1 Introduction

Recently, Makadia et al. [1] proposed a family of image annotation baseline methods built on the hypothesis that visually similar images are likely to share the same annotations. They treat image annotation as a process of transferring labels from nearest neighbours. Makadia's method does not solve the fundamental problem of determining the number of annotations that should be assigned to the target image; thus they assume a constant number of annotations per image. The transfer is performed in two steps: all annotations from the most similar image are rewritten, and then the most frequent words are chosen from the whole neighbourhood until a given annotation length has been achieved.

We extend Makadia's approach by constructing the PATSI (Photo Annotation through Similar Images) annotator, which introduces a transfer function [2] as well as an optimization algorithm that finds the optimal number of neighbours and the best transfer threshold according to a specified quality measure [3]. During our experiments with different similarity metrics we extended this algorithm to multi-PATSI, which performs the annotation transfer based on many similarity matrices calculated using different feature sets and similarity measures, and combines the results into a final annotation based on the quality of each annotator for specific words.

At the ImageCLEF 2010 photo annotation track [4] we evaluate the PATSI and multi-PATSI approaches with global image features. During the experiments we use grid segmentation and statistical color information as well as features extracted using the LIRE package [5]. As a third type of automatic image annotator we train a PDA [6, 7] classifier on CEDD [8] and JPEG Coefficient Histogram [9] features in a One-vs-All manner.

This paper is organized as follows.
In the next section we describe the automatic image annotation methods used, with an explanation of the extracted features, the distance measures, and the details of the annotation algorithm. The third section describes the experiments and the achieved results. The paper finishes with conclusions and remarks on possible further improvements of the method.

2 Annotation process

In this section we describe the automatic image annotation methods used by our team during the ImageCLEF 2010 Photo Annotation track [4]. First we focus on the types of visual features extracted from the images and the similarity measures used to build similarity matrices, then we describe the annotation transfer process.

2.1 Visual Features

An image I in the training dataset D is represented by an n-dimensional vector of visual features v^I = (v_1^I, ..., v_n^I). Each visual feature is an m-dimensional vector of low-level attributes v_i^I = (x_1^{i,I}, ..., x_m^{i,I}). The visual features must be extracted from the image and can represent information about color and texture for the entire image or only for a selected area of the image I.

For all images in both the training and the testing dataset we performed visual feature extraction using a self-made feature extractor and the image descriptors contained in the LIRE package [5]. We focused mainly on global image characteristics, but we also used more local information obtained after splitting the image with rectangular 5-by-5 and 20-by-20 grids. The list of extracted features includes:

1. From the MPEG-7 standard [9] we use the following image descriptors calculated for the whole image:
   - Fuzzy Color Histogram – 125 dimensions
   - JPEG Coefficient Histogram – 192 dimensions
   - General Color Layout – 18 561 dimensions
   - Color and Edge Directivity Descriptor (CEDD) [8] – 120 dimensions
   - Fuzzy Color and Texture Histogram (FCTH) [10] – 192 dimensions
2. Tamura features – the first three of the six texture features corresponding to human visual perception [11]:
   - coarseness – size of the texture elements,
   - contrast – contrast stands for picture quality,
   - directionality – texture orientation.
   The Tamura feature vector has 16 dimensions.
3. Auto Color Correlogram features defined in [12, 13] – 256 dimensions
4. Gabor texture features [14] – 60 dimensions
5. Statistical color and edge information of image regions (5-by-5 and 20-by-20 grid) in two color spaces, RGB and HSV:
   - x and y coordinates of the segment center – 2 dimensions,
   - the mean value of color in each channel of the color space – 3 dimensions,
   - standard deviations of color changes in each channel for a given color space – 3 dimensions,
   - mean eigenvalues of the color Hessian in each channel for a given color space – 3 dimensions.
6. Co-occurrence Matrix [15] calculated for each segment of the 5-by-5 and 20-by-20 segmentation – 21 dimensions

2.2 Distance Metrics

To obtain the similarity, or rather dissimilarity, between two images, we measure the distance between vectors in a metric space and the divergence between distributions built on the visual vectors. In our experiments we use the distance measures described below.

Minkowski distance. The Minkowski distance is widely used for measuring similarity between objects (e.g., images). The Minkowski metric between images A and B is defined as

$$ d_{MK}(A, B) = \left( \sum_{i=1}^{n} \left| v_i^A - v_i^B \right|^p \right)^{1/p}, \quad (1) $$

where p is the Minkowski factor for the norm. In particular, when p equals one or two, it is the well-known L1 and Euclidean distance, respectively.
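As an illustration only (not part of the original system), a minimal NumPy sketch of Eq. (1); the feature vectors below are hypothetical example values:

```python
import numpy as np

def minkowski_distance(v_a, v_b, p=2):
    """Minkowski distance between two visual feature vectors (Eq. 1).

    p=1 gives the L1 (Manhattan) distance, p=2 the Euclidean distance.
    """
    v_a = np.asarray(v_a, dtype=float)
    v_b = np.asarray(v_b, dtype=float)
    return np.sum(np.abs(v_a - v_b) ** p) ** (1.0 / p)

# Two hypothetical 4-dimensional feature vectors (illustrative values only).
a = [0.2, 0.5, 0.1, 0.9]
b = [0.3, 0.4, 0.0, 1.0]
print(minkowski_distance(a, b, p=1))  # L1 distance: 0.4
print(minkowski_distance(a, b, p=2))  # Euclidean distance: 0.2
```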
Cosine distance is a measure of similarity between two vectors of n dimensions obtained by finding the cosine of the angle between them; it is often used to compare documents in text mining:

$$ d_{Cos}(A, B) = 1 - \frac{(v^A)(v^B)^T}{\lVert v^A \rVert_2 \, \lVert v^B \rVert_2}. \quad (2) $$

Manhattan distance, also called the city-block distance or the taxicab metric, is the metric of the Euclidean plane defined by

$$ d_{Manh}(A, B) = \sum_i \left| v_i^A - v_i^B \right|. \quad (3) $$

Correlation distance measures the similarity in shape between vectors and is defined by

$$ d_{Corr}(A, B) = 1 - \frac{(v^A - \bar{v}^A)(v^B - \bar{v}^B)^T}{\lVert v^A - \bar{v}^A \rVert_2 \, \lVert v^B - \bar{v}^B \rVert_2}, \quad (4) $$

where ||u − ū||_2 is the L2 norm of the difference between the vector u and its mean vector ū.

Jensen–Shannon divergence. Based on the visual feature vectors v^I one can build a model M^I for the image I. We can assume that M^I is a multi-dimensional random variable described by a multivariate normal distribution and that all vectors v_i^I are realizations of this model. The probability density function (PDF) for the model M^I is defined as

$$ M^I(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right), \quad (5) $$

where x is the observation vector, μ the mean vector, and Σ the covariance matrix. Both μ and Σ are parameters of the model calculated using the Expectation-Maximization algorithm [16] on all visual features [v_1^I, ..., v_n^I] of the image I. To avoid problems with inverting the covariance matrix (matrix singularity), one may regularize the covariance matrix. Models are built for all images in the training set, as well as for the query image.

The distance between the models can be computed as the Jensen–Shannon divergence, which is a symmetrized version of the Kullback–Leibler divergence:

$$ d_{JS}(A, B) = \frac{1}{2} D_{KL}(M^A \| M^B) + \frac{1}{2} D_{KL}(M^B \| M^A), \quad (6) $$

where M^A, M^B are the models (PDFs) of the images A and B, and D_{KL} is the Kullback–Leibler divergence, which for multivariate normal distributions takes the form

$$ D_{KL}(M^A \| M^B) = \frac{1}{2} \log_e \frac{\det \Sigma_B}{\det \Sigma_A} + \frac{1}{2} \mathrm{tr}\left( \Sigma_B^{-1} \Sigma_A \right) + \frac{1}{2} (\mu_B - \mu_A)^\top \Sigma_B^{-1} (\mu_B - \mu_A) - \frac{N}{2}, \quad (7) $$

where Σ_A, Σ_B and μ_A, μ_B are the covariance matrices and mean vectors of the respective image models A and B.

2.3 Automatic Image Annotation Methods

We use three methods of automatic image annotation: the PATSI (Photo Annotation through Similar Images) annotator, the multi-PATSI annotator, and a multi-class PDA classifier. Details of all of these methods are described below.

PATSI Annotator. In the PATSI approach, for a query image Q, a vector of the k most similar images from the training dataset D needs to be found based on the similarity distance measure d. Let [r_1, ..., r_k] be the ranking of the k most similar images ordered by decreasing similarity. Based on the hypothesis that images similar in appearance are likely to share the same annotation, keywords from the nearest neighbours are transferred to the query image. All labels of the image at position r_i in the ranking are transferred with a value designated by the transfer function φ(r_i). To ensure that labels from more similar images have a larger impact on the resulting annotation, we define φ as

$$ \varphi(r_i) = \frac{1}{i}, \quad (8) $$

where r_i is the image at position i in the ranking. All words associated with image r_i are then transferred to the resulting annotation with the associated transfer value 1/i. If a word has been transferred before, its transfer values are summed. The resulting query image annotation consists of all the words whose transfer values are greater than a specified threshold t.
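As an illustration only (not the authors' code), a minimal sketch of this transfer step, assuming the k nearest neighbours and their annotations have already been retrieved; all names and example annotations are hypothetical:

```python
from collections import defaultdict

def patsi_transfer(neighbour_annotations, t):
    """Transfer labels from a ranked list of neighbours (most similar first).

    neighbour_annotations: list of sets of words, ordered by decreasing similarity.
    t: transfer threshold; words whose accumulated transfer value reaches t are kept.
    """
    transfer_values = defaultdict(float)
    for i, words in enumerate(neighbour_annotations, start=1):
        for word in words:
            transfer_values[word] += 1.0 / i  # phi(r_i) = 1/i, Eq. (8)
    return {w for w, v in transfer_values.items() if v >= t}

# Hypothetical example: annotations of the 3 nearest neighbours of a query image.
neighbours = [{"sky", "clouds"}, {"sky", "sea"}, {"sky", "beach"}]
print(patsi_transfer(neighbours, t=0.8))
# {'sky', 'clouds'}: sky accumulates 1 + 1/2 + 1/3, clouds 1.0; sea and beach stay below t.
```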
The threshold value t has an impact on the resulting annotation length. Its optimal value, as well as the optimal number of neighbours k that should be taken into account during the annotation process, must both be found using an optimization process. The outline of the PATSI annotation method is presented in Figure 1 and summarized in Algorithm 1.

Fig. 1. Schematic diagram of the PATSI algorithm.

The optimal parameters k* and t* differ greatly not only between databases, but also between feature sets, distance measures, and transfer functions. There exists no choice of them that would be suitable in all cases; we need to adjust them in each particular case.

Algorithm 1 PATSI image annotation algorithm
Require: D – training dataset, d – distance function, Q – annotation quality function, φ – transfer function
1: (Preparation phase) Calculate and store the visual features of all images in the training dataset D.
2: Calculate the similarity matrix using the distance function d between all images in the training dataset D.
3: (Optimization phase) Choose values of k and t maximizing the quality function Q on the training dataset.
4: (Query phase) Calculate the visual features of the query image Q.
5: Calculate the distance from the query image to all other images in the training dataset D.
6: Take the k images with the smallest distances between the models and create a ranking of those images.
7: Transfer all words from the images in the ranking with the value φ(r), where r is the position of the image in the ranking.
8: As the final annotation take the words whose summed transfer values are greater than or equal to the provided threshold t.

Finding t* and k* proves to be a non-trivial task. Commonly used optimization solvers are inapplicable due to the non-linear character of the quality function Q (discrete domain in k and continuous in t). To efficiently find t* and k* we propose and use the iterative refinement algorithm described in [3].

Multi-PATSI Annotator. During our experiments we noticed that some of the features, as well as some distance metrics, are more suitable for detecting certain groups of words while showing weak performance for others. By combining them we can increase the overall annotation performance. We propose the multi-PATSI method, which takes advantage of this observation by joining together the strengths of a number of annotation techniques.

Fig. 2. Schematic diagram of the multi-PATSI algorithm.

The overall schema of the multi-PATSI approach is presented in Figure 2. In the first step we run the PATSI algorithm separately for each feature set and distance function to obtain annotation vectors. Each element of those vectors represents whether a word should be assigned to the query image Q or not (class {−1, 1}). For each of the PATSI annotators a performance vector is calculated at the learning stage; it corresponds to the efficiency of the PATSI annotator for each of the annotated words on the testing set. The annotation vector of each PATSI annotator is then multiplied by its performance vector to obtain a weighted annotator response. All weighted responses are summed together, creating the final annotation: all concepts which obtain a value greater than a threshold t_multi are treated as the final annotation of the query image Q. The optimal threshold value t*_multi can be calculated using cross-validation and an optimization technique such as iterative refinement [3].
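A minimal sketch of this combination step (illustrative only; variable names, array shapes, and the example numbers are assumptions, not the authors' code):

```python
import numpy as np

def multi_patsi_combine(annotation_vectors, performance_vectors, t_multi):
    """Combine the {-1, 1} outputs of several PATSI annotators.

    annotation_vectors: array of shape (n_annotators, n_concepts) with values in {-1, 1}.
    performance_vectors: array of the same shape holding each annotator's
        per-concept quality (e.g. F-measure) estimated during learning.
    t_multi: decision threshold applied to the weighted sum.
    Returns a boolean vector: True where the concept is assigned to the query image.
    """
    annotation_vectors = np.asarray(annotation_vectors, dtype=float)
    performance_vectors = np.asarray(performance_vectors, dtype=float)
    weighted_sum = (annotation_vectors * performance_vectors).sum(axis=0)
    return weighted_sum > t_multi

# Hypothetical example: 2 annotators voting on 3 concepts.
ann = [[1, -1, 1],
       [1, 1, -1]]
perf = [[0.24, 0.10, 0.05],
        [0.22, 0.21, 0.30]]
print(multi_patsi_combine(ann, perf, t_multi=0.1))  # [ True  True False]
```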
Multi-Class Classification. As the third annotation method we use the Penalized Discriminant Analysis classifier [6, 7] from the Python Machine Learning module MLPY [17] in a One-vs-All scenario. In this approach, for each concept we train a separate PDA classifier using the extracted image features. We use all features from images annotated with a specific concept as positive examples and the others as negative. In training we use four-fold cross-validation.

3 Experimental Results

We submitted five runs for the annotators and feature sets described in the previous section:

1. PATSI with Kullback–Leibler divergence – HSV color space and 20-by-20 grid,
2. PATSI with Kullback–Leibler divergence – RGB color space and 20-by-20 grid,
3. multi-PATSI with the features presented in Table 2,
4. PDA classifier with CEDD features,
5. PDA classifier with JPEG Coefficient Histogram features.

The official results of the five runs in terms of Average Precision (AP), Average Equal Error Rate (Avg. EER), and Average Area Under Curve (Avg. AUC) are reported in Table 1. A detailed overview of the annotation quality of each submitted method for the 30 best-annotated concepts is presented in Table 3.

Table 1. Official results of submitted runs on the testing dataset

| Submitted run                  | AP       | EER      | AUC      |
| PDA + CEDD                     | 0.188821 | 0.361605 | 0.419472 |
| PDA + JpegCoefficientHistogram | 0.186649 | 0.375593 | 0.398008 |
| PATSI + RGB                    | 0.183472 | 0.464731 | 0.125210 |
| PATSI + HSV                    | 0.180601 | 0.461858 | 0.128203 |
| Multi-PATSI                    | 0.174149 | 0.427712 | 0.240389 |

Table 2. Feature sets and distance metrics with their optimal transfer parameters (F, t*, n*) used in the multi-PATSI annotator; a dash denotes a value not reported.

| Feature set | L1 (F / t* / n*) | L2 (F / t* / n*) | Cosine (F / t* / n*) | Manhattan (F / t* / n*) | Correlation (F / t* / n*) |
| Auto Color Correlogram | 0.211 / 0.1 / 30 | – / 0.1 / 29 | – / 0.1 / 24 | – / 0.10 / 30 | – / 0.10 / 29 |
| CEDD | 0.225 / 0.10 / 19 | 0.240 / 0.26 / 24 | 0.238 / 0.28 / 22 | – / 0.10 / 19 | – / 0.1 / 25 |
| FCTH | 0.221 / 0.10 / 20 | – / 0.10 / 24 | – / 0.10 / 23 | – / 0.10 / 20 | 0.219 / 0.1 / 26 |
| Fuzzy Color Histogram | 0.208 / 0.10 / 30 | – / 0.10 / 28 | – / 0.1 / 29 | – / 0.10 / 30 | – / 0.10 / 27 |
| General Color Layout | – / 0.02 / 32 | 0.199 / 0.10 / 31 | – / 0.02 / 35 | – / 0.02 / 32 | – |
| Jpeg Coefficient Histogram | 0.229 / 0.10 / 9 | 0.243 / 0.34 / 19 | 0.236 / 0.18 / 13 | – / 0.10 / 9 | – |
| Gabor | – / 0.02 / 35 | – / 0.02 / 35 | 0.196 / 0.015 / 36 | – / 0.02 / 35 | – / 0.015 / 36 |
| Tamura | – / 0.06 / 29 | – / 0.05 / 33 | – / 0.10 / 27 | – / 0.06 / 29 | – |
| Grid 20x20 – RGB | – / 0.1 / 30 | – / 0.1 / 30 | – / 0.10 / 33 | 0.213 / 0.1 / 27 | – |
| Grid 20x20 – RGB + dev. | – / 0.1 / 27 | – | – / 0.1 / 29 | – / 0.1 / 30 | – |
| Grid 20x20 – RGB + dev. + hes | – / 0.12 / 29 | – / 0.1 / 30 | 0.219 / 0.14 / 30 | – / 0.1 / 32 | – |
| Grid 20x20 – HSV | – / 0.1 / 30 | 0.213 / 0.1 / 31 | – / 0.01 / 34 | 0.215 / 0.1 / 30 | – |
| Grid 20x20 – HSV + dev. | – / 0.1 / 27 | – / 0.1 / 30 | 0.216 / 0.14 / 36 | – / 0.1 / 27 | – |
| Grid 20x20 – HSV + dev. + hes | – / 0.1 / 30 | – / 0.1 / 29 | 0.219 / 0.14 / 36 | – / 0.1 / 30 | – |
| Grid 20x20 – Co-occurrence Matrix | – | – / 0.06 / 33 | – / 0.06 / 33 | – / 0.02 / 32 | – |
| Grid 5x5 – RGB | – / 0.10 / 30 | – / 0.1 / 30 | – / 0.1 / 30 | – / 0.10 / 30 | – |
| Grid 5x5 – RGB + dev. | – / 0.10 / 30 | – / 0.1 / 30 | – / 0.1 / 30 | – / 0.10 / 30 | – |
| Grid 5x5 – RGB + dev. + hes | – / 0.10 / 30 | – / 0.1 / 30 | 0.217 / 0.1 / 30 | – / 0.10 / 30 | – |
| Grid 5x5 – HSV | – / 0.10 / 30 | – / 0.1 / 29 | – / 0.1 / 33 | – / 0.1 / 29 | – |
| Grid 5x5 – HSV + dev. | – / 0.10 / 29 | 0.219 / 0.1 / 25 | – / 0.10 / 24 | 0.221 / 0.1 / 29 | – |
| Grid 5x5 – HSV + dev. + hes | – / 0.10 / 27 | – / 0.1 / 25 | 0.225 / 0.16 / 27 | – / 0.1 / 27 | – |
| Grid 5x5 – Co-occurrence Matrix | – / 0.18 / 32 | – / 0.1 / 32 | – / 0.10 / 28 | 0.219 / 0.18 / 32 | – |

4 Conclusion

During the training and optimization process the parameters of the classifiers were tuned using the F-measure (the harmonic mean of precision and recall) instead of the Average Precision. As a consequence, in all submitted runs we optimized the annotation length and provided annotation vectors containing only {−1, 1} values. Using vectors prepared in such a way results in a low Average Precision score.
The published results show that, according to the AP measure, the highest quality was reached by the multi-class PDA classifier with CEDD features, while the multi-PATSI annotator performed worst in this comparison.

Table 3. Average Precision for the 30 best annotated concepts in all submitted runs (each cell lists the concept and its AP)

| # | PATSI (RGB) | PATSI (HSV) | multi-PATSI | PDA (CEDD) | PDA (JpegCoefficient) |
| 1 | Neutral Illumination 0.947 | Neutral Illumination 0.947 | Neutral Illumination 0.947 | Neutral Illumination 0.973 | Neutral Illumination 0.965 |
| 2 | No Visual Season 0.883 | No Visual Season 0.883 | No Visual Season 0.883 | No Visual Season 0.921 | No Visual Season 0.896 |
| 3 | No Persons 0.726 | No Blur 0.719 | No Persons 0.717 | No Persons 0.776 | No Blur 0.833 |
| 4 | No Blur 0.723 | No Persons 0.718 | No Blur 0.716 | No Blur 0.760 | No Persons 0.767 |
| 5 | natural 0.635 | natural 0.635 | natural 0.635 | Outdoor 0.672 | natural 0.654 |
| 6 | Outdoor 0.608 | Outdoor 0.586 | Outdoor 0.555 | Day 0.669 | Outdoor 0.644 |
| 7 | Day 0.598 | Day 0.578 | Day 0.535 | natural 0.663 | Day 0.629 |
| 8 | Sky 0.542 | Sky 0.514 | cute 0.511 | cute 0.571 | cute 0.565 |
| 9 | cute 0.511 | cute 0.511 | No Visual Time 0.443 | No Visual Time 0.554 | No Visual Time 0.563 |
| 10 | Plants 0.480 | No Visual Time 0.431 | Sky 0.369 | Sky 0.504 | Partly Blurred 0.530 |
| 11 | Landscape Nature 0.438 | Landscape Nature 0.409 | Visual Arts 0.325 | Plants 0.446 | Sky 0.462 |
| 12 | No Visual Time 0.431 | Plants 0.389 | male 0.307 | male 0.387 | male 0.432 |
| 13 | Clouds 0.408 | Clouds 0.369 | Clouds 0.294 | No Visual Place 0.365 | Plants 0.375 |
| 14 | Visual Arts 0.325 | Visual Arts 0.325 | No Visual Place 0.291 | Partly Blurred 0.363 | No Visual Place 0.351 |
| 15 | Sunset Sunrise 0.315 | Sunset Sunrise 0.312 | Partly Blurred 0.286 | Clouds 0.354 | Clouds 0.346 |
| 16 | Indoor 0.315 | male 0.302 | Plants 0.279 | Visual Arts 0.330 | Indoor 0.338 |
| 17 | male 0.304 | Indoor 0.299 | Night 0.274 | Indoor 0.316 | Citylife 0.325 |
| 18 | Citylife 0.296 | Citylife 0.297 | Sunset Sunrise 0.269 | Landscape Nature 0.302 | Visual Arts 0.320 |
| 19 | Partly Blurred 0.284 | Partly Blurred 0.283 | Citylife 0.266 | Adult 0.295 | female 0.294 |
| 20 | Night 0.281 | No Visual Place 0.275 | Indoor 0.255 | Citylife 0.294 | Landscape Nature 0.289 |
| 21 | No Visual Place 0.275 | Night 0.260 | Adult 0.228 | Night 0.272 | Single Person 0.287 |
| 22 | Sunny 0.271 | Trees 0.252 | Macro 0.222 | female 0.261 | Adult 0.284 |
| 23 | Trees 0.268 | Sunny 0.250 | Building Sights 0.219 | Single Person 0.249 | Family Friends 0.276 |
| 24 | female 0.255 | Macro 0.235 | Portrait 0.215 | Sunny 0.244 | Building Sights 0.270 |
| 25 | Water 0.247 | Aesthetic Impression 0.234 | Water 0.211 | Building Sights 0.243 | Water 0.258 |
| 26 | Park Garden 0.238 | Park Garden 0.234 | female 0.211 | Aesthetic Impression 0.239 | Portrait 0.241 |
| 27 | Family Friends 0.235 | Water 0.229 | Single Person 0.210 | Portrait 0.236 | Macro 0.233 |
| 28 | Aesthetic Impression 0.229 | Vehicle 0.221 | Family Friends 0.203 | Family Friends 0.235 | Aesthetic Impression 0.230 |
| 29 | Macro 0.219 | Adult 0.215 | Landscape Nature 0.201 | Park Garden 0.233 | Vehicle 0.205 |
| 30 | Adult 0.216 | Single Person 0.197 | Aesthetic Impression 0.199 | Water 0.224 | Sunny 0.196 |

The results show that the annotation transfer method is a very interesting concept. However, it will be necessary to use not only global image characteristics but also local features, as well as adaptive metric functions.

References

[1] Makadia, A., Pavlovic, V., Kumar, S.: A new baseline for image annotation. In: ECCV '08, Berlin, Heidelberg, Springer-Verlag (2008) 316–329
[2] Stanek, M., Broda, B., Kwasnicka, H.: PATSI – photo annotation through finding similar images with multivariate Gaussian models.
    Lecture Notes in Computer Science, International Conference on Computer Vision and Graphics (2010)
[3] Stanek, M., Maier, O., Kwasnicka, H.: PATSI – photo annotation through similar images with annotation length optimization. In: Intelligent Information Systems. Publishing House of University of Podlasie (2010) 219–232
[4] Nowak, S., Huiskes, M.: New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010. In: Working Notes of CLEF 2010 (2010)
[5] Lux, M., Chatzichristofis, S.A.: LIRE: Lucene Image Retrieval – an extensible Java CBIR library. In: MM '08: Proceedings of the 16th ACM International Conference on Multimedia, New York, NY, USA, ACM (2008) 1085–1088
[6] Ghosh, D.: Penalized discriminant methods for the classification of tumors from gene expression data. Biometrics 59(4) (2003) 992–1000
[7] Hastie, T., Buja, A., Tibshirani, R.: Penalized discriminant analysis. Annals of Statistics 23 (1995) 73–102
[8] Chatzichristofis, S., Boutalis, Y.: CEDD: Color and Edge Directivity Descriptor – a compact descriptor for image indexing and retrieval. Computer Vision Systems (2008) 312–322
[9] Chang, S.F., Sikora, T., Puri, A.: Overview of the MPEG-7 standard. IEEE Transactions on Circuits and Systems for Video Technology 11(6) (2001) 688–695
[10] Chatzichristofis, S.A., Boutalis, Y.S.: FCTH: Fuzzy Color and Texture Histogram – a low-level feature for accurate image retrieval. In: International Workshop on Image Analysis for Multimedia Interactive Services (2008) 191–196
[11] Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 6 (1978)
[12] Goodrum, A.: Image information retrieval: An overview of current research. Informing Science 3 (2000)
[13] Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: CVPR '97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, IEEE Computer Society (1997) 762
[14] Zhang, D., Wong, A., Indrawan, M., Lu, G.: Content-based image retrieval using Gabor texture features. In: IEEE Transactions PAMI (2000) 13–15
[15] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics 3(6) (November 1973) 610–621
[16] McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions (Wiley Series in Probability and Statistics). 2nd edn. Wiley-Interscience (March 2008)
[17] Albanese, D., Merler, S., Jurman, G., Visintainer, R., Furlanello, C.: mlpy – machine learning py (2010). http://mloss.org/software/view/66/