<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Wroclaw University of Technology Participation at ImageCLEF 2010 Photo Annotation Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Stanek</string-name>
          <email>michal.stanek@pwr.wroc.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oskar Maier</string-name>
          <email>oskar.maier@student.pwr.wroc.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Halina Kwasnicka</string-name>
          <email>halina.kwasnicka@pwr.wroc.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Wrocław University of Technology, Institute of Informatics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present three methods for image auto-annotation used by the Wroclaw University of Technology group at the ImageCLEF 2010 Photo Annotation track. All of our experiments focus on the robustness of global color and texture image features in connection with different similarity measures. To annotate images we use two versions of the PATSI algorithm, which searches for the most similar images and transfers annotations from them to the target image by applying a transfer function. We use both the simple version of the algorithm, working on a single similarity matrix, and multi-PATSI, which uses many similarity measures to obtain the final annotations. As a third approach to image auto-annotation we use Penalized Discriminant Analysis to train a multi-class classifier in a one-vs-all manner. During the training and optimization of all annotators we use the F-measure as the evaluation measure, trying to achieve its highest value on the training set. The obtained results indicate that our approach achieved a high quality measure only for a small group of terms, and that it is necessary to also take local image characteristics into account.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, Makadia et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a family of image annotation baseline
methods that are built on the hypothesis that visually similar images are likely
to share the same annotations. They treat image annotation as a process of
transferring labels from the nearest neighbours. Makadia's method does not solve the
fundamental problem of determining the number of annotations that should be
assigned to the target image; instead, a constant number of annotations
per image is assumed. The transfer is performed in two steps: all annotations from the most
similar image are rewritten, and then the most frequent words are chosen from the
whole neighbourhood until a given annotation length has been achieved.
      </p>
      <p>
        We extend Makadia's approach by constructing the PATSI (Photo Annotation
through Similar Images) annotator, which introduces a transfer function [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as well
as an optimization algorithm that can be used to find the optimal number of
neighbours and the best transfer threshold according to a specified quality
measure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. During our experiments with different similarity metrics we extended this
algorithm to multi-PATSI, which performs the annotation transfer process based on
many similarity matrices calculated using different feature sets and similarity
measures and combines the results into a final annotation based on the quality of each
annotator for specific words.
      </p>
      <p>
        At the ImageCLEF 2010 photo annotation track [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] we evaluate the PATSI and
multi-PATSI approaches with global image features. During the experiments we use
grid segmentation and statistical color information as well as features extracted
using the LIRE package [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As a third type of automatic image annotator we train a
PDA [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] classifier on CEDD [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and JPEG Coefficient Histogram [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] features in a
one-vs-all manner.
      </p>
      <p>This paper is organized as follows. In the next section we describe the
automatic image annotation methods used, with an explanation of the features, distance
measures and details of the annotation algorithms. The third section describes the
experiments and the achieved results. The paper finishes with conclusions and
remarks on possible further improvements of the method.</p>
    </sec>
    <sec id="sec-1a">
      <title>Annotation Process</title>
      <p>
        In this section we describe the automatic image annotation methods used by our
team during the ImageCLEF 2010 Photo Annotation track [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. First we focus on the
types of visual features extracted from the images and the similarity measures used to
build similarity matrices; then we describe the annotation transfer process.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Visual Features</title>
      <p>An image I in the training dataset D is represented by an n-dimensional vector
of visual features $v^I = (v_1^I, \ldots, v_n^I)$. Each visual feature is an m-dimensional
vector of low-level attributes $v_i^I = (x_{i1}^I, \ldots, x_{im}^I)$. The visual features are
extracted from the image and can represent information about color and texture
for the entire image, or only for a selected area of the image I.</p>
      <p>
        For all images in both the training and testing datasets we performed visual
feature extraction using our own feature extractor and the image descriptors
contained in the LIRE package [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We focused mainly on global image
characteristics, but we also use more local information obtained after splitting the image by a
rectangular 5-by-5 and 20-by-20 grid. The list of extracted features includes:
1. From the MPEG-7 standard [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] we use the following image descriptors calculated
for the whole image:
– Fuzzy Color Histogram – 125 dimensions
– JPEG Coefficient Histogram – 192 dimensions
– General Color Layout – 18 561 dimensions
– Color and Edge Directivity Descriptor (CEDD) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] – 120 dimensions
– Fuzzy Color and Texture Histogram (FCTH) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] – 192 dimensions
2. Tamura features – the first three of the six texture features corresponding to
human visual perception [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
– coarseness – size of the texture elements,
– contrast – picture quality,
– directionality – texture orientation.
      </p>
      <p>
        The Tamura feature vector has 16 dimensions.
3. Auto Color Correlogram features defined in [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] – 256 dimensions
4. Gabor texture features [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] – 60 dimensions
5. Statistical color and edge information of image regions (5-by-5 and 20-by-20
grid) in two color spaces, RGB and HSV (a sketch of this extraction step follows this list):
– x and y coordinates of the segment center – 2 dimensions,
– the mean value of color in each channel of the color space – 3 dimensions,
– standard deviations of color changes in each channel for a given color
space – 3 dimensions,
– mean eigenvalues of the color Hessian in each channel for a given color space
– 3 dimensions.
6. Co-occurrence Matrix [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] features calculated for each segment of the 5-by-5 and
20-by-20 segmentation – 21 dimensions
      </p>
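      <p>As an illustration of item 5, the following minimal sketch computes the per-segment color statistics with NumPy. The helper name grid_color_stats and the finite-difference discretization of the color Hessian are our assumptions for this example; the actual extractor may differ in detail.</p>
      <preformat>
import numpy as np

def grid_color_stats(image, rows=5, cols=5):
    """Per-segment statistics for an H x W x 3 float image (RGB or HSV).

    For every grid cell we collect: segment-center coordinates (2),
    per-channel mean (3), per-channel standard deviation (3) and the
    mean eigenvalue of the per-channel color Hessian (3).
    """
    h, w, _ = image.shape
    features = []
    for r in range(rows):
        for c in range(cols):
            seg = image[r * h // rows:(r + 1) * h // rows,
                        c * w // cols:(c + 1) * w // cols]
            center = [(r + 0.5) / rows, (c + 0.5) / cols]
            mean = seg.mean(axis=(0, 1))
            std = seg.std(axis=(0, 1))
            hess_eig = []
            for ch in range(3):
                # Color Hessian of one channel via repeated finite
                # differences, averaged over the segment (an assumption).
                gy, gx = np.gradient(seg[..., ch])
                gyy, _ = np.gradient(gy)
                gxy, gxx = np.gradient(gx)
                off = gxy.mean()
                hess = np.array([[gxx.mean(), off],
                                 [off, gyy.mean()]])
                hess_eig.append(np.linalg.eigvalsh(hess).mean())
            features.append(np.concatenate([center, mean, std, hess_eig]))
    return np.array(features)
      </preformat>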
    </sec>
    <sec id="sec-3">
      <title>Distance Metrics</title>
      <p>To obtain the similarity, or rather dissimilarity, between two images, we measure
the distance between vectors in a metric space and the divergence between
distributions built on the visual vectors. In our experiments we use the distance measures
described below.</p>
      <p>Minkowski distance The Minkowski distance is widely used for measuring
similarity between objects (e.g., images). The Minkowski metric between images
A and B is defined as:
$$d_{MK}(A, B) = \left( \sum_{i=1}^{n} \left| v_i^A - v_i^B \right|^p \right)^{1/p}, \quad (1)$$
where p is the Minkowski factor for the norm. In particular, when p is equal to one
and two, it is the well-known L1 and Euclidean distance, respectively.</p>
      <p>Cosine distance is a measure of similarity between two vectors of n
dimensions obtained by finding the cosine of the angle between them, often used to compare
documents in text mining:
$$d_{Cos}(A, B) = 1 - \frac{(v^A)(v^B)^T}{\|v^A\|_2 \, \|v^B\|_2}. \quad (2)$$</p>
      <p>Manhattan distance, also called the city-block distance or the taxicab metric, is
the metric of the Euclidean plane defined by:
$$d_{Manh}(A, B) = \sum_{i} \left| v_i^A - v_i^B \right|. \quad (3)$$</p>
      <p>Correlation distance measures the similarity in shape between vectors, defined
by:
$$d_{Corr}(A, B) = 1 - \frac{(v^A - \bar{v}^A)(v^B - \bar{v}^B)^T}{\|v^A - \bar{v}^A\|_2 \, \|v^B - \bar{v}^B\|_2}, \quad (4)$$
where $\|u - \bar{u}\|_2$ is the L2 norm of the difference between a vector u and its mean vector $\bar{u}$.</p>
      <p>
        Jensen–Shannon Divergence Based on the visual feature vectors $v^I$ one
can build a model $M^I$ for the image I. We can assume that $M^I$ is a
multidimensional random variable described by a multivariate normal distribution and that
all vectors $v_i^I$ are realizations of this model. The probability density function
(PDF) of the model $M^I$ is defined as:
$$M^I(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \quad (5)$$
where x is the observation vector, $\mu$ the mean vector, and $\Sigma$ the covariance
matrix. Both $\mu$ and $\Sigma$ are parameters of the model, calculated using the
Expectation–Maximization algorithm [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] on all visual features $[v_1^I, \ldots, v_n^I]$ of the image I. In
order to avoid problems with inverting the covariance matrix (i.e., matrix singularity),
one may regularize the covariance matrix. Models are
built for all images in the training set, as well as for the query image.
      </p>
      <p>The distance between the models can be computed as the Jensen–Shannon
divergence, which is a symmetrized version of the Kullback–Leibler divergence:
$$D_{JS}(M^A \| M^B) = \frac{1}{2} D_{KL}(M^A \| M^B) + \frac{1}{2} D_{KL}(M^B \| M^A), \quad (6)$$
where $M^A$, $M^B$ are the models (PDFs) of images A and B, and $D_{KL}$ is the
Kullback–Leibler divergence, which for multivariate normal distributions takes the
form:</p>
      <p>$$D_{KL}(M^A \| M^B) = \frac{1}{2} \log_e \frac{\det \Sigma_B}{\det \Sigma_A} + \frac{1}{2} \operatorname{tr}\left( \Sigma_B^{-1} \Sigma_A \right) + \frac{1}{2} (\mu_B - \mu_A)^T \Sigma_B^{-1} (\mu_B - \mu_A) - \frac{N}{2}, \quad (7)$$
where $\Sigma_A$, $\Sigma_B$ and $\mu_A$, $\mu_B$ are the covariance matrices and mean vectors of the
respective image models $M^A$ and $M^B$.</p>
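      <p>A minimal sketch of this model-based distance, assuming feature vectors stacked row-wise per image; for a single Gaussian the EM fit reduces to the sample mean and (here ridge-regularized) covariance, which we use as a shortcut:</p>
      <preformat>
import numpy as np

def fit_gaussian(features, reg=1e-3):
    """Fit the multivariate normal of eq. (5) to row-stacked feature
    vectors; `reg` adds a ridge to the covariance matrix so that it
    stays invertible (the regularization mentioned in the text)."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False) + reg * np.eye(features.shape[1])
    return mu, sigma

def kl_gaussians(mu_a, sig_a, mu_b, sig_b):
    """Kullback-Leibler divergence between two Gaussian models, eq. (7)."""
    n = mu_a.shape[0]
    inv_b = np.linalg.inv(sig_b)
    diff = mu_b - mu_a
    return 0.5 * (np.log(np.linalg.det(sig_b) / np.linalg.det(sig_a))
                  + np.trace(inv_b @ sig_a)
                  + diff @ inv_b @ diff
                  - n)

def js_gaussians(model_a, model_b):
    """Symmetrized divergence used as the image distance, eq. (6)."""
    return 0.5 * kl_gaussians(*model_a, *model_b) \
         + 0.5 * kl_gaussians(*model_b, *model_a)
      </preformat>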
    </sec>
    <sec id="sec-4">
      <title>Automatic Image Annotation Methods</title>
      <p>We use three methods of automatic image annotation: the PATSI (Photo
Annotation through Similar Images) annotator, the multi-PATSI annotator and a
multi-class PDA classifier. Details of all of these methods are described below.</p>
      <p>PATSI Annotator In the PATSI approach, for a query image Q, a vector of the k most similar images
from the training dataset D needs to be found, based on the similarity distance
measure d. Let $[r_1, \ldots, r_k]$ be the ranking of the k most similar images ordered
decreasingly by similarity. Based on the hypothesis that images similar in
appearance are likely to share the same annotation, keywords from the nearest
neighbours are transferred to the query image. All labels for the image at
position $r_i$ in the ranking are transferred with a value designated by the transfer
function $\varphi(r_i)$.</p>
      <p>To ensure that labels from more similar images have a larger impact on the
resulting annotation, we define $\varphi$ as
$$\varphi(r_i) = \frac{1}{i}, \quad (8)$$
where $r_i$ is the image at position i in the ranking. All words associated with
image $r_i$ are then transferred to the resulting annotation with the associated
transfer value $1/i$. If a word has been transferred before, the transfer values
are summed.</p>
      <p>The resulting query image annotation consists of all the words whose transfer
values are greater than a specified threshold t. The threshold value t has an
impact on the resulting annotation length, and its optimal value, as well as the
optimal number of neighbours k that should be taken into account during
the annotation process, must both be found using an optimization process. The
outline of the PATSI annotation method is presented in Figure 1 and
summarized in Algorithm 1.</p>
      <p>The optimal parameters k and t differ greatly not only between different databases,
but also between feature sets, distance measures and transfer
functions. There exists no choice of them that would be suitable in all cases;
we need to adjust them in each specific case.</p>
      <p>Algorithm 1 PATSI image annotation algorithm
Require: D – training dataset, d – distance function, Q – annotation quality function, $\varphi$ – transfer function
{Preparation Phase}
1: calculate and store the visual features of all images in the training dataset D
2: calculate the similarity matrix using the distance function d between all images in the training dataset D
{Optimization Phase}
3: choose values of k and t maximizing the quality function Q on the training dataset
{Query Phase}
4: calculate the visual features of the query image
5: calculate the distance from the query image to all other images in the training dataset D
6: take the k images with the smallest distances between the models and create a ranking of those images
7: transfer all words from the images in the ranking with the value $\varphi(r)$, where r is the position of the image in the ranking
8: as the final annotation take the words whose summed transfer values are greater than or equal to the provided threshold t</p>
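      <p>A minimal sketch of the query phase of Algorithm 1 (steps 5–8); the function name patsi_annotate and the list-of-pairs interface for the training data are assumptions made for this illustration:</p>
      <preformat>
from collections import defaultdict

def patsi_annotate(query_features, train_images, dist, k, t):
    """Query phase of Algorithm 1 (steps 5-8).

    train_images: list of (features, words) pairs;
    dist:         one of the distance functions above.
    """
    # Step 5: distance from the query image to every training image.
    ranked = sorted(train_images,
                    key=lambda img: dist(query_features, img[0]))
    # Steps 6-7: rank the k nearest images and transfer their words
    # with the value phi(r_i) = 1/i of eq. (8), summing repeated words.
    transfer = defaultdict(float)
    for i, (_, words) in enumerate(ranked[:k], start=1):
        for word in words:
            transfer[word] += 1.0 / i
    # Step 8: keep words whose summed transfer value reaches t.
    return {w for w, v in transfer.items() if v >= t}
      </preformat>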
      <p>
        Finding t and k proves to be a non-trivial task. Commonly used
optimization solvers are inapplicable due to the non-linear character of the quality
function Q (discrete in k and continuous in t). To efficiently find t and
k we propose and use an iterative refinement algorithm, described
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
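      <p>
        The iterative refinement procedure itself is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; as a simpler, slower stand-in, a plain grid search over k and t maximizing the F-measure on a held-out set could look as follows (patsi_annotate is the sketch above; the search ranges are arbitrary assumptions):
      </p>
      <preformat>
import numpy as np

def f_measure(predicted, truth):
    """F-measure (harmonic mean of precision and recall) on word sets."""
    tp = len(predicted.intersection(truth))
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(truth)
    return 2 * p * r / (p + r)

def optimize_k_t(val_set, train_images, dist,
                 k_range=range(1, 31), t_grid=np.linspace(0.1, 3.0, 30)):
    """Pick (k, t) maximizing the mean F-measure on a validation set of
    (query_features, true_words) pairs; a brute-force stand-in for the
    iterative refinement of [3]."""
    best_k, best_t, best_score = None, None, -1.0
    for k in k_range:
        for t in t_grid:
            score = np.mean([f_measure(
                patsi_annotate(q, train_images, dist, k, t), truth)
                for q, truth in val_set])
            if score > best_score:
                best_k, best_t, best_score = k, t, score
    return best_k, best_t
      </preformat>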
      <p>Multi-PATSI Annotator During the experiments we observed that some of the
features, as well as distance metrics, are more suitable for detecting some groups of
words, while showing weak performance for others. By combining them
we can increase the overall annotation performance. We propose the multi-PATSI
method, which takes advantage of this observation by joining together the strengths
of a number of annotation techniques.</p>
      <p>The overall schema of the multi-PATSI approach is presented in Figure 2. In the
first step we run the PATSI algorithm separately for each feature set and distance
function to obtain annotation vectors. Each element of those vectors represents
whether a word should be assigned to the query image Q or not (class $\{-1, 1\}$).</p>
      <p>For each of the PATSI annotators, a performance vector is calculated at the
learning stage. The performance vector corresponds to the efficiency of the PATSI
annotator for each of the annotated words on the testing set.</p>
      <p>
        For each PATSI annotator the resulting annotation vector is multiplied by
its performance vector to obtain a weighted annotator response. All weighted
responses are then summed together, creating the final annotation. All concepts which
obtain a value greater than a threshold $t_{multi}$ are treated as the final annotation for
a query image Q. The optimal threshold value $t_{multi}$ can be calculated using
cross-validation and an optimization technique such as iterative refinement [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
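      <p>A minimal sketch of this combination step, assuming each annotator has produced a vector of $\{-1, 1\}$ decisions and a per-word performance vector (e.g., per-word F-measures estimated during training):</p>
      <preformat>
import numpy as np

def multi_patsi(decisions, performances, t_multi):
    """Combine several PATSI annotators as described above.

    decisions:    list of arrays in {-1, +1}, one per annotator,
                  each of length n_words.
    performances: list of arrays of per-word quality weights.
    Returns indices of the words forming the final annotation.
    """
    combined = np.zeros(len(decisions[0]))
    for dec, perf in zip(decisions, performances):
        combined += np.asarray(dec) * np.asarray(perf)  # weighted response
    return np.flatnonzero(combined > t_multi)
      </preformat>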
      <p>
        Multi-Class Classification As a third annotation method we use the Penalized
Discriminant Analysis classifier [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] from the Python Machine Learning Module
MLPY [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] in a one-vs-all scenario.
      </p>
      <p>In this approach, for each concept we train a separate PDA classifier using
the extracted image features. We use the features of all images annotated with a
specific concept as positive examples and the others as negative. In training we use
four-fold cross-validation.</p>
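      <p>
        A schematic sketch of this one-vs-all setup; make_classifier is a hypothetical factory standing in for the PDA classifier from MLPY [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (whose exact constructor we do not reproduce here), assumed to expose fit/predict on $\{-1, 1\}$ labels:
      </p>
      <preformat>
import numpy as np

def train_one_vs_all(features, annotations, concepts, make_classifier):
    """Train one binary classifier per concept.

    features:    (n_images, n_dims) array, e.g. CEDD descriptors.
    annotations: dict mapping concept -> indices of images carrying it.
    """
    classifiers = {}
    for concept in concepts:
        # Images annotated with the concept are positive (+1),
        # all remaining images are negative (-1).
        y = np.full(len(features), -1)
        y[list(annotations.get(concept, []))] = 1
        clf = make_classifier()  # hypothetical PDA factory
        clf.fit(features, y)
        classifiers[concept] = clf
    return classifiers

def annotate(classifiers, image_features):
    """Assign every concept whose binary classifier votes positive."""
    x = image_features.reshape(1, -1)
    return {c for c, clf in classifiers.items() if clf.predict(x)[0] == 1}
      </preformat>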
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>We submitted five runs for the annotators and feature sets described in the previous
section:
1. PATSI with Kullback–Leibler divergence – HSV color space and 20-by-20 grid,
2. PATSI with Kullback–Leibler divergence – RGB color space and 20-by-20 grid,
3. Multi-PATSI with the features presented in Table 2,
4. PDA classifier with CEDD features,
5. PDA classifier with JPEG Coefficient Histogram features.</p>
      <p>The official results of the five runs in terms of Average Precision (AP),
Average Equal Error Rate (Avg. EER) and Average Area Under Curve (Avg. AUC)
are reported in Table 1. A detailed overview of the annotation quality of
each of the submitted methods for the 30 best-annotated words is presented in
Table 3.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>During the training and optimization process the parameters of the classifiers
were tuned using the F-measure (the harmonic mean of precision and recall) instead
of the Average Precision. As a consequence, in all submitted annotation
results we optimized the annotation length by providing annotation vectors containing
only $\{-1, 1\}$ values. Using vectors prepared in such a way results in low Average
Precision quality.</p>
      <p>The published results show that the highest quality according to
the AP measure was reached by the multi-class PDA classifier with CEDD features. On the
other hand, the worst in the comparison was the multi-PATSI annotator.</p>
      <p>[Table 3, giving the per-concept Average Precision (AP) of the 30 best-annotated concepts for each submitted run, was not recoverable from the extracted source; Tables 1 and 2 are likewise missing from the extracted text.]</p>
      <p>The results show that the method of transferring annotations is a
very interesting concept. However, it will be necessary to use, in addition to the global
characteristics of the image, also local features as well as adaptive metric
functions.</p>
      <p>This work is partially financed from the Ministry of Science and Higher
Education Republic of Poland resources in 2008–2010 years as a Poland–Singapore joint
research project 65/N-SINGAPORE/2007/0.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Makadia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlovic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A new baseline for image annotation</article-title>
          .
          <source>In: ECCV '08</source>
          , Berlin, Heidelberg, Springer-Verlag (
          <year>2008</year>
          )
          <fpage>316</fpage>
          -
          <lpage>329</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Stanek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broda</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwasnicka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Patsi - photo annotation through finding similar images with multivariate gaussian models</article-title>
          .
          <source>Lecture Notes in Computer Science, International Conference on Computer Vision and Graphics</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Stanek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwasnicka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>PATSI - photo annotation through similar images with annotation length optimization</article-title>
          .
          <source>In: Intelligent information systems</source>
          . Publishing House of University of Podlasie (
          <year>2010</year>
          )
          <fpage>219</fpage>
          -
          <lpage>232</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>New strategies for image annotation: Overview of the photo annotation task at imageclef 2010</article-title>
          .
          <source>In: Working Notes of CLEF 2010</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chatzichristofis</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Lire: lucene image retrieval: an extensible java cbir library</article-title>
          .
          <source>In: MM '08: Proceeding of the 16th ACM international conference on Multimedia</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>1085</fpage>
          -
          <lpage>1088</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Penalized discriminant methods for the classification of tumors from gene expression data</article-title>
          .
          <source>Biometrics</source>
          <volume>59</volume>
          (
          <issue>4</issue>
          ) (
          <year>2003</year>
          )
          <fpage>992</fpage>
          -
          <lpage>1000</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buja</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Penalized discriminant analysis</article-title>
          .
          <source>Annals of Statistics</source>
          <volume>23</volume>
          (
          <year>1995</year>
          )
          <fpage>73</fpage>
          -
          <lpage>102</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chatzichristofis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boutalis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Cedd: Color and edge directivity descriptor: A compact descriptor for image indexing and retrieval</article-title>
          .
          <source>Computer Vision Systems</source>
          (
          <year>2008</year>
          )
          <fpage>312</fpage>
          -
          <lpage>322</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sikora</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the MPEG-7 Standard</article-title>
          .
          <source>IEEE Trans. Circuits and Systems for Video Technology</source>
          <volume>11</volume>
          (
          <issue>6</issue>
          ) (
          <year>2001</year>
          )
          <fpage>688</fpage>
          -
          <lpage>695</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Chatzichristofis</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boutalis</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          :
          <article-title>FCTH: Fuzzy color and texture histogram - a low level feature for accurate image retrieval</article-title>
          .
          <source>In: International Workshop on Image Analysis for Multimedia Interactive Services</source>
          (
          <year>2008</year>
          )
          <fpage>191</fpage>
          -
          <lpage>196</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Tamura</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamawaki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Textural features corresponding to visual perception</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          <volume>6</volume>
          (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Goodrum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Image information retrieval: An overview of current research</article-title>
          .
          <source>Informing Science</source>
          <volume>3</volume>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zabih</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Image indexing using color correlograms</article-title>
          .
          <source>In: CVPR '97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97)</source>
          , Washington, DC, USA, IEEE Computer Society (
          <year>1997</year>
          )
          <fpage>762</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Indrawan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Content-based image retrieval using gabor texture features</article-title>
          .
          <source>In: IEEE Transactions PAMI</source>
          . (
          <year>2000</year>
          )
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Haralick</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shanmugam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinstein</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Textural features for image classification</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          <volume>3</volume>
          (
          <issue>6</issue>
          ) (
          <year>November 1973</year>
          )
          <fpage>610</fpage>
          -
          <lpage>621</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>McLachlan</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <source>The EM Algorithm and Extensions (Wiley Series in Probability and Statistics)</source>
          .
          <volume>2</volume>
          edn. Wiley-Interscience (
          <year>March 2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Albanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Visintainer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlanello</surname>
            ,
            <given-names>C.:</given-names>
          </string-name>
          <article-title>mlpy - machine learning py</article-title>
          (
          <year>2010</year>
          ) http://mloss.org/software/view/66/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>