<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MMIS at ImageCLEF 2009: Non-parametric Density Estimation Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ainhoa Llorente</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suzanne Little</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Ruger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <addr-line>Milton Keynes MK7 6AA</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the work done by the MMIS group at ImageCLEF 2009. We submitted five different runs to the Photo Annotation task. These runs were based on two non-parametric density estimation models. The first one evaluates a set of visual features and proposes a better, weighted set of features. The second approach uses keyword correlation to compute semantic similarity measures using several knowledge sources: the training set of the collection, the Google Web search engine, WordNet, and Wikipedia. Results are evaluated under two different metrics, one based on ROC curves and the other on a hierarchical measure proposed by the organisers. Our results are quite encouraging; under the first metric our best run was located between the median and the top quartile, and under the second metric our best run was between the first quartile and the median.</p>
      </abstract>
      <kwd-group>
        <kwd>Content Based Image Retrieval</kwd>
        <kwd>Object Recognition</kwd>
        <kwd>Thesauruses</kwd>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
        <kwd>I.4 [Image Processing and Computer Vision]</kwd>
        <kwd>I.4.8 Scene Analysis</kwd>
        <kwd>I.4.9 Applications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we describe the experiments performed by the MMIS group at ImageCLEF 2009.
We participated in the Large Scale Visual Concept Detection and Annotation Task. The main goal
of this task is, as described in [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ], given a training set of 5,000 images manually annotated with
words from a vocabulary of 53 visual concepts, to automatically provide annotations for
a test set of 13,000 images. The visual concepts are organised in a small ontology, so participants
may take advantage of the hierarchical order of the concepts and the relations among them to
better accomplish the annotation task. Another important goal of this year's competition is to
reflect on the influence of large amounts of data and concepts on the annotation task, and on
whether or not an ontology can help.
      </p>
      <p>We submitted five runs in total. Each of them is based on a different non-parametric
density estimation model, placing emphasis on a different aspect of the research field. For
instance, the run MMIS 33 2 1245434554581.txt evaluates a sequence of possible image
feature selections in order to propose a better, weighted set of features, while the other four runs,
MMIS 33 2 1245586552541.txt, MMIS 33 2 1245601239738.txt, MMIS 33 2 1245611281967.txt,
and MMIS 33 2 1245674693001.txt, attempt to improve a baseline probabilistic model by taking
advantage of the correlation between keywords, computing semantic similarity measures using
different knowledge bases.</p>
      <p>
        Evaluation of results has been done under two different metrics: one is based on ROC curves [4]
and proposes the Equal Error Rate (EER) and the Area under the Curve (AUC) as measures, while the
second metric is the hierarchical measure proposed by [
        <xref ref-type="bibr" rid="ref9">14</xref>
        ] that considers the relations between
concepts and the agreement of annotators on concepts. All in all, our results are quite encouraging:
under the first metric our best run was located between the median and the top quartile, and under
the second metric our best run was between the first quartile and the median.
      </p>
      <p>The rest of this paper is organised as follows. Section 2 provides an introduction to
non-parametric density estimation. Section 3 describes the first approach followed, while Section 4
illustrates the second one. Then, our evaluation results are discussed in Section 5. Finally,
Section 6 presents our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Non-parametric Density Estimation</title>
      <p>
        Both approaches followed in this research are variations of the probabilistic framework developed
by Yavlinsky et al. [
        <xref ref-type="bibr" rid="ref11">16</xref>
        ], who used global features together with non-parametric density
estimation. This approach is based on Bayes' rule, the ultimate goal being to model f(x|ω) for each
annotation keyword ω, where x is a feature vector representing a test image. The non-parametric
approach is employed because the distributions of image features have irregular shapes that do
not resemble any simple parametric form a priori.
      </p>
      <p>The function f(x|ω) is estimated with a kernel k as:</p>
      <p>f(x|ω) = (1/nC) Σ_{i=1..n} k(x − x(ω_i); h),   (1)</p>
      <p>where x(ω_1), x(ω_2), ..., x(ω_n) is a sample of feature vectors from the training set labelled with the
keyword ω, and x = (x_1, ..., x_d) is a vector of real-valued image features.</p>
      <p>The approach explained in Section 3 places a d-dimensional Laplacian kernel over each point x(i):</p>
      <p>k_L(t; h) = Π_{l=1..d} (1/(2h_l)) exp(−|t_l|/h_l),   (2)</p>
      <p>while the approach described in Section 4 uses a Gaussian kernel:</p>
      <p>k_G(t; h) = Π_{l=1..d} (1/(√(2π) h_l)) exp(−(1/2)(t_l/h_l)²),   (3)</p>
      <p>where t = x − x(i) and h_l is the bandwidth of the kernel, which is set by scaling the sample standard
deviation of feature component l by a common constant.</p>
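As a concrete (if simplified) illustration of this estimator, the sketch below implements the kernel density estimate with a Gaussian kernel in Python with NumPy. It is not the code used for the submitted runs; the toy data and the bandwidth scaling constant of 0.5 are assumptions.

```python
import numpy as np

def gaussian_kernel(t, h):
    # Product of d one-dimensional Gaussian kernels:
    # prod_l 1/(sqrt(2*pi)*h_l) * exp(-(1/2)*(t_l/h_l)^2)
    return np.prod(np.exp(-0.5 * (t / h) ** 2) / (np.sqrt(2 * np.pi) * h))

def density(x, samples, h):
    # Non-parametric estimate of f(x|w) from the training vectors
    # labelled with keyword w (normalising constant C = 1 here).
    return sum(gaussian_kernel(x - s, h) for s in samples) / len(samples)

# Toy data: 50 four-dimensional feature vectors labelled with keyword w.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(50, 4))
# Bandwidth per component: scaled sample standard deviation (constant assumed 0.5).
h = 0.5 * train.std(axis=0)
score = density(np.zeros(4), train, h)  # f(x|w) for a test vector at the origin
```

An image would then be annotated with the keywords whose estimated densities, combined through Bayes' rule, yield the highest posterior probabilities.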
      <sec id="sec-2-1">
        <title>Image Features</title>
        <p>A key aspect of the non-parametric density estimation approach is the set of global visual features used.
The algorithm described in Section 3 used four features: CIELAB and HSV colour descriptors
combined with Tamura and Gabor texture descriptors, while our second algorithm (Section 4) combines the
CIELAB colour feature with the Tamura texture.</p>
        <p>
          CIE L*a*b* (CIELAB) [
          <xref ref-type="bibr" rid="ref2">7</xref>
          ] is the most complete colour space specified by the International
Commission on Illumination (CIE). Its three coordinates represent the lightness of the colour
(L*), its position between red/magenta and green (a*), and its position between yellow and blue
(b*). The histogram was calculated over two bins for each coordinate.
        </p>
        <p>HSV is a cylindrical colour space with H (hue) being the angular, S (saturation) the radial, and
V (brightness) the height component. The H, S and V axes are subdivided linearly (rather than
by geometric volume) into two bins each. The HSV colour histogram is normalised so that its
components sum to one.</p>
        <p>
          The Tamura texture feature [
          <xref ref-type="bibr" rid="ref10">15</xref>
          ] is computed using three main texture features called
"contrast", "coarseness", and "directionality". Contrast aims to capture the dynamic range of grey
levels in an image. Coarseness has a direct relationship to scale and repetition rates, and was
considered by Tamura et al. as the most fundamental texture feature. Finally, directionality is
a global property over a region. The histogram was calculated over two bins for each feature.
        </p>
        <p>The process for extracting each of these features is as follows: each image is divided into nine
equal rectangular tiles, and the mean and second central moment per channel are calculated
in each tile. The resulting feature vector is obtained by concatenating all the vectors extracted
from each tile.</p>
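The tiling step can be sketched as follows. This is an illustrative reconstruction, not the extraction code used by the group; the function name and the NumPy array representation of an image are assumptions.

```python
import numpy as np

def tiled_moments(image, tiles=3):
    # Split an H x W x C image into tiles x tiles equal rectangles and,
    # for each tile, concatenate the per-channel mean and second central
    # moment (variance), as described above.
    h, w, _ = image.shape
    feats = []
    for i in range(tiles):
        for j in range(tiles):
            tile = image[i * h // tiles:(i + 1) * h // tiles,
                         j * w // tiles:(j + 1) * w // tiles]
            feats.extend(tile.mean(axis=(0, 1)))  # mean per channel
            feats.extend(tile.var(axis=(0, 1)))   # second central moment
    return np.array(feats)

# Nine tiles x three channels x two moments = 54 values per image.
vec = tiled_moments(np.zeros((90, 90, 3)))
```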
        <p>
          The final feature extracted is a texture descriptor produced by applying a Gabor filter to enable
filtering in the frequency and spatial domains. Our implementation is based on [
          <xref ref-type="bibr" rid="ref5">10</xref>
          ]. To each image
we applied a bank of four orientation and six scale sensitive filters that map each image point to a
point in the frequency domain. This feature was calculated on the whole image rather than using
the tiling approach.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Weighted Global Features</title>
      <p>
        The original implementation of this algorithm used two features: CIELAB and Tamura.
Subsequent work evaluated a sequence of possible feature selections [
        <xref ref-type="bibr" rid="ref3">8</xref>
        ] and proposed a better, weighted
set of features. The feature sets to be evaluated were constructed based on information from
the existing literature on visual feature selection, and attempted to avoid decreasing performance
due to redundant features or multi-variate prediction.
      </p>
      </p>
      <p>
        The feature set proposed from this set of evaluations added two additional features, HSV
colour and Gabor texture, to the original CIELAB and Tamura descriptors. These features were
weighted at CIELAB - 0.75, HSV - 0.5, Tamura - 0.5 and Gabor - 0.5. This set improved the
mean average precision when evaluated on the standard Corel5k dataset [3] and the IAPR TC12
dataset used for ImageCLEF 2006 [
        <xref ref-type="bibr" rid="ref1">6</xref>
        ].
      </p>
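The weights above can be applied, for example, as a weighted average of the per-feature concept scores. The paper does not spell out the exact fusion rule, so the combination below (and the sample score values) are assumptions for illustration.

```python
# Feature weights proposed by the evaluation in [8].
WEIGHTS = {"cielab": 0.75, "hsv": 0.5, "tamura": 0.5, "gabor": 0.5}

def combine(scores, weights=WEIGHTS):
    # Weighted average of per-feature concept scores; one plausible
    # reading of how the weighted feature set could be used.
    total = sum(weights.values())
    return sum(weights[f] * s for f, s in scores.items()) / total

# Hypothetical per-feature scores for one concept on one test image.
p = combine({"cielab": 0.8, "hsv": 0.6, "tamura": 0.4, "gabor": 0.2})
```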
      <p>
        The run is labelled MMIS 33 2 1245434554581.txt and is based on the approach used in
[
        <xref ref-type="bibr" rid="ref11">16</xref>
        ]. The four chosen features were extracted from the training set to train the non-parametric
density estimation annotator which then provided the probability of each concept being present
in the test image. Manhattan distance was used for all features.
      </p>
      <p>This algorithm represented a straightforward approach that exploited only the global low-level
features and the supervised learning of a prediction model. We predicted that this set of features
would provide good coverage of the colour and texture space and sufficient detail without placing
an excessive calculation burden on the system. Initial tests using ten-fold cross-validation on the
training set reinforced this expectation.</p>
    </sec>
    <sec id="sec-4">
      <title>Exploiting Word Correlations to Compute Semantic Similarities</title>
      <p>
        Early attempts at automated image annotation focused on algorithms that explored the
correlation between words and image features. More recently, some efforts attempt
to benefit from exploiting the correlation between words by computing semantic similarity measures.
Among the many uses of the concept "semantic similarity", we refer to the definition by Miller and
Charles [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ], who consider it as the degree of contextual interchangeability, or the degree to which
one word can be replaced by another in a certain context. Consequently, two words are similar
if they refer to entities that are likely to co-occur, like "mountains" and "vegetation",
"beach" and "water", "buildings" and "road", etc. In this research we use the terms semantic
similarity and semantic relatedness interchangeably.
      </p>
      <p>This non-parametric density estimation model exploits the statistical correlation between words
by computing semantic similarity measures using different knowledge bases. We propose four
versions of this model that differ in the knowledge base used as source of information and the
semantic similarity measure employed. The knowledge bases used are the training set of the
collection, the Google Web search engine, WordNet, and Wikipedia. The semantic similarity measures
used are explained in Section 4.2.</p>
      <p>The process can be described as follows. We calculate the probability of each concept
being present in each image of the test set following the non-parametric density estimation
described in Section 2. Then, a statistical keyword correlation is computed using the corresponding
knowledge base. With the help of the semantic similarity measures, and by applying some rules, the
accuracy of the final annotations is improved.</p>
      <sec id="sec-5-1">
        <title>Parameter Estimation</title>
        <p>We divided the dataset into three parts: a training set, a validation set and a test set. The
validation set is used to find the parameters of the model. Thus, we performed a 10-fold cross
validation on the training set. After that, the training and validation sets are merged to form a
new training set of 5,000 images that is used to predict the annotations in the test set of 13,000
images.</p>
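The fold construction can be sketched as follows; this is a generic k-fold index generator, not the authors' code, and the shuffling seed is arbitrary.

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    # Yield (train_idx, val_idx) pairs for k-fold cross validation
    # over n items, shuffling once up front.
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

# Ten folds over the 5,000 training images described above.
splits = list(kfold_indices(5000, k=10))
```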
      </sec>
      <sec id="sec-5-2">
        <title>Submitted Runs</title>
        <p>
          In this subsection, we describe the four submitted runs based on this approach:
MMIS 33 2 1245586552541.txt This run is based on the approach developed in [
          <xref ref-type="bibr" rid="ref4">9</xref>
          ], where the
training set is processed to generate a co-occurrence matrix that represents the probability
of two vocabulary words appearing together in a given image. This algorithm
was previously tested on the Corel5k collection and on the collection provided by the previous
edition of ImageCLEF, in 2008.
        </p>
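A keyword co-occurrence matrix of this kind can be sketched as below. The toy vocabulary and annotations are invented, and normalising rows into conditional probabilities is one plausible reading of the approach in [9], not a confirmed detail.

```python
import numpy as np

def cooccurrence(annotations, vocab):
    # Count how often two vocabulary words label the same image, then
    # normalise each row so entry (i, j) approximates P(w_j | w_i).
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for words in annotations:
        for a in words:
            for b in words:
                if a != b:
                    m[index[a], index[b]] += 1
    rows = m.sum(axis=1, keepdims=True)
    return np.divide(m, rows, out=np.zeros_like(m), where=rows > 0)

vocab = ["beach", "water", "buildings"]
P = cooccurrence([["beach", "water"], ["beach", "water"], ["buildings"]], vocab)
```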
        <p>MMIS 33 2 1245611281967.txt The semantic similarity measure used in this run is called the
web-based semantic relatedness measure, as it uses the Google Web search engine as knowledge
base. It was developed by Gracia and Mena [5], who defined the semantic relatedness between
the concepts x and y as:</p>
        <p>rel(x, y) = e^(−2·NWD(x, y)),   (4)</p>
        <p>where NWD stands for Normalized Web Distance, a generalisation of the Normalized
Google Distance (see Equation 5) extended to any web-based search engine as source of
frequencies. The Normalized Google Distance (NGD) between two terms x and y was
expressed by Cilibrasi and Vitanyi [2] as:</p>
        <p>NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}),   (5)</p>
        <p>where f(x) and f(y) are the counts for search terms x and y using Google, f(x, y) is
the number of web pages found on which both x and y occur, and N is the total number of web
pages searched by Google which, in 2007, was estimated to be more than 8bn pages.</p>
        <p>MMIS 33 2 1245674693001.txt This run uses the adapted Lesk measure applied to WordNet,
proposed by Banerjee and Pedersen in [1]. They defined the extended gloss overlap measure,
which computes the relatedness between two synsets c1 and c2 by comparing the glosses of
synsets related to them through explicit relations provided by WordNet:</p>
        <p>rel(c1, c2) = Σ score(R1(c1), R2(c2)), for all (R1, R2) ∈ relPairs.   (6)</p>
        <p>Thus, the set relPairs is defined as follows:</p>
        <p>relPairs = {(R1, R2) | R1, R2 ∈ rels; if (R1, R2) ∈ relPairs, then (R2, R1) ∈ relPairs},   (7)</p>
        <p>with rels a non-empty set that consists of one or more of the following relations:</p>
        <p>rels ⊆ {r | r is a relation defined in WordNet}.   (8)</p>
        <p>
          MMIS 33 2 1245601239738.txt This run computes the semantic relatedness between two
concepts by applying the Wikipedia measure defined by Milne and Witten. In [
          <xref ref-type="bibr" rid="ref7">12</xref>
          ], they proposed
their Wikipedia Link-based Measure (WLM), which extracts a semantic relatedness measure
between two concepts using the hyperlink structure of Wikipedia. The semantic relatedness
between concepts x and y is estimated by the angle between the vectors of the links found
in the Wikipedia articles whose titles match the concepts:
        </p>
        <p>rel(x, y) = (x⃗ · y⃗) / (|x⃗| |y⃗|),   (9)</p>
        <p>where the vectors for articles x and y are built using link counts weighted by the probability
of each link occurring:</p>
        <p>x⃗ = (w(x → l1), w(x → l2), ..., w(x → ln)),   (10)</p>
        <p>y⃗ = (w(y → l1), w(y → l2), ..., w(y → ln)).   (11)</p>
        <p>Thus, the weighted value w for the link a → b can be defined as:</p>
        <p>w(a → b) = |a → b| · log(t / Σ_x |x → b|),   (12)</p>
        <p>where t is the total number of articles within Wikipedia.</p>
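The WLM computation can be sketched as below. The link counts, the article total, and the alignment of the two weight vectors over a common set of candidate links are invented for illustration.

```python
import numpy as np

def link_weight(count_ab, total_articles, incoming_b):
    # Weight of link a -> b: |a -> b| * log(t / total links into b),
    # following the weighted-count definition above.
    return count_ab * np.log(total_articles / incoming_b)

def wlm(x_vec, y_vec):
    # Cosine of the angle between the two weighted link vectors.
    nx, ny = np.linalg.norm(x_vec), np.linalg.norm(y_vec)
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return float(np.dot(x_vec, y_vec) / (nx * ny))

t = 1_000_000  # pretend total number of Wikipedia articles
x = np.array([link_weight(2, t, 10), link_weight(1, t, 50), 0.0])
y = np.array([link_weight(3, t, 10), 0.0, link_weight(4, t, 20)])
sim = wlm(x, y)
```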
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation Measures and Results</title>
      <p>We used two metrics to determine the quality of the annotations. The first metric is based on ROC
curves [4]. Initially, a receiver operating characteristic (ROC) curve was used in signal detection
theory to plot the sensitivity versus (1 − specificity) of a binary classifier as its discrimination
threshold is varied. Later on, ROC curves were applied to information retrieval in order to
represent the fraction of true positives (TP) against the fraction of false positives (FP) in a binary
classifier. The Equal Error Rate (EER) is the error rate at the threshold where FP=FN. The
area under the ROC curve, AUC, is equal to the probability that a classifier will rank a randomly
chosen positive instance higher than a randomly chosen negative one. Note that the lower the
EER, the better the annotations.</p>
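The EER definition can be made concrete with a small script; this is a generic sketch with toy scores and labels, not the official evaluation tool.

```python
import numpy as np

def roc_points(scores, labels):
    # Sweep the decision threshold from high to low and return the
    # false positive rate and false negative rate at each step.
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)        # positives accepted so far
    fp = np.cumsum(1 - labels)    # negatives accepted so far
    fpr = fp / max(int((1 - labels).sum()), 1)
    fnr = 1.0 - tp / max(int(labels.sum()), 1)
    return fpr, fnr

def equal_error_rate(scores, labels):
    # EER: the error rate at the threshold where FPR is closest to FNR.
    fpr, fnr = roc_points(scores, labels)
    i = int(np.argmin(np.abs(fpr - fnr)))
    return (fpr[i] + fnr[i]) / 2.0

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
eer = equal_error_rate(scores, labels)  # lower is better
```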
      <p>In Table 1, we show the results for all our submitted runs under the EER and AUC metrics. Our
best run corresponds to MMIS 33 2 1245434554581.txt, which follows the first approach of weighted
global features. This run achieved a reasonable EER across all concepts of just over 31%, which
was consistent with the performance predicted by our earlier ten-fold cross validation. Table 2
shows that the ten best concepts identified by this annotator are those which have previously
performed well using only global visual features. With the exception of "Underexposed", the best
performing concepts belong to fairly common visual categories, primarily landscape elements.</p>
      <p>Regarding the four runs based on keyword correlation, we observe that the best performance
is achieved using the training set as corpus. Not surprisingly, the second best is the run based
on the Normalized Google Distance, which uses the Google Web search engine as knowledge source. This is
due to the fact that neither approach relies on a prior disambiguation process, unlike WordNet
and Wikipedia.</p>
      <p>The worst result corresponds to the run based on Wikipedia. The reason might
be found in the strong dependency of the semantic relatedness measure on a proper word
disambiguation. The disambiguation in Wikipedia is performed automatically by selecting the
most probable sense of the word according to the content stored in the Wikipedia database.</p>
      <p>The 53 concepts of the proposed vocabulary belong to one of the following categories: Scene
description, Seasons, Place, Landscape Elements, Time of the day, Picture representation,
Illumination, Quality Blurring, Picture Objects, and Quality Aesthetics.</p>
      <p>Most of these categories do not correspond to real visual features, and the best way of predicting
them is to make use of the "EXIF" metadata. As our focus is on visual features, we have not
incorporated such metadata in any of our algorithms. Consequently, we predicted, and subsequently confirmed,
lower results for concepts classified into categories such as Seasons, Time of the day, Picture
representation, Illumination, Quality Blurring and, especially, the most subjective one, Quality Aesthetics.</p>
      <p>
        The second metric is the proposed hierarchical measure [
        <xref ref-type="bibr" rid="ref9">14</xref>
        ] that considers the relations between
concepts and the agreement of annotators on concepts. In Table 3, the results of all our submitted
runs are shown. The best run is the one based on the Google Web search engine, followed by the
co-occurrence and WordNet approaches. This makes sense, as all these runs employ semantic
similarity measures on external knowledge sources, which is exactly the criterion that the hierarchical score
attempts to evaluate. The run which applied weighted global features relied less on the hierarchical
information and therefore did not perform as well under this metric.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>While it is difficult to make conclusive statements about the submitted runs, as the differences in
their performance are minimal, the results do reinforce previous expectations.</p>
      <p>The performance of our best run (according to EER) supports previous findings about the
impact of feature selection and weighting on the non-parametric density algorithm, as it outperforms
the other four runs, which used the original features.</p>
      <p>With respect to the second metric (hierarchical measure), the best runs are those that use
the training set and the Google Web search engine as knowledge base, because the other approaches
(WordNet and Wikipedia) have been penalised as a result of the prior disambiguation process that
they follow.</p>
      <p>Interestingly, the metric that distinguishes the performance of annotators based on the measurement of
the hierarchical distribution isolates the feature weighting run. This alternative method of ranking
performance gives valuable insight into the influence and impact of the analysis of hierarchical
labels in image annotation. It is likely that annotators that achieve a higher ranking using the
hierarchical measure have a better distribution across the concepts. Further analysis is needed to
determine whether annotators with a better hierarchical measure are also more robust overall.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the EU Pharos project (IST-FP6-45035) and by Santander
Corporation.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness.
In Proceedings of the Eighteenth International Conference on Artificial Intelligence, 2003.</p>
      <p>[2] Rudi Cilibrasi and Paul Vitanyi. The Google similarity distance. IEEE Transactions on
Knowledge and Data Engineering, 19(3):370-383, 2007.</p>
      <p>[3] P. Duygulu, Kobus Barnard, J. F. G. de Freitas, and David A. Forsyth. Object recognition
as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of
the European Conference on Computer Vision, pages 97-112, 2002.</p>
      <p>[4] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874,
2006.</p>
      <p>[5] Jorge Gracia and Eduardo Mena. Web-based measure of semantic relatedness. In
Proceedings of the 9th International Conference on Web Information Systems Engineering, volume 5175,
pages 136-150, 2008.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grubinger</surname>
          </string-name>
          .
          <article-title>Analysis and Evaluation of Visual Information Systems Performance</article-title>
          .
          <source>PhD thesis</source>
          , School of Computer Science and Mathematics, Faculty of Health, Engineering and Science, Victoria University, Melbourne, Australia,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A</given-names>
            <surname>Hanbury</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <article-title>Mathematical morphology in the CIELAB space</article-title>
          .
          <source>Image Analysis &amp; Stereology</source>
          ,
          <volume>21</volume>
          :
          <fpage>201</fpage>
          -
          <lpage>206</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Suzanne</given-names>
            <surname>Little</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Rueger</surname>
          </string-name>
          .
          <article-title>Conservation of effort in feature selection for image annotation</article-title>
          .
          <source>In IEEE Workshop on Multimedia Signal Processing (MMSP2009)</source>
          , Rio De Janeiro, Brazil,
          October 5-7,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ainhoa</given-names>
            <surname>Llorente</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Ruger</surname>
          </string-name>
          .
          <article-title>Using second order statistics to enhance automated image annotation</article-title>
          .
          <source>In Proceedings of the 31st European Conference on Information Retrieval</source>
          , volume
          <volume>5478</volume>
          , pages
          <fpage>570</fpage>
          -
          <lpage>577</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          and
          <string-name>
            <given-names>W</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Texture features for browsing and retrieval of image data</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>18</volume>
          :
          <fpage>837</fpage>
          -
          <lpage>842</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Walter G.</given-names>
            <surname>Charles</surname>
          </string-name>
          .
          <article-title>Contextual correlates of semantic similarity</article-title>
          .
          <source>Journal of Language and Cognitive Processes</source>
          ,
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>An effective, low-cost measure of semantic relatedness obtained from Wikipedia links</article-title>
          .
          <source>In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dunker</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task</article-title>
          . In
          <source>Cross-Language Evaluation Forum Working Notes</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hanna</given-names>
            <surname>Lukashevich</surname>
          </string-name>
          .
          <article-title>Multilabel classification evaluation using ontology information</article-title>
          .
          <source>In Proceedings of ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mori</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamawaki</surname>
          </string-name>
          .
          <article-title>Textural features corresponding to visual perception</article-title>
          .
          <source>IEEE Transactions on Systems, Man and Cybernetics</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>460</fpage>
          -
          <lpage>473</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Alexei</given-names>
            <surname>Yavlinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edward</given-names>
            <surname>Schofield</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Ruger</surname>
          </string-name>
          .
          <article-title>Automated image annotation using global features and robust nonparametric density estimation</article-title>
          .
          <source>In Proceedings of the International ACM Conference on Image and Video Retrieval</source>
          , pages
          <fpage>507</fpage>
          -
          <lpage>517</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>