Télécom Bretagne at ImageCLEF WikipediaMM
                       2010

                                 Adrian Popescu

          Institut Télécom/Télécom Bretagne, Département Informatique
                        adrian.popescu@telecom-bretagne.eu


      Abstract. In this paper, I describe the approach proposed by Télécom
      Bretagne for the WikipediaMM 2010 evaluation campaign [6]. One of
      the main challenges in large scale image retrieval is the mismatch be-
      tween query terms and image textual descriptions from the database.
      This mismatch can be reduced using query expansion and here I present
      a Wikipedia based query expansion approach. In order to boost results’
      accuracy, the expansion is followed by a reranking step which uses query
      models extracted from Flickr.


1   Introduction
Retrieving images from heterogeneous and noisy databases was thoroughly stud-
ied but remains an interesting research topic. Some open questions include:
 – how to deal with difficult queries?
 – how to perform query expansion in a manner that improves both precision
   and recall?
 – what resources to use in order to model query content?
In this paper, I present techniques which represent possible answers to these
questions. A query expansion technique is adapted from previous work, and
is augmented a query modeling module based on Flickr tags. The categorical
structure of Wikipedia is exploited in order to find and rank concepts from the
encyclopedia which are semantically similar to the initial query. The experiments
are focused on English queries but the same techniques are applicable to other
languages.
    With over 3,000,000 articles in its English version, Wikipedia is a rich re-
source and is used in a variety of research tasks, such as: sense disambiguation,
ontology extraction or semantic relatedness. The last problem can be formu-
lated as follows: given an input (a concept or a longer text), find the concepts
which are most closely related to the input. Wikipedia based techniques to find
semantic relatedness include WikiRelate! [5], Explicit Semantic Analysis (ESA)
[1] and Wikipedia Link-based Measure (WLM) [2]. WikiRelate! modifies tech-
niques previously applied to WordNet in order to suit Wikipedia’s structure.
The authors of [1] map queries to Wikipedia concepts representation in order to
find related concepts. ESA is interesting because it finds related concepts for any
2      Adrian Popescu

given query and not only for mono-conceptual queries and is thus suited for use
in Web information retrieval. WLM exploits only Wikipedia links to find related
concepts.A comparison of the three techniques [2] shows that ESA achieves the
best performances, followed by WLM and WikiRelate!. Modeling Flickr content
is another very interesting area of research. Related to the present paper are
techniques that analyze Flickr tags to find frequent topics and their correlations
[4] and [8]. Correlations are then used in order to suggest new tags based on
supplied tags. The authors of [4] also map tags into WordNet to extract the dis-
tribution of Flickr tags in different conceptual domains. They report that main
tag categories include artifacts and places. Wu et al. [8] analyze image content
in order to improve tag choice using textual correlations.


2     Query modeling

The proposed retrieval model has two main components: query modeling with
Flickr, respectively with Wikipedia. These two components are described in the
following subsections subsections.


2.1   Query modeling with Flickr

Flickr is a photo sharing site that contains over 4.8 billion photos as of August
2010. A part of these photos are tagged and these tags can be used to build
query models. The 2010 WikipediaMM textual topics, and – more generally –
image queries, contain visual cues (terms specific to photography such as close
up, black and white), which are not useful during the query modeling stage. A
list of photographic terms is extracted from http://en.wikipedia.org/wiki/
Category:Film_techniques and http://en.wikipedia.org/wiki/Category:
Photographic_processes (respectively the corresponding categories for French
and German) and these terms are removed from the queries. Prepositions are
also stripped from the queries and the remaining words are of each topic are
used to query the Flickr API and download metadata for up to top 20,000
images associated to the query. The relatedness is defined by counting the photos
that are annotated with the respective term. In table 1, we present the top 10
related terms for fractals, tennis player on court and cactus in desert. Most of
the terms presented in the table are closely related to the initial query and they
constitute an acceptable model of its content. They range from generic terms
such as abstract, digital, nature for fractals to specific terms such as wimbledon,
federer or centre court for tennis player on court.
    One known problem in information retrieval is that the word form in the
queries is not the same as their form in the database and this mismatch hurts
recall. One common solution to this problem is to use stemming but this solution
has its drawbacks since the stemmed forms of the words can match completely
different words, especially when dealing with multilingual datasets. Stemming
was only used for the words in the query and was followed by a look-up in the
query models in order to find word variants. Related words that have an edit
                       Télécom Bretagne at ImageCLEF WikipediaMM 2010               3

 Topic Topic text    Top related terms
 ID
 1     fractals     fractal, apophysis, abstract, mandelbrot, romanesco, art, green,
                    digital, cauliflower, nature
 8   tennis player wimbledon, tennis court, racket, players, sport, federer, ball,
     on court       wta, atp, centre court, tournament
 39 cactus       in arizona, cacti, saguaro, tucson, cholla, sonoran, phoenix, califor-
     desert         nia, barrel cactus, az
   Table 1. Flickr query models for English. Top 10 related terms are presented.


distance smaller than 3 with respect to words in the query or terms that begin
with the stem of the terms in the query were considered relevant. For instance,
the topic cactus in desert becomes cactus:cacti desert and each word variant
will be used in order to search relevant results. Although the discussion here
is focused on English queries, the same procedure was applied to French and
German versions of the topics.


2.2   Query modeling with Wikipedia

Words in the topic do not cover the entire semantic field of the underlying con-
cepts and many potentially relevant results are ignored. One particularly useful
relation for improving the conceptual coverage of the topic is the conceptual in-
heritance (X isA Y). For instance, Elena Dementieva or Rafael Nadal are tennis
players and images annotated with their names are potentially relevant the topic
tennis player on court. To discover semantically similar concepts, I rely on own
previous work [3], which exploits the categorical structure of Wikipedia.
     The main exploited resource is Wikipedia, which provides its dumps for free
use. I downloaded the English version of March 2010, split them into individual
files and search for articles which are related to the words in the WikipediaMM
topics. Prior to that, topics were preprocessed in order to remove photographic
terms and prepositions and to find term variants in the query models (see pre-
vious subsection). Also, topic words were run trough WordNet and synonyms
were extracted for unambiguous ones (words that have a single WordNet entry).
This extraction was limited to unambiguous terms because noise can be intro-
duced by secondary senses of polysemous words. The enriched versions of topics
were compared to Wikipedia categories in order to find articles categorized with
words in the topic. The number of common terms between the topic and an
article’s categories represents a rough measure of their semantic similarity. It
is defined here as a score between 0 (no terms in common) and 1 (all terms in
common) and used to propose a first ranking of Wikipedia articles (first column
in table 2). Since many articles share the same coarse grained similarity score,
ties are broke by introducing a second score which is directly dependent of the
number and frequency of terms from the Flickr query model that appear in the
target article and inversely proportional to the log of the length of the article.
4       Adrian Popescu

This second score was determined empirically and results would be probably
improved if a more principled similarity distance was used.


 Topic Topic text     Top related concepts (with coarse and fine grained similarity
 ID                   scores)
 1     fractals       Fractal antenna (1, 0.125); Fractal cosmology (1, 0.106); Man-
                      delbrot set (1, 0.092); Fractal dimension (1, 0.076); Barnsley
                      fern (1; 0.066); Fractal landscape (1, 0.065); Iterated function
                      system (1, 0.06); Newton fractal (1, 0.06); Fractal art (1, 0.06);
                      Lyapunov fractal (1, 0.051)
 8     tennis player Roger Federer (0.667, 1.771); Rafael Nadal (0.667, 1.423); Billie
       on court       Jean King (0.667, 0.9501); Serena Williams (0.667, 0.95); Juan
                      Martin del Potro (0.667, 0.936); Venus Williams (0.667; 0.859);
                      Jimmy Connors (0.667, 0.859); Ken Rosewall (0.667, 0.795);
                      Maria Sharapova (0.667, 0.777); Pete Sampras (0.667, 0.776);
                      Justine Henin (0.667, 0.734)
 39    cactus      in Saguaro (1, 0.143); Stenocereus thurberi (1, 0.099); Opuntia
       desert         ficus-indica (1, 0.091); Pachycereus pringlei (1, 0.08); Cylin-
                      dropuntia bigelovii (1, 0.068); Opuntia basilaris (1, 0.048);
                      Cylindropuntia fulgida (1, 0.044); Opuntia engelmannii (1,
                      0.04); Hylocereus undatus (1, 0.022); Wharram Percy (0.5,
                      4.58)
Table 2. Wikipedia query models obtained obtained using category look-up and Flickr
query models. Top 10 related terms are and their similarity scores are presented.


    The examples in 2 show that extracted Wikipedia concepts are generally
closely related to the initial topics. Also, the example for topic 39 illustrates
well the importance of the coarse similarity score because the top 9 concepts
(similarity = 1) are related to both terms in the query; whereas the last term,
Wharram Percy, a deserted village in England, is only related to desert. The
Wikipedia translation graph was used to also get similar Wikipedia concepts for
French and German.
    Images in the WikipediaMM collection come with brief textual descriptions.
In this setting, query expansion is an appealing way to improve recall and, if
performed in a judicious way, to also improve results precision.

3    Retrieval experiments
Both textual and multimodal runs were submitted this year. Multimodal runs in-
volved a k-NN inspired visual reranking of textual results and actually degraded
the final quality of results. Also, runs were submitted for English only queries
and for all three versions of each topic. The multilingual scores of images were
formed by adding up scores in individual languages. As with multimodal runs,
the overall performance was degraded with respect to English only runs. There-
fore, in this section we focus on textual English only runs. Unlike past years’
                      Télécom Bretagne at ImageCLEF WikipediaMM 2010           5

WikipediaMM campaigns, the texts of the articles that contained collection im-
ages were provided to participants. This thorough textual context of the images
facilitates the retrieval task. However, since the main purpose of Télécom was to
test a retrieval approach in a database with short and noisy textual descriptions,
submitted runs were based on the metadata files only and not on article texts.
    Given a query, image results from the Wikipedia collection are retrieved by
searching for images which are described either by terms in the initial query or
by related concepts from Wikipedia. We assume that the relatedness of an image
to a query is proportional to the number of words shared by its description in
the collection and the query (in its extended form). A score between 0 and 1
is attributed to the image, in function of the number of topic words found. For
instance, if the initial query contained three words (tennis player court) and
two of them were retrieved (tennis and player or tennis and court), the image
gets a score of 0.667. If a Wikipedia related concept is retrieved in the image’s
metadata, the corresponding coarse similarity score is attributed to the image,
with a small penalization to account for the rank of the Wikipedia concept. For
instance, an image annotated with Roger Federer gets a score of 0.667 while
another image annotated with Venus Williams gets a score of 0.666. Relevant
images are found by launching queries with the initial terms and the expanded
queries in the following order:

 – all terms in the initial query and a related concept
 – the initial query
 – parts of the initial query (starting with largest subparts - and favoring rare
   terms) and a related concept
 – related concept or parts of the initial query

    One effect of this type of scoring is that many images have the same ranking
score and they need to be separated. To break ties, Flickr query models are used
to search related terms in the image description. For two images are annotated
with Venus Williams, which both have an initial score of 0.666, an image also
annotated with wimbledon will be ranked higher than an image annotated with
wta because the first word is more closely related to Venus Williams than the
second.
    To study the influence of the use of Wikipedia related concepts, two runs
were submitted:

 – telecom en flickr wiki – involves searches with up to top 1000 Wikipedia
   related concepts, regardless of their coarse similarity score.
 – telecom en flickr wiki maj – is limited to search with only those concepts
   among the top 1000 Wikipedia related concepts that have a similarity score
   higher than 0.5. For instance, in the case of cactus in desert, only the top
   nine terms from table 2 are used. This limitation was imposed in order to
   study the effect of semantic similarity on the retrieval results.

    The MAP and precision (@10 and @20) results for the two analyzed runs
are presented in table 3. The run that uses only Wikipedia concepts with a high
6       Adrian Popescu

         Overall rank Run ID                        MAP P@10 P@20
         14             telecom en flickr wiki      0.2052 0.4786 0.4236
         18             telecom en flickr wiki maj 0.2227 0.4829 0.4407
Table 3. Wikipedia query models obtained obtained using category look-up and Flickr
query models. Top 10 related terms are and their similarity scores are presented.


similarity to the initial query has better scores for both types of measures. The
relative improvement of the mean average precision when limiting the usage of
related concepts is of 8.5%. This difference supports the hypothesis that query
expansion should be performed only with terms that are closely related to the
initial query.


4   Conclusion and future work
Here a query modeling technique that involves the use of Wikipedia related
concepts but also of Flickr tags is introduced. The role of the Wikipedia concepts
is to improve recall when working with scarce annotations. Flickr models are used
to enrich initial queries with words that derivations of the words in the initial
query, to propose a fine grained ranking of Wikipedia concepts and to break ties
between initial scores provided after searching the database with terms in the
enriched version of the initial query and Wikipedia concepts.
    One interesting future work direction is to replace the ad-hoc concept ranking
schemes used in the experiments with more principled methods. Also necessary
for the validation of the approach is the comparison to standard retrieval models
which are already implemented in freely available indexing frameworks. Third,
it would be interesting to evaluate the usage of richer textual descriptions (i.e
Wikipedia articles that contain the images or textual windows that surround
the images in these articles). As noted, only image metadata were used and, as
underlined in [6], these metadata provide only a scarce textual representation
of the images in the collection, which probably penalizes the quality of the final
results.


5   Aknowledgement
This research is part of the French National Agency for Research (ANR) project
Georama ( ANR-08-CORD- 009 ).


References
1. E. Gabrilovich, S. Markovich. Computing Semantic Relatedness using Wikipedia-
   based Explicit Semantic Analysis. In Proc. of IJCAI, 2007
2. D. Milne, I. H. Witten. An effective, low-cost measure of semantic relatedness ob-
   tained from Wikipedia links. In Proc. of WIKIAI, 2008
                      Télécom Bretagne at ImageCLEF WikipediaMM 2010            7

3. A. Popescu, H. Le Borgne, P.-A. Moëllic, Conceptual Image retrieval over a Large
   Scale Database. In Evaluating Systems for Multilingual and Multimodal Information
   Access, In Proc. of the 9th Workshop of the Cross-Language Evaluation Forum,
   Lecture Notes in Computer Science, 2009
4. Sigurbjornsson, B., van Zwol, R.. Flickr Tag Recommendation based on Collective
   Knowledge. In Proc. of WWW 2008 (Beijing, China).
5. M. Strube, S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using
   Wikipedia. In Proc. of AAAI, 2006
6. A. Popescu, T. Tsikrika, J. Kludas. Overview of the WikipediaMM task at Image-
   CLEF 2010. CLEF working notes, 2010
7. R. H. Van Leuken, L. Garcia, X. Olivares, R. van Zwol. Visual Diversification of
   Image Search Results. In Proc. of WWW, 2009
8. Wu, L, Hua, X.-S., Yu, N., Ma, W.-Y. and Li, S. 2008. Flickr Distance. Proc. of
   ACM Multimedia 2008, Vancouver, Canada