MIRACLE evaluation of results for ImageCLEF 2003

Julio Villena Román2,3, José Luis Martínez1, Jorge Fombella3, Ana G. Serrano4, Alberto Ruiz4, Paloma Martínez1, José M. Goñi5, José C. González3

1 Advanced Databases Group, Computer Science Department, Universidad Carlos III de Madrid, Avda. Universidad 30, 28911 Leganés, Madrid, Spain. {pmf,jlmferna}@inf.uc3m.es, jvillena@it.uc3m.es
2 Department of Telematic Engineering, Universidad Carlos III de Madrid, Avda. Universidad 30, 28911 Leganés, Madrid, Spain. jvillena@it.uc3m.es
3 DAEDALUS – Data, Decisions and Language, S.A., Centro de Empresas "La Arboleda", Ctra. N-III km. 7,300, Madrid 28031, Spain. {jvillena,jfombella,jgonzalez}@daedalus.es
4 ISYS group, Artificial Intelligence Department, Technical University of Madrid, Campus de Montegancedo s/n, Boadilla del Monte 28660, Spain. {agarcia,aruiz}@isys.dia.fi.upm.es
5 Department of Mathematics Applied to Information Technologies, E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid, Avda. Ciudad Universitaria s/n, 28040 Madrid, Spain. jmg@mat.upm.es

Part of this work has been supported by the OmniPaper (IST-2001-32174) and CAM 07T/0055/2003 projects.

Abstract. ImageCLEF is a new pilot experiment introduced in CLEF 2003. It is devoted to the cross-language retrieval of images using textual descriptions related to image content. This paper presents the MIRACLE research team's experiments and the results obtained for this track.

1 Introduction

There are several differences between cross-language information retrieval (CLIR) for documents and for images, due to the different nature of the two information structures. Although documents and images can be considered similar at a certain level of abstraction (both express ideas and concepts), practical applications that deal with them must take their differences into account. The main difference between a document and an image is that the latter can usually be interpreted in several different ways, while a document is, more or less, understood in a fixed manner. This is one of the beauties of images and non-verbal communication, but also one of their main problems when trying to work with them in an ordered and structured manner.

In recent years, great effort has been devoted to the analysis and study of content-based image retrieval; at the time of this writing, it seems clear that, for a considerable time to come, these techniques alone are not going to solve the problem. This is the main reason to focus on image retrieval based on text descriptions and keywords. The idea is to use a textual description of each image as the basis for the image retrieval process. This approach has three main drawbacks:

- Image descriptions will be incomplete, as also happens with text documents.
- Image descriptions will usually be quite short, typically image captions and/or a few keywords referring to the most relevant characteristics and components of the image.
- The multilingual dimension of the problem. Although images are not coupled to a specific language, image captions will be available in different languages, so some kind of multilingual approach must be considered.

But there are also some advantages related to the use of image descriptions in image retrieval applications. One of them is extensibility to other multimedia information formats, like audio or video, or even other kinds of information, e.g., source code. On the other hand, image retrieval would be very interesting for application in online newspapers, reviews, television, etc.
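To make the caption-as-document premise concrete, the following toy sketch (not part of the MIRACLE system; all image identifiers and captions are invented for illustration) indexes image captions in a simple inverted index and answers a keyword query against them.

```python
# Toy illustration of text-based image retrieval: each image is represented
# solely by its caption, and retrieval works on those captions.
# Hypothetical data and helper names; not the MIRACLE system itself.
from collections import defaultdict

captions = {
    "img_001": "Fishing boats moored in the harbour at low tide",
    "img_002": "Portrait of a fisherman mending nets beside his boat",
    "img_003": "View of the cathedral ruins from the harbour wall",
}

# Build an inverted index from caption words to image identifiers.
index = defaultdict(set)
for image_id, caption in captions.items():
    for word in caption.lower().split():
        index[word].add(image_id)

def search(query):
    """Return images whose captions share at least one word with the query,
    ranked by the number of matching query words (a crude OR search)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for image_id in index.get(word, ()):
            scores[image_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("boats in the harbour"))  # e.g. ['img_001', 'img_003', ...]
```

Even this crude matching shows why short, incomplete captions limit recall: an image whose caption omits the query terms can never be retrieved, which motivates the expansion techniques discussed below.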
Techniques applied by the MIRACLE research team to this task range from relevance feedback to semantic expansion of topic terms using WordNet. The main idea behind the MIRACLE participation is to compare how these different retrieval techniques affect retrieval performance.

2 ImageCLEF track description

In order to experiment with image CLIR, a collection of nearly 30,000 black and white images from the Eurovision St Andrews Photographic Collection was provided. Each image has a fairly long English caption (of nearly 50 words). In addition, a set of 50 queries in English, French, German, Italian, Spanish and Norwegian was provided. The non-English queries were obtained as human translations of the original English queries, which also included a narrative explanation of what should be considered a relevant image for each query.

The proposed tasks were to retrieve the relevant images of the collection using different query languages. Therefore, this year the ImageCLEF track only dealt with monolingual and bilingual image retrieval. Multilingual image retrieval is expected to be close to multilingual document retrieval, both in techniques and in expected results, and so it has not been considered this year. After all, a logical first step towards multilingual retrieval is to solve (up to a reasonable point) the monolingual and bilingual retrieval problems.

Although there are clear limitations in the current ImageCLEF track, both in the size of the collection and in the number of possible experiments to perform (six: one monolingual and five bilingual), it is an interesting starting point to get an idea of how good (or bad) the performance of this kind of system is, both in monolingual and in bilingual searches. For this task, the MIRACLE team has submitted 25 runs: 5 for the monolingual English task, 6 for the bilingual Spanish to English task, 6 for the bilingual German to English task, 4 for the bilingual French to English task and, finally, 4 for the bilingual Italian to English task. All submitted runs are automatic.

3 MIRACLE experiments description

This section describes the tools, techniques and experiments used by the MIRACLE team for the different tasks addressed in this ImageCLEF campaign.

For the cross-language tasks, the information retrieval engine used at the core of the system has been Xapian [5], which is based on the probabilistic retrieval model. This tool is highly configurable, allowing the use of different techniques related to information retrieval tasks, such as stemmers based on the Porter algorithm [7]. In order to apply natural language processing to image descriptions and queries, ad hoc tokenizers have been developed for each language; they recognize some of the usual compound words of each language and identify different kinds of alphanumerical tokens such as dates, proper nouns, acronyms, etc. Standard stopword lists have also been used, and a special word decompounding module has been applied to the German queries. WordNet [6] has been used to expand queries. For translation purposes, two available translation tools have been considered: FreeTranslation [3] for full text translations, and ERGANE [4] for word by word translations. These tools have been combined in different ways in order to evaluate different approaches and compare the influence of each one on the precision and recall of the image retrieval process.
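To give a concrete picture of this retrieval set-up, the sketch below indexes a few captions and runs a stemmed OR query using the current Xapian Python bindings; it is only an approximation of the configuration described above (the bindings shown postdate the system, and the sample captions and database path are invented).

```python
# Minimal Xapian sketch: index English captions with Porter-style stemming,
# then run an OR query over the stemmed terms. Illustrative only; the data
# and database path are invented, not taken from the ImageCLEF collection.
import xapian

db = xapian.WritableDatabase("captions.db", xapian.DB_CREATE_OR_OPEN)

term_generator = xapian.TermGenerator()
term_generator.set_stemmer(xapian.Stem("english"))

for image_id, caption in [
    ("img_001", "Fishing boats moored in the harbour at low tide"),
    ("img_002", "View of the cathedral ruins from the harbour wall"),
]:
    doc = xapian.Document()
    doc.set_data(image_id)
    term_generator.set_document(doc)
    term_generator.index_text(caption)
    db.add_document(doc)

# Parse the query with the same stemmer; terms are combined with OR,
# as in the OR-based runs described in the next section.
parser = xapian.QueryParser()
parser.set_stemmer(xapian.Stem("english"))
parser.set_default_op(xapian.Query.OP_OR)
query = parser.parse_query("boats in the harbour")

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.rank + 1, match.document.get_data().decode())
```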
In particular, the experiments submitted for the monolingual task have been the following:

OR: Intended as the baseline experiment, to be compared with the results of the other experiments, it consists of the combination of all the stemmed words appearing in the title of the query, without stopwords, using an OR operator between them.

ORlem: This experiment joins the original words of the query and their stems in a single query, using the OR operator to concatenate them. The idea behind this experiment is to measure the effect of inadequate stemming by adding the original form of the query terms.

ORlemexp: The idea behind this experiment is to perform synonym expansion of the terms and stems used in the ORlem experiment, linking the obtained words with an OR operator. The intended result is to retrieve a larger number of documents (increased recall), despite the possible penalty in precision.

Doc: For this experiment, a special feature of the Xapian system is used which allows the execution of document-based queries against the indexed document collection. This approach is similar to the application of the Vector Space Model. In order to carry out this experiment, the query is first indexed as if it were another image description, and then similar documents are retrieved.

ORrf: This experiment performs blind relevance feedback (based on the results of a simple OR query). The process consists of executing a query, taking the first 25 documents, extracting the 250 most important terms from those documents, and building a new query to execute against the index database, which provides the final results.

The bilingual experiments submitted have been the following:

TOR1: Similar to the monolingual OR experiment, but using the FreeTranslation tool to translate the complete query. The steps followed to build the query are therefore: first, translate the full query using FreeTranslation; then use the tokenizer to identify the different tokens in English, extract the stems of the tokens, remove stopwords (in this case, stop stems) and generate an OR query from the resulting terms.

TOR3: In this case, in addition to the translation of the complete query, a word by word translation is added, using ERGANE. The following steps (tokenization, stemming and OR concatenation) are the same as in the TOR1 experiment. The idea is to improve retrieval performance by adding different translations for the words in the query.

Tdoc: This is the bilingual equivalent of the monolingual Doc experiment. This time the query is first translated using FreeTranslation and the result is indexed in the system as if it were just another image description. The information retrieval engine (Xapian) is then asked to retrieve images similar to this newly added one.

TOR3exp: This is the bilingual equivalent of the monolingual ORlemexp experiment. It is basically the same as the TOR3 experiment, but adding a synonym expansion (using WordNet) of the translated terms.

TOR3full: Similar to the TOR3 experiment, but adding the original query (in the original language) to the terms used in the OR query. This way, query terms that were incorrectly translated or that have no proper English translation are included in their original form (possibly of little use, but at least present in some form).

TOR3fullexp: This experiment is a combination of TOR3full and TOR3exp, using both translation engines together with the original query and adding synonym expansion for all the terms obtained.
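As an illustration of how the expanded monolingual query strings can be assembled, the sketch below builds OR, ORlem and ORlemexp style queries. It is a hypothetical reconstruction, not the MIRACLE code: NLTK's Porter stemmer and WordNet interface stand in for the tools actually employed, and the stopword list is merely illustrative.

```python
# Hypothetical reconstruction of the OR / ORlem / ORlemexp query construction.
# Requires nltk with the WordNet data downloaded (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
STOPWORDS = {"the", "of", "a", "an", "in", "on", "at", "and", "or", "with"}  # illustrative

def title_terms(title):
    """Lower-cased title words with stopwords removed."""
    return [w.lower() for w in title.split() if w.lower() not in STOPWORDS]

def or_query(title):
    """OR run: stems of the title words, joined with OR."""
    return " OR ".join(sorted({stemmer.stem(t) for t in title_terms(title)}))

def orlem_query(title):
    """ORlem run: original words plus their stems, to soften bad stemming."""
    terms = title_terms(title)
    return " OR ".join(sorted(set(terms) | {stemmer.stem(t) for t in terms}))

def orlemexp_query(title):
    """ORlemexp run: ORlem terms plus WordNet synonyms of each original word."""
    terms = title_terms(title)
    expanded = set(terms) | {stemmer.stem(t) for t in terms}
    for t in terms:
        for synset in wn.synsets(t):
            expanded.update(l.lower().replace("_", " ") for l in synset.lemma_names())
    return " OR ".join(sorted(expanded))

print(or_query("Fishing boats in the harbour"))
print(orlemexp_query("Fishing boats in the harbour"))
```

The bilingual TOR variants follow the same pattern, with the title first replaced by its FreeTranslation and/or ERGANE output before tokenization and stemming.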
4 Tasks submitted and obtained results

In this section, the results obtained by the MIRACLE team are presented and compared, in order to draw some conclusions about the different approaches tested.

To assess the defined experiments, the CLEF evaluation staff used the first 100 results of each submission (45 in total) to build a document pool (different for each query). In addition, the results of different interactive searches manually performed by assessors were also added to each pool. Then, two different assessors evaluated all the documents in the pools using a ternary scale: relevant, partially relevant and not relevant. The partially relevant judgement was used for images that the judges considered in some way relevant but about which they could not be entirely confident. As a final step, four relevance sets were created from the relevance judgements of both judges:

- Union-strict: the union of the images judged as relevant by either assessor.
- Union-relaxed: the union of the images judged as relevant or partially relevant by either assessor.
- Intersection-strict: the images judged as relevant by both assessors.
- Intersection-relaxed: the images judged as relevant or partially relevant by both assessors.

In this way, the strict and intersection sets can be considered high-precision result sets, while the relaxed and union sets can be thought of as result sets that promote higher recall.

4.1 Monolingual task

As stated before, the monolingual task consists of a set of queries in English run against a collection of image descriptions also in English. Figure 1 presents the recall vs. precision graph for each of the five experiments submitted for this task. The values correspond to the evaluation of the results against the Intersection-Strict relevance set (the most stringent one).

Figure 1. Recall-Precision graph for the Monolingual task (runs enenQdoc, enenQor, enenQorlem, enenQorlemexp and enenQorrf).

The first thing that can be noticed in this graph is that the best runs have quite high precision values, especially taking into account that image retrieval is a difficult task; in fact, too high when compared with the monolingual document retrieval results we obtained in the monolingual tasks of CLEF 2003. The explanation we find is that only four groups took part in ImageCLEF this year, and since the relevance sets were established basically from the submissions of each group, the actual coverage of relevant documents was not as complete as it should have been. This could be why such high precision values have been obtained.

Another interesting aspect of the presented results is that the run using blind relevance feedback leads to much worse results than all the other strategies. A possible explanation is that the parameters used in the automatic relevance feedback were not appropriate for the kind of documents we were trying to retrieve. In fact, we used the top 250 terms over the first 25 images retrieved. Given that the description field of each image has a mean length of 50 words, it becomes apparent that the number of feedback terms is excessive. Therefore, instead of helping to locate more relevant images, these terms only add noise that seriously penalizes the overall performance.
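Schematically, the feedback loop behaves as sketched below. This is a reconstruction rather than the actual MIRACLE code: run_query is a hypothetical stand-in for the engine call, and raw term frequency stands in for whatever term importance measure was actually used.

```python
# Sketch of the blind relevance feedback used in the ORrf run: take the top 25
# captions returned by an initial OR query, keep the 250 most frequent
# non-stopword terms, and re-query with them. Hypothetical helper names.
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "in", "on", "at", "and", "or", "with"}  # illustrative

def blind_relevance_feedback(initial_query, run_query, n_docs=25, n_terms=250):
    # 1. Initial retrieval with the plain OR query; run_query returns captions
    #    of the retrieved images in rank order.
    top_captions = run_query(initial_query)[:n_docs]

    # 2. Collect term statistics over the retrieved captions.
    counts = Counter()
    for caption in top_captions:
        counts.update(w for w in caption.lower().split() if w not in STOPWORDS)

    # 3. Build the expanded query from the most frequent terms. With captions of
    #    about 50 words, 25 documents yield far fewer than 250 distinct useful
    #    terms, so essentially every term is kept, which helps explain why the
    #    run mostly added noise.
    feedback_terms = [term for term, _ in counts.most_common(n_terms)]
    expanded_query = " OR ".join(feedback_terms)

    # 4. Final retrieval with the expanded query.
    return run_query(expanded_query)
```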
It is worth mentioning that, instead of increasing the performance of the system, any kind of term expansion (adding the original query words or performing synonym expansion) only reduces the precision of the results. This could be due to the relatively low number of images in the collection, which makes it unnecessary to use word expansion to compensate for the heterogeneous descriptions that would arise in larger collections built from different sources. Perhaps this strategy could be of interest in the next ImageCLEF track, which will probably include larger collections.

Figure 2 represents the average precision of each submitted run over all topics, ordered from best to worst. This graph constitutes a simpler representation of the overall performance of each experiment than the recall-precision graph, allowing the quantitative differences between the approaches to be grasped at a single glance. Again, the values presented are calculated for the Intersection-Strict relevance set.

Figure 2. Average precision comparison of the monolingual runs (enenQor, enenQdoc, enenQorlem, enenQorlemexp and enenQorrf).

As previously noticed in the recall-precision graph, it clearly shows the poor performance of our relevance feedback experiment and the similarity of the remaining experiments, especially the simple OR query approach (enenQor) and the query-indexing approach (enenQdoc). Although only the Intersection-Strict relevance set has been discussed in this section, the differences with the other sets are subtle, apart from a slight increase of the overall precision in all cases due to the larger number of relevant documents they contain.

4.2 Bilingual tasks

The bilingual tasks consist of the execution of queries in languages other than English, trying to retrieve relevant documents from a set of images described in English. Although queries in Spanish, Italian, German, French and Norwegian were available, we only took part in the first four languages.

Figure 3 shows the recall vs. precision graphs obtained for each of the submitted runs and language pairs. In every case the values were obtained using the Intersection-Strict relevance set, the strictest of all the result sets provided.

Figure 3. Recall-Precision graphs for the bilingual tasks (French-English, German-English, Italian-English and Spanish-English runs).

Several conclusions can be extracted from these graphs. The most remarkable one could be the similarity of the QTdoc, QTor1, QTor3 and QTor3full experiments, with QTor1 and QTdoc being the best in all cases.
This is consistent with the results obtained in the monolingual task, where the best performance was obtained by simply ORing the terms of the query (previously stemmed and with stopwords removed), the enenQor submission, and by indexing the query as another image description and searching for similar documents in the system, the enenQdoc submission.

Another interesting aspect the graphs show is that using more than one automatic translation has turned out to be worse in our case than just using a single one of reasonable quality (as FreeTranslation has proved to be). It should be studied in more detail whether the cause of this loss of quality was the use of ERGANE as the word by word translator, or simply the fact of including word by word translations instead of only complete query translations. The second reason is the more likely one, since word by word translation always leads to broader queries which, although they increase recall, make precision worse. An example of this can be found in the German to English and Spanish to English runs in which synonym expansion is included (broader queries), leading, as expected, to worse precision values.

Another fact to point out is that the precision values obtained in each task are quite similar, except for the French to English queries, which were slightly worse than the others. The explanation could be poorer French to English translations provided by FreeTranslation, or the use of different terms (harder to translate) in the French queries.

Comparing the overall performance of the bilingual tasks with the monolingual one, a difference of about 10 to 15% arises, which is quite normal in typical CLIR nowadays. At least, this is approximately the same value we have obtained in the bilingual tasks of CLEF this year (as could be expected).

Figure 4 shows the average precision of each of the runs submitted, considering the Intersection-Strict relevance set, as usual. The runs are ordered by descending precision and grouped by task.

Figure 4. Average precision comparison of the bilingual runs, grouped by language pair (French-English, German-English, Italian-English and Spanish-English).

As in the case of the monolingual task, the results show little difference among the different approaches, with QTdoc and QTor1 consistently outperforming the rest. It is once more apparent that our French to English retrieval has been slightly worse than the others, while the Spanish to English runs obtained the best individual results (though not the best average results across all their runs).

5 Conclusions and future directions

The main conclusion that can be drawn from the obtained results is that the simplest approaches studied (ORing the query terms, and indexing the query and looking for similar documents) are the ones that lead to the best results.
Our main goal in this first participation in the ImageCLEF track was to establish a starting point for future research work in cross-language information retrieval applied to images (and, in general, to other non-textual types of data that can somehow be represented by textual descriptions, such as video). Taking into account the obtained results, it stands out that there is much room for improvement in both monolingual and bilingual retrieval performance. Also, despite the apparently bad results obtained with synonym expansion, it looks like an interesting field in which to continue doing research, especially for its likely application to larger and more heterogeneous collections.

References

[1] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999).
[2] Sparck Jones, K., Willett, P.: Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1997).
[3] Free Translation, www.freetranslation.com
[4] Ergane Translation Dictionaries, http://dictionaries.travlang.com
[5] The Xapian Project, www.sourceforge.net
[6] Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41 (1995).
[7] The Porter Stemming Algorithm, page maintained by Martin Porter, www.tartarus.org/~martin/PorterStemmer/