                                Sheffield at ImageCLEF 2003
                                         Paul Clough and Mark Sanderson
                                              University of Sheffield
                                     p.d.clough|m.sanderson@sheffield.ac.uk


                                                      Abstract

           In this paper, we use the Systran machine translation system for translating queries for cross
         language image retrieval in a pilot experiment at CLEF 2003, called ImageCLEF. The
         approach we have taken is to assume we have little experience in CLIR, few available
         resources and a limited time in which to create a working CLIR system for this task. In this
         preliminary study, we investigate the effectiveness of Systran on short queries by comparing a
         manual assessment of translation adequacy with an automatic score derived using NIST's
         mteval evaluation tool for machine translation output. We discuss the kinds of translation
         errors encountered during this analysis and show the impact on retrieval effectiveness for
         individual queries in the ImageCLEF task.


1. Introduction
The core component of a Cross Language Information Retrieval (CLIR) system is the method used to translate
the query from the source language into the language of the document collection (the target language). However,
this component involves specialised IR knowledge and familiarity with the source and target languages, or does
it? Imagine you are a company or organisation without these kinds of resources and you want a quick-fix
solution to a cross language problem. Can anything be done without buying in the necessary expertise? Can you
also evaluate how well the translation process has done without being able to understand the source language?
We would argue yes, although not without requiring certain resources (e.g. a CLIR test collection).

We experiment with using a “black-box” translation module: Systran, one of the oldest commercial machine
translation systems, which is widely used in industry and available for free via a Web-based interface. Our
experience with using Systran has been that none of the multilingual processing normally required for cross language retrieval is necessary, e.g. tokenisation, case and diacritic normalisation,
decompounding and morphological analysis. This sounds ideal, but are there any problems with using Systran to
perform the translation? What kinds of translation errors are encountered, does translation quality vary across
different source languages, and how does translation quality affect retrieval performance? These are the kinds of
questions we wanted to address in this study which formed our entry for ImageCLEF 2003.

The ImageCLEF test collection can be used to evaluate retrieval performance, but does not necessarily reflect the
quality of translation because many factors other than translation might affect performance, e.g. the retrieval
system, retrieval enhancements such as query expansion, the relevance assessments or the use of content-based
retrieval methods. Therefore, to enable us to investigate where translation errors occur and assess the success of
Systran independently from retrieval, we manually assess translation adequacy, and show whether this correlates
with an automated approach to measuring translation quality as used in MT evaluation.

2. Background

2.1.    The ImageCLEF task
ImageCLEF is a pilot experiment run at CLEF 2003, dealing with the retrieval of images by their captions in
cases where the source and target languages differ (see [1] for further information about ImageCLEF). Because
the document to be retrieved is both visual and textual, approaches to this task may involve the use of both
multimodal and multilingual retrieval methods. The primary task at this year’s ImageCLEF is an ad hoc retrieval
task in which fifty topics were selected for retrieval and described using a topic title and narrative. Only the title
is translated into Chinese, Dutch, Italian, Spanish, French and German, and is therefore the part suitable for CLIR. As
co-ordinators of this task, we found that assessors used both the image and the caption when judging relevance, and
therefore we know that this task involves more than just CLIR. Further challenges include: (1)
captions that are typically short in length, (2) images that vary widely in their content and quality, and (3) short
user search requests which provide little context for translation.
2.2.    Systran
As a translation system, Systran is considered by many to be a direct MT system (because the whole process relies
on dictionary lookup between a source and a target language), although its stages resemble a transfer-based MT
system. Currently the on-line version of Systran offers bi-directional translation between 20 language pairs,
including languages from Western Europe, Asia and Eastern Europe, with an English-Arabic pair planned for release
in 2004.

There are essentially three stages to Systran: analysis, transfer and synthesis. The first stage, analysis, pre-
processes the source text and performs functions such as character set conversion, spelling correction, sentence
segmentation, tokenisation, and POS tagging. Also during the analysis phase, Systran performs partial analysis
on sentences from the source language, capturing linguistic information such as predicate-argument relations,
major syntactic relationships, identification of noun phrases and prepositional phrase attachment using their own
linguistic formalism and dictionary lookup.

After analysis of the source language, the second process of transfer aims to match with the target language
through dictionary lookup, and then apply rules to re-order the words according to the target language syntax, e.g.
restructure propositions and expressions. The final synthesis stage tidies up the target text and determines
grammatical choice to make the result coherent. This stage relies heavily on large tables of rules to make its
decisions. For more information, consult [2] and [6].
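
To make the analysis-transfer-synthesis structure concrete, the short Python sketch below mimics the flow of such a direct/transfer-style system. It is purely illustrative: the toy dictionary and trivial rules are our own placeholders, not Systran's implementation.

# Illustrative analysis/transfer/synthesis pipeline in the style described
# above.  This is NOT Systran's implementation: the toy dictionary and the
# trivial rules are placeholders for exposition only.

TOY_DICTIONARY = {            # source word -> target word (dictionary lookup)
    "bateaux": "boats",
    "sur": "on",
    "loch": "loch",
    "lomond": "lomond",
}


def analysis(source_text):
    """Pre-process the source text: normalise case and tokenise.

    A full system would also perform spelling correction, sentence
    segmentation, POS tagging and partial syntactic analysis.
    """
    return source_text.lower().split()


def transfer(tokens):
    """Map source tokens to target tokens via dictionary lookup.

    Unknown words are passed through untranslated, which is the behaviour
    seen in Table 1 (e.g. "Muzikanten and their instruments").  A full
    system would also apply re-ordering rules for the target syntax.
    """
    return [TOY_DICTIONARY.get(token, token) for token in tokens]


def synthesis(tokens):
    """Tidy up the target text; here just joining and capitalising."""
    text = " ".join(tokens)
    return text[:1].upper() + text[1:]


print(synthesis(transfer(analysis("Bateaux sur Loch Lomond"))))
# -> Boats on loch lomond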

3. Experimental setup

3.1. Manual assessment of translation quality
Assessing the quality of the output produced by a machine translation (MT) system offers a challenging problem
to researchers. Organisations such as DARPA and NIST have established the necessary resources and framework
in which to experiment with, and evaluate, MT systems as part of managed competitions, similar to the TREC
(see, e.g. [7]) and CLEF (see, e.g. [4]) campaigns. For manual evaluation1, three dimensions upon which to base
judgment include translation adequacy, fluency and informativeness. Translation quality is normally assessed
across an entire document when measuring fluency and informativeness, but adequacy is assessed between
smaller units (e.g. paragraphs or sentences) which provide a tighter and more direct semantic relationship.

To assess adequacy, a high quality reference translation and the output from an MT system are divided into
segments to evaluate how well the meaning is conveyed between the versions. Fluency measures how well the
translation conveys its content with regards to how the translation is presented and involves no comparison with
the reference translation. Informativeness measures how well an assessor has understood the content of a
translated document by asking them questions based on the translation and assessing the number answered
correctly.

Given titles from the ImageCLEF test collection in Chinese, Dutch, French, Spanish, German and Italian, we
first passed these through the on-line version of Systran to translate them into English, the language of the
ImageCLEF document collection. We then asked assessors to judge the adequacy of each translation, treating the
English output as the query that would be submitted to a retrieval system for an ad hoc task. Translators who had
previously been involved with creating the ImageCLEF test collection were chosen to assess translation quality
because of their familiarity with the topics and the collection, with each assessor given topics in their native language.

Translators were asked to compare topic titles2 in the source language with the Systran English version and make a
judgment on how well the translation captured the meaning of the original (i.e. how adequate the translated
version would be for retrieval purposes). A five-point scale was used to assess translation quality, ranging from 5
for a very good translation (i.e. the same or semantically-equivalent words and syntax) to 1 for a very bad
translation (i.e. no translation, or entirely the wrong words). Assessors were asked to take into account the
“importance” of translation errors in their scoring, e.g. for retrieval purposes, mis-translated proper nouns might be
considered worse than errors in other parts-of-speech.




1 See, e.g. the TIDES translation pages: http://www.ldc.upenn.edu/Projects/TIDES/
2 In cases of multiple translations, we used the first translation.
3 We used mteval-v09.pl which can be downloaded from: http://www.nist.gov/speech/tests/mt
Table 1 shows an example topic title for each language at each level of translation quality, very good to good (5-4),
okay (3) and bad to very bad (2-1), to give an idea of the degree of error behind these adequacy scores. We find that
assessment varies between assessors, some being stricter than others, which suggests that further manual
assessments may help to reduce subjectivity. In some cases, particularly for Spanish, the source language title
contains a spelling mistake which obviously affects translation quality. Some assessors allowed for this in their
rating and others did not, suggesting the need to manually check all topics for errors prior to evaluation.

Source language      | Adequacy rating | Source title                         | Systran English                      | Reference English
Chinese (simplified) | 4-5             | 圣安德鲁斯风景的明信片               | Saint Andrews scenery postcard       | Picture postcard views of St Andrews
Chinese (simplified) | 3               | 战争造成的破坏                       | The war creates destruction          | Damage due to war
Chinese (simplified) | 1-2             | 大亚茅斯海滩                         | Asian Mao si beach                   | Great Yarmouth beach
Dutch                | 4-5             | Mannen en vrouwen die vis verwerken  | men and women who process fish       | men and women processing fish
Dutch                | 3               | Vissers gefotografeerd door Adamson  | Fisherman photographed Adamson       | Fishermen by the photographer Adamson
Dutch                | 1-2             | Muzikanten en hun instrumenten       | Muzikanten and their instruments     | Musicians and their instruments
German               | 4-5             | Baby im Kinderwagen                  | Baby in the buggy                    | A baby in a pram
German               | 3               | Portät der schottischen Königin Mary | Portraet of the Scottish Queen Mary  | Portraits of Mary Queen of Scots
German               | 1-2             | Museumaustellungsstücke              | Museumaustellungsstuecke             | Museum exhibits
French               | 4-5             | La rue du Nord St Andrews            | The street of North St Andrews       | North Street St Andrews
French               | 3               | Bateaux sur Loch Lomond              | Boats on Lomond log                  | Boats on Loch Lomond
French               | 1-2             | Damage de guerre                     | Ramming of war                       | Damage due to war
Italian              | 4-5             | Banda Scozzese in marcia             | Scottish band in march               | Scottish marching bands
Italian              | 3               | Vestito tradizionale gallese         | Dressed traditional Welshman         | Welsh national dress
Italian              | 1-2             | Il monte Ben Nevis                   | The mount Very Nevis                 | The mountain Ben Nevis
Spanish              | 4-5             | El aforo de la iglesia               | Chairs in a church                   | Seating inside a church
Spanish              | 3               | Puentes en la carretera              | Bridges in the highway               | Road bridges
Spanish              | 1-2             | las montañas de Ben Nevis            | Mountains of Horseradish tree Nevis  | The mountain Ben Nevis

                         Table 1              Example adequacy ratings assigned manually

Table 1 highlights some of the errors produced by the MT system: (1) un-translated words, e.g. “Muzikanten and
their instruments”, (2) incorrect translation of proper nouns, e.g. “Bateaux sur Loch Lomond” translated as
“Boats on Lomond Log” and “Il monte Ben Nevis” translated as “the mount Very Nevis”, and (3) mis-
translations, e.g. “damage de guerre” translated as “ramming of war”. The limited context of the topic titles also
leads to errors where Systran chooses the wrong meaning of a word, e.g. “Scottish blowing chapels”, where
kapelle is mis-translated as chapel rather than the correct word band. However, Systran does seem able to
handle different entry formats for diacritics (accents above characters), which play an important part in selecting
the correct translation of a word, e.g. in the query “Casas de te’ en la costa” (tea rooms by the seaside), the word
te’ is correctly interpreted as té (tea) rather than te (you).

3.2. Automatic assessment of translation quality
Although most accurate (and most subjective), manual evaluation is time-consuming and expensive, therefore
automatic approaches to assess translation quality have also been proposed, such as the NIST mteval3 tool. This
approach divides documents into segments and computes co-occurrence statistics based on the overlap of word
n-grams between a reference translation produced manually and an MT version. This method has been shown to
correlate well with adequacy, fluency and informativeness because n-grams capture both lexical overlap and
syntactic structure [3].

In the latest version of mteval, two metrics are used to compute translation quality: IBM’s BLEU and NIST’s
own score. Both measures are based on n-gram co-occurrence, although a modified version of NIST’s score has
been shown to be the preferred measure. These scores assume that the reference translation is of high quality,
and that documents assessed are from the same genre. Both measures are also influenced by changes in literal
form, such that translations with the same meaning but using different words score lower than those that appear
exactly the same. This is justified in assuming the manual reference translation is the “best” translation possible
and the MT version should be as similar to this as possible. For n-gram scoring, the NIST formula is:

\[
\mathrm{Score} = \sum_{n=1}^{N} \left\{ \frac{\sum_{\substack{\text{all } w_1 \ldots w_n \\ \text{that co-occur}}} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{\substack{\text{all } w_1 \ldots w_n \\ \text{in sys output}}} (1)} \right\} \cdot \exp\left\{ \beta \log^{2}\left[ \min\left( \frac{L_{sys}}{\bar{L}_{ref}},\, 1 \right) \right] \right\}
\]

where

         β is chosen to make the brevity penalty factor equal to 0.5 when the number of words in the system output is
         2/3 of the average number of words in the reference translation.

         N is the n-gram length.

         L_ref is the average number of words in a reference translation, averaged over all reference translations.

         L_sys is the number of words in the translation being scored.

         Info(w_1 … w_n) = log_2 ( number of occurrences of w_1 … w_{n-1} / number of occurrences of w_1 … w_n ),
         with the counts taken over the reference translations.


The NIST formula uses info(w1…wn) to weight the “importance” of n-grams based on their length, i.e. that
longer n-grams are less likely than shorter ones, and reduces the effects of segment length on the translation
score. The information weight is computed from n-gram counts across the set of reference translations. The
brevity penalty factor is used to minimise the impact on the score of small variations in the length of a translation.
The mteval tool enables control of the n-gram length and maximises matches by normalising case, keeping
numerical information as single words, tokenising punctuation into separate words, and concatenating adjacent
non-ASCII words into single words.
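
As a worked illustration of the formula above, the following Python sketch computes a simplified NIST-style score. It is not NIST's mteval code: clipping of repeated n-grams and mteval's text normalisation are omitted, and for unigrams the (n-1)-gram count is taken to be the total number of reference words, which is our assumption about the usual convention.

# Simplified NIST-style scorer for the formula above.  This is an
# illustrative re-implementation, not NIST's mteval: clipping of repeated
# n-grams and mteval's text normalisation are omitted.

import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def nist_score(references, hypothesis, max_n=1):
    """Score one MT output (hypothesis) against reference translations.

    references: list of token lists; hypothesis: token list.  With max_n=1
    this reduces to information-weighted word overlap, as used in our
    experiments.
    """
    # Info weights from reference n-gram counts.  Assumption: for n=1 the
    # "(n-1)-gram" count is the total number of words in the references.
    ref_counts = [Counter() for _ in range(max_n + 1)]
    total_ref_words = sum(len(ref) for ref in references)
    for ref in references:
        for n in range(1, max_n + 1):
            ref_counts[n].update(ngrams(ref, n))

    def info(gram):
        n = len(gram)
        numer = total_ref_words if n == 1 else ref_counts[n - 1][gram[:-1]]
        denom = ref_counts[n][gram]
        return math.log2(numer / denom) if denom else 0.0

    score = 0.0
    for n in range(1, max_n + 1):
        hyp_grams = ngrams(hypothesis, n)
        if not hyp_grams:
            continue
        # Sum Info over system n-grams that also occur in a reference,
        # divided by the number of n-grams in the system output.
        matched = sum(info(g) for g in hyp_grams if ref_counts[n][g] > 0)
        score += matched / len(hyp_grams)

    # Brevity penalty: beta makes the factor 0.5 when the system output is
    # 2/3 of the average reference length.
    l_sys, l_ref = len(hypothesis), total_ref_words / len(references)
    beta = math.log(0.5) / math.log(2 / 3) ** 2
    ratio = min(l_sys / l_ref, 1.0)
    penalty = math.exp(beta * math.log(ratio) ** 2) if ratio > 0 else 0.0
    return score * penalty


# Illustrative call (values will differ from mteval, which pools the info
# weights over the whole reference set):
print(nist_score(["road bridges".split()], "highway bridge".split()))  # 0.0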

In our experiments, we make the assumption that the English topic title is the reference translation, rather than
asking the translators to produce an English version from the original. For example, the English topic title “North
Street St Andrews” was translated into French as “La rue du Nord St Andrews”. We assume that if this were
translated into English again, the “best” translation would be “North Street St Andrews”. Given that the
translators used in the manual assessment were those who created the non-English translations from the English
titles in the first place, we feel this assumption can be justified.

Because manual assessment is based on translation adequacy for retrieval, the Systran version “The street of
North St Andrews” (a literal interpretation of the French version) is given a high adequacy rating even though it
differs in syntax from the reference translation “North Street St Andrews”. The result is that the NIST score for a
larger n-gram length would be low and not correlate with the score given manually (see Table 1 for more
examples). Therefore to minimise this we compute the NIST score for an n-gram length of 1 word, reducing the
measure to simply counting word overlap. In this case, the weighting function has the effect of reducing the
importance of those words occurring frequently, e.g. function words. Table 2 shows example translations and
their corresponding NIST score for Chinese translations. To use mteval, we created a reference containing the
English versions of the topic titles where each title represents a segment within a document, and a test file
containing the Systran versions in the same format.
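
For completeness, the sketch below shows how such one-title-per-segment reference and test files might be produced. The SGML-style tags, attribute names and file names here are our assumption about the input format expected by mteval-v09.pl; treat them as illustrative rather than authoritative.

# Hypothetical sketch of writing one-title-per-segment reference and test
# files.  NOTE: the SGML tags and attributes below are an assumption about
# the mteval-v09.pl input format, not a documented specification.

def write_segment_file(path, titles, set_tag, sysid, srclang="fr", trglang="en"):
    """Write each topic title as one <seg> inside a single document."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f'<{set_tag} setid="imageclef2003" srclang="{srclang}" '
                  f'trglang="{trglang}">\n<doc docid="topics" sysid="{sysid}">\n')
        for i, title in enumerate(titles, start=1):
            out.write(f'<seg id="{i}"> {title} </seg>\n')
        out.write(f'</doc>\n</{set_tag}>\n')


english_titles = ["North Street St Andrews", "Road bridges"]           # reference
systran_titles = ["The street of North St Andrews", "Highway bridge"]  # MT output

write_segment_file("reference.sgm", english_titles, "refset", "ref")
write_segment_file("test.sgm", systran_titles, "tstset", "systran")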


NIST score | Reference translation                                    | Test translation
8.1294     | Mountain scenery                                         | Mountain scenery
3.3147     | People dancing                                           | dances people
1.727      | Picture postcards by the Valentine photographic company  | the Tanzania photography company photographs scenery postcard

               Table 2                Example translations and corresponding NIST score (for Chinese)
3.3. The GLASS retrieval system
At Sheffield, we have implemented our own version of a probabilistic retrieval system called GLASS, based on
the “best match” BM25 weighting operator (see, e.g. [5]). Captions were indexed using all 8 fields, which
include a title, description, photographer, location and a set of manually assigned index categories, with the
default settings of case normalisation, stopword removal and word stemming.
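
The pre-processing applied at indexing time can be sketched as follows. The stoplist and the crude suffix-stripping function below are placeholders for illustration; they are not the stoplist or stemmer actually used in GLASS.

# Illustrative caption pre-processing: case normalisation, stopword removal
# and stemming.  The tiny stoplist and the crude suffix stripper are
# placeholders, not the components actually used by GLASS.

STOPWORDS = {"the", "of", "a", "an", "in", "on", "by", "and", "with"}


def crude_stem(word):
    """Very rough suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stopwords and stem the remaining terms."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPWORDS]


print(preprocess("Boats on Loch Lomond"))   # ['boat', 'loch', 'lomond']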

To improve document ranking using BM25, we used an approach where documents containing all query terms
were ranked higher than any other. We first identified documents containing all query terms, computed the
BM25 score and ranked these highest, followed by all other documents containing at least one query term, again
ranked by their BM25 score. The top 1000 images and captions returned for each topic title formed our entry to
ImageCLEF. Evaluation was carried out using the set of relevant images for each topic (qrels) which forms part
of the ImageCLEF test collection and the NIST information retrieval evaluation program, trec_eval4. We
evaluate retrieval effectiveness using average precision for each topic, and across topics mean average precision
(or MAP) is used.
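
A minimal sketch of this two-tier ranking is given below: documents containing all the query terms are placed above those containing only some, and each tier is ordered by BM25. The parameters k1 and b and the smoothed idf variant are assumed textbook choices, not necessarily those used in GLASS.

# Sketch of the two-tier BM25 ranking described above.  Documents containing
# all query terms are ranked above those containing only some, and each tier
# is ordered by its BM25 score.

import math
from collections import Counter


def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Standard BM25 score of one document for a bag of query terms."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = doc_freq[term]
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed idf
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score


def rank(query_terms, docs):
    """docs: dict doc_id -> list of pre-processed caption terms."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    qset, results = set(query_terms), []
    for doc_id, terms in docs.items():
        tset = set(terms)
        if not qset & tset:
            continue                       # no query term present: not retrieved
        tier = 0 if qset <= tset else 1    # tier 0: contains ALL query terms
        score = bm25(query_terms, terms, doc_freq, n_docs, avg_len)
        results.append((tier, -score, doc_id))
    return [doc_id for _, _, doc_id in sorted(results)]


docs = {
    "cap1": ["boat", "loch", "lomond"],
    "cap2": ["loch", "lomond", "view"],
    "cap3": ["boat", "harbour"],
}
print(rank(["boat", "loch", "lomond"], docs))   # ['cap1', 'cap2', 'cap3']

In practice, the top 1000 ranked captions per topic then formed the submitted run.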

4. Results

4.1. Translation quality
Figure 1 shows a stacked bar chart of manual assessment scores obtained across each language for each topic.
Each bar represents a topic, and a maximum bar height of 30 would represent every assessor rating the translation
as very good. As expected, the quality of translation depends on the topic title, although the majority of
topics receive an overall rating of no less than 50-66% of the maximum possible value. The 6 topics with the
highest overall manual rating (over 25) are topics 3 (Picture postcard views of St Andrews), 22 (Ruined castles
in England), 43 (British windmills), 45 (Harvesting), 47 (People dancing) and 49 (Musicians and their
instruments). The 2 lowest scoring topics (an overall score < 15) are topics 34 (Dogs rounding up sheep) and 48
(Museum exhibits). Some translations of these topics include:

English (original): Dogs rounding up sheep            | Museum exhibits                | Ruined castles in England
Italian:            Dogs that assemble sheep          | Exposures in museums           | Ruins of castles in England
German:             Dogs with sheep hats              | Museumaustellungssteucke       | Castle ruins in England
Dutch:              Dogs which sheep bejeendrijven    | Museumstukken                  | Ruin of castles in United Kingdom
French:             Dogs gathering of the preois      | Exposure of objects in museum  | Castles in ruins in England
Spanish:            Dogs urging on ewes               | Objects of museum              | Castles in ruins in England
Chinese:            Catches up with the sheep the dog | no translation                 | Become the ruins the English castle

[Figure 1 is a stacked bar chart with one bar per topic (1-50); each bar stacks the manual assessment scores given to the Chinese, Dutch, German, French, Italian and Spanish translations of that topic (maximum total 30).]

                         Figure 1                 Manual assessment scores for each ImageCLEF topic

4 We used a version of trec_eval supplied by UMASS.
Chinese appears to exhibit the greatest variation of scores and, from Table 3, has one of the lowest average rating
scores (Dutch being the lowest). The Chinese Systran translations are on average the shortest; 14% of the
topics get a rating of very bad (the second highest proportion) and 28% a rating of very good (the lowest). From Table 3, Italian
has the highest average manual rating, followed closely by German and Spanish, suggesting these are strong
bilingual pairings for Systran. French has the highest number of topics rated very poor, followed by Chinese and
Italian, which is perhaps surprising as French-English is claimed to be one of Systran's strongest language pairs.
Upon inspection, many of these low scores come from words which have not been translated. Italian has the
highest number of topics rated very good, followed by German and then French. Spanish has the fewest topics given a
very poor rating.

Language | Avg manual score | Avg NIST score | man-NIST correlation (Spearman's rho) | Translation length (words): Min | Max | Mean | SD | % topics with manual score of 1 (very bad) | % topics with manual score of 5 (very good) | % topics with NIST score = 0
Chinese  | 3.34 | 1.68 | 0.268* | 0 | 14 | 3.76 | 2.65 | 14% | 28% | 38%
Dutch    | 3.32 | 3.27 | 0.426* | 1 | 13 | 4.32 | 2.30 |  8% | 30% | 12%
German   | 3.64 | 3.67 | 0.492* | 0 |  9 | 3.96 | 1.85 |  8% | 44% | 10%
French   | 3.38 | 3.67 | 0.647* | 2 | 10 | 4.78 | 1.96 | 24% | 40% |  8%
Italian  | 3.65 | 2.87 | 0.184  | 1 | 11 | 5.12 | 2.05 | 12% | 50% | 18%
Spanish  | 3.64 | 3.24 | 0.295* | 1 |  8 | 4.38 | 1.52 |  6% | 34% | 10%
*correlation significant at p<0.01

           Table 3                 A summary of manual and automatic topic assessment for each source language

Figure 2 shows a stacked bar chart of the automatic ratings of each topic (the Y axes between the manual and
automatic graphs are not comparable) and immediately we see a much larger degree of variation across topics.
From Table 3, Chinese also has the lowest average NIST score (1.68), which can be explained by the large
proportion of topics with a zero score (38%). From Table 3, German and French have the highest average NIST
score, followed by Dutch and Spanish.

[Figure 2 is a stacked bar chart with one bar per topic (1-50); each bar stacks the automatic NIST scores of the Chinese, Dutch, German, French, Italian and Spanish translations of that topic.]

                             Figure 2                Automatic NIST scores for each ImageCLEF topic

Table 4 shows the translations with a zero NIST score where the reference and Systran translations have no
words which overlap. In many cases, however, this is simply because different words are used to express the
same concept, or lexical variations of the word (such as plurals) are used instead. For information retrieval, this
is important because if a simple word co-occurrence model is used with no lexical expansion, the queries may
not match documents (although in some cases the lexical variations will recover these). This highlights one of
the limitations of using mteval for assessing translation quality in CLIR, particularly when the queries are short.
Language | Reference translation            | Systran version                       | Manual score
Chinese  | Woodland scenes                  | Forest scenery                        | 5
Chinese  | Scottish marching bands          | no translation                        | 1
Chinese  | Tea rooms by the seaside         | Seashore teahouse                     | 5
Chinese  | Portraits of Mary Queen of Scots | no translation                        | 1
Chinese  | Boats on Loch Lomond             | In Luo river Mongolia lake ships      | 2
Chinese  | Culross abbey                    | Karohs overhaul Daoist temple         | 3
Chinese  | Road bridges                     | Highway bridge                        | 5
Chinese  | Ruined castles in England        | Becomes the ruins the English castle  | 4
Chinese  | Portraits of Robert Burns        | no translation                        | 4
Chinese  | Glasgow before 1920              | no translation                        | 1
Chinese  | Male portraits                   | Men’s portrait                        | 5
Chinese  | The mountain Ben Nevis           | Nepali Uygur peak                     | 2
Chinese  | Churches with tall spires        | Has the high apex the churches        | 4
Chinese  | Men holding tennis racquets      | no translation                        | 1
Chinese  | A coat of arms                   | 纹章                                  | 1
Chinese  | British windmills                | England’s windmill                    | 4
Chinese  | Waterfalls in Wales              | Well’s waterfall                      | 2
Chinese  | Harvesting                       | Harvests                              | 5
Chinese  | Museum exhibits                  | no translation                        | 1
French   | Woodland scenes                  | Scenes of forests                     | 1
French   | Waterfalls in Wales              | Water falls to the country of Scales  | 1
French   | Harvesting                       | Harvest                               | 5
French   | Mountain scenery                 | Panorama mountaineer                  | 3
German   | Glasgow before 1920              | No translation                        | 1
German   | Male portraits                   | Portraets of men                      | 1
German   | Harvesting                       | Harvests                              | 5
German   | Welsh national dress             | Walisi tract                          | 1
German   | Museum exhibits                  | Museumaustellungsstuecke              | 1
Italian  | Woodland scenes                  | Scene of a forest                     | 5
Italian  | Tea rooms by the seaside         | It knows it from te’ on lungomare     | 1
Italian  | Wartime aviation                 | Air in time of war                    | 4
Italian  | People using spinning machines   | Persons who use a filatoio            | 5
Italian  | British windmills                | English flour mills                   | 2
Italian  | Harvesting                       | Harvesters                            | 5
Italian  | Welsh national dress             | Dressed traditional Welshman          | 3
Italian  | People dancing                   | Persons who dance                     | 5
Italian  | Museum exhibits                  | Exposures in museums                  | 4
Spanish  | Woodland scenes                  | A forest                              | 5
Spanish  | Wartime aviation                 | Aviators in time military             | 2
Spanish  | Male portraits                   | Picture of a man                      | 4
Spanish  | Museum exhibits                  | Objects of museum                     | 2
Spanish  | Mountain scenery                 | Vista of mountains                    | 1
Dutch    | Woodland scenes                  | bunch faces                           | 1
Dutch    | Road bridges                     | Viaducts                              | 4
Dutch    | Men cutting peat                 | Trurfstekers                          | 1
Dutch    | Harvesting                       | harvest                               | 2
Dutch    | Museum exhibits                  | Museumstukken                         | 1
Dutch    | Mountain scenery                 | Mount landscapes                      | 2

                             Table 4               Translations with a NIST score of 0

These differences also contribute to the lack of correlation between the manual and automatic assessments
(shown in Table 3). For Chinese, Systran sometimes produces no translation at all (given a manual score of 1), and
there appear to be more cases where the translation has gone seriously wrong. For Dutch, erroneous translations
are also caused by the incorrect translation of compounds (which also occurs in German).

The most highly correlated scores are between the assessments for French (using Spearman's rho), suggesting
that topics which receive a high manual assessment also receive a high automatic score, thereby supporting the
use of an automatic evaluation tool to assess translation quality for CLIR (particularly for French).
The correlation between manual and automatic results is not consistent across languages, however: the
correlation for Italian, for example, is the lowest and is not significant (at p<0.01). From Table 4, many Italian
translations are rated highly by manual assessment, and the kinds of translations involved suggest that the problem
derives from the inability of mteval to recognise semantic equivalence between translations.
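
The per-topic correlations reported in Table 3 (and later in Table 5) are Spearman rank correlations over the fifty topics. A minimal sketch of how such a correlation can be computed with SciPy is shown below; the two score lists are placeholders, not the actual assessment data.

# Minimal sketch of the per-topic rank correlation reported in Tables 3 and 5,
# computed with SciPy's Spearman correlation.  The score lists below are
# placeholders, not the actual ImageCLEF assessment data.

from scipy.stats import spearmanr

manual_scores = [5, 3, 1, 4, 2, 5, 4]              # per-topic manual adequacy ratings
nist_scores = [7.2, 2.1, 0.0, 5.5, 1.3, 6.8, 4.9]  # per-topic automatic NIST scores

rho, p_value = spearmanr(manual_scores, nist_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.4f})")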

4.2. Retrieval performance
Figure 3 shows a graph of recall versus precision across all topics and for each language using the strict
intersection5 set of ImageCLEF relevance judgments. The graph follows a typical pattern: as the number of
relevant documents found increases (recall), precision decreases because relevant documents appear at lower
rank positions. As with other results from CLIR experiments, the monolingual results are higher than those
for translated queries, showing that the translated queries do not retrieve as well. Chinese has the lowest
precision-recall curve, noticeably lower than the rest of the languages, which bunch together and follow a similar
shape. The French curve is the highest of the cross-language runs, which matches Table 3, where French has the
highest average NIST score, the fewest topics with a zero NIST score, and a high proportion of topics with a high
manual assessment rating.


[Figure 3 is a precision-recall graph (precision against recall, both 0-1) with one curve per run: Mono (English), German, Spanish, French, Italian, Dutch and Chinese.]

                                       Figure 3             Precision-recall graph for the Sheffield entry

Figure 4 provides a breakdown of average precision for each topic: the stacked bar chart shows average
precision for monolingual retrieval together with mean average precision across all languages excluding English. Individual
languages perform better or worse on each topic (depending on the quality of translation), but the graph
provides an overall indication of topic difficulty, making analysis clearer. Across all languages (excluding English)
and topics, the mean average precision is 0.420 (with a standard deviation of 0.23), which is on average 75% of
monolingual performance (Table 5 shows the breakdown across languages).

Topics which perform poorly include 4 (seating inside a church), 5 (woodland scenes), 29 (wartime aviation), 41
(a coat of arms) and 48 (museum exhibits). These have average NIST scores of 2.63, 0.64, 2.80, 3.71 and 3.83
respectively, and manual ratings of 3, 3.7, 4.17, 3.5 and 1.83 respectively. In some cases the translation quality
is high but retrieval is low, e.g. topic 29, because relevance assessment for cross language image retrieval is
based upon both the image and the caption. There are cases where images are not relevant even though they contain query
terms in the caption, e.g. the image is too small or too dark, the object of interest is obscured or in the background,
or the caption contains words which do not describe the image contents (e.g. matches on fields such as the
photographer, or notes which provide background meta-information).

5 Strict intersection is the smallest set of relevant documents, including only those marked as relevant by both assessors
(not including those judged as partially relevant).

[Figure 4 is a stacked bar chart: for each topic (1-50), the monolingual (English) average precision is stacked with the mean average precision across the non-English languages.]

 Figure 4                                            Monolingual average precision and MAP across systems (excluding English) for each topic

Table 5 summarises retrieval performance for each language, and also shows the correlation between the
manual/automatic assessments of translation quality and average precision for each language. We find that French
has the highest MAP score (78% of monolingual), followed by German (75%) and Spanish (73%).
On average, MAP and translation quality are correlated (using Spearman's rho with p<0.01) for
both the manual and automatic assessments, which suggests that higher quality translation does in general give better
retrieval performance, particularly for Chinese, German and French (manual assessments) and Spanish,
French and Dutch (automatic assessments).

Language    | Mean Average Precision (MAP) | MAP-manual correlation | MAP-NIST correlation | % of monolingual
Chinese     | 0.285                        | 0.472*                 | 0.384*               | 51%
Dutch       | 0.390                        | 0.412*                 | 0.426*               | 69%
German      | 0.423                        | 0.503*                 | 0.324*               | 75%
French      | 0.438                        | 0.460*                 | 0.456*               | 78%
Italian     | 0.405                        | 0.394*                 | 0.378*               | 72%
Spanish     | 0.408                        | -0.061                 | 0.462*               | 73%
Monolingual | 0.562                        | -                      | -                    | -
*correlation significant at p<0.01

                    Table 5                 A summary of retrieval performance and its correlation with translation quality

We might expect MAP to correlate well with the NIST score for the GLASS system because both are based on
word co-occurrence, but it is interesting to note that retrieval effectiveness is correlated just as highly with the
manual assessments, even though the correlation between the manual and automatic assessments is not always itself
high. This is useful because it shows that, for this CLIR task, the quality of translation has a significant impact on
retrieval, which in general allows retrieval effectiveness to act as an indicator of translation quality. The remaining
variation may be due to the relevance assessments, the IR system, pseudo relevance feedback or the use of other
retrieval-enhancing methods.
5. Conclusions and future work
We have shown that cross language image retrieval for the ImageCLEF ad hoc task is possible with little or no
knowledge of CLIR and without requiring linguistic resources. Using Systran as a translation “black-box” requires
little effort, but at the price of having no control over translation and no means of recovering when translation goes
wrong. In particular, Systran provides only one translation, which may not be correct; CLIR would be better served
if several alternatives were output. There are many cases where proper names are mistranslated,
words with diacritics are not interpreted properly, and words are translated incorrectly because of the limited
context. Because the task of CLIR does not necessarily require syntactic correctness, we find Systran can be used
successfully for translation between a wide range of language pairs, where essentially we make use only of the
large dictionaries maintained by Systran.

We evaluated the quality of translation using both manual assessments and an automatic tool used extensively in
MT evaluation. We find that Systran's translation quality varies between languages according to both the manual
and automatic scores, which are correlated, sometimes highly, for all languages. There are, however, limitations
of the automatic tool; addressing these would improve its correlation with query quality in CLIR evaluation, for
example by resolving literal equivalents for semantically similar terms, reducing words to their stems, removing
function words, and perhaps using a different weighting scheme for query terms (e.g. weighting proper names
highly). We aim to experiment further with semantic equivalents using Wordnet, and also to assess whether
correlation between the manual and automatic scores can be improved by using longer n-gram lengths.

Using a probabilistic retrieval system, we obtain a mean average precision score which is 75% of the
monolingual score. Although Chinese retrieval is lowest at 51%, this would still provide multi-lingual access to
the ImageCLEF test collection, albeit needing improvement. Also, given that the task is not purely text, but also
involves images, this score may be improved using content-based methods of retrieval. We aim to experiment
with pseudo relevance feedback, and in particular improve performance using query expansion based on
EuroWordnet, a European version of Wordnet.

As a retrieval task, we have shown that translation quality does affect retrieval performance because of the
correlation between manual assessments and retrieval performance, implying that in general, higher translation
quality results in higher retrieval performance. We have also shown that for some languages, the manual
assessments correlate well with the automatic assessment suggesting this method could be used to measure
translation quality given a CLIR test collection.

6. Acknowledgments
We would like to thank members of the Natural Language Processing group and Department of Information
Studies for their time and effort in producing manual assessments. Thanks also to Hideo Joho for help and
support with the GLASS system, and in particular his modified BM25 ranking algorithm, and thanks to NTU for
providing Chinese versions of the ImageCLEF titles. This work was carried out within the Eurovision project at
Sheffield University, funded by the EPSRC (Eurovision: GR/R56778/01).

7. References
[1] P. Clough and M. Sanderson. The CLEF 2003 cross language image retrieval task. In Proceedings of CLEF2003, 2003.

[2] Heisoft. How does Systran work? http://www.heisoft.de/volltext/systran/dok2/howworke.htm (site visited July 2003)

[3] National Institute of Standards and Technology (NIST). Automatic Evaluation of Machine Translation Quality Using N-
    gram Co-Occurrence Statistics. 2002. http://www.nist.gov/speech/tests/mt/resources/scoring.htm

[4] C. Peters and M. Braschler. Cross-Language System Evaluation: The CLEF Campaigns. In Journal of the American
    Society for Information Science and Technology, 52(12), 1067-1072, 2001.

[5] S. Robertson, S. Walker and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In
    NIST Special Publication 500-242: TREC-7, pp. 253-264, Gaithersburg, MD, 1998.

[6] Systran. The SYSTRAN linguistics platform: A software solution to manage multilingual corporate knowledge. White
    paper. 2002. http://www.systransoft.com/Technology/SLP.pdf

[7] E.M. Voorhees and D. Harman. Overview of TREC 2001, In Proceedings of TREC2001, NIST, 2001.