=Paper=
{{Paper
|id=Vol-1169/CLEF2003wn-ImageCLEF-CloughEt2003b
|storemode=property
|title=Sheffield at ImageCLEF 2003
|pdfUrl=https://ceur-ws.org/Vol-1169/CLEF2003wn-ImageCLEF-CloughEt2003b.pdf
|volume=Vol-1169
|dblpUrl=https://dblp.org/rec/conf/clef/CloughS03c
}}
==Sheffield at ImageCLEF 2003==
Paul Clough and Mark Sanderson
University of Sheffield
p.d.clough|m.sanderson@sheffield.ac.uk
Abstract
In this paper, we use the Systran machine translation system for translating queries for cross
language image retrieval in a pilot experiment at CLEF 2003, called ImageCLEF. The
approach we have taken is to assume we have little experience in CLIR, few available
resources and a limited time in which to create a working CLIR system for this task. In this
preliminary study, we investigate the effectiveness of Systran on short queries by comparing a
manual assessment of translation adequacy with an automatic score derived using NIST's
mteval evaluation tool for machine translation output. We discuss the kinds of translation
errors encountered during this analysis and show the impact on retrieval effectiveness for
individual queries in the ImageCLEF task.
1. Introduction
The core component of a Cross Language Information Retrieval (CLIR) system is the method used to translate
the query from the source language into the language of the document collection (the target language). However,
this component involves specialised IR knowledge and familiarity with the source and target languages, or does
it? Imagine you are a company or organisation without these kinds of resources and you want a quick-fix
solution to a cross language problem. Can anything be done without buying in the necessary expertise? Can you
also evaluate how well the translation process has done without being able to understand the source language?
We would argue yes, although not without requiring certain resources (e.g. a CLIR test collection).
We experiment with using a “black-box” translation module: Systran, one of the oldest commercial machine
translation systems, which is widely used in industry and available for free via a Web-based interface. Our
experiences with using Systran have found that no multilingual processing is necessary as would normally be
required when dealing with cross language retrieval, e.g. tokenisation, case and diacritic normalisation,
decompounding and morphological analysis. This sounds ideal, but are there any problems with using Systran to
perform the translation? What kinds of translation errors are encountered, does translation quality vary across
different source languages, and how does translation quality affect retrieval performance? These are the kinds of
questions we wanted to address in this study which formed our entry for ImageCLEF 2003.
The ImageCLEF test collection can be used to evaluate retrieval performance, but does not necessarily reflect the
quality of translation because many factors other than translation might affect performance, e.g. the retrieval
system, retrieval enhancements such as query expansion, the relevance assessments or the use of content-based
retrieval methods. Therefore, to enable us to investigate where translation errors occur and assess the success of
Systran independently from retrieval, we manually assess translation adequacy, and show whether this correlates
with an automated approach to measuring translation quality as used in MT evaluation.
2. Background
2.1. The ImageCLEF task
ImageCLEF is a pilot experiment run at CLEF 2003, dealing with the retrieval of images by their captions in
cases where the source and target languages differ (see [1] for further information about ImageCLEF). Because
the document to be retrieved is both visual and textual, approaches to this task may involve the use of both
multimodal and multilingual retrieval methods. The primary task at this year’s ImageCLEF is an ad hoc retrieval
task in which fifty topics were selected for retrieval and described using a topic title and narrative. Only the title
is translated into Dutch, Italian, Spanish, French, German and Chinese, and is therefore suitable for CLIR. As co-
ordinators of this task, we found that assessors used both the image and the caption during their judgment for
relevance, and therefore we know that this task involves more than just CLIR. Further challenges include: (1)
captions that are typically short in length, (2) images that vary widely in their content and quality, and (3) short
user search requests which provide little context for translation.
2.2. Systran
As a translation system, Systran is considered by many as a direct MT system (because the whole process relies
on dictionary lookup between a source and target language), although the stages resemble a transfer-based MT
system. Currently the on-line version of Systran offers bi-directional translation between 20 language pairs,
including languages from Western Europe, Asia, Eastern Europe, and in 2004 they plan to release English-
Arabic.
There are essentially three stages to Systran: analysis, transfer and synthesis. The first stage, analysis, pre-
processes the source text and performs functions such as character set conversion, spelling correction, sentence
segmentation, tokenisation, and POS tagging. Also during the analysis phase, Systran performs partial analysis
on sentences from the source language, capturing linguistic information such as predicate-argument relations,
major syntactic relationships, identification of noun phrases and prepositional phrase attachment using their own
linguistic formalism and dictionary lookup.
After analysis of the source language, the second process of transfer aims to match with the target language
through dictionary lookup, and then apply rules to re-order the words according to the target language syntax, e.g.
restructure propositions and expressions. The final synthesis stage tidies up the target text and determines
grammatical choice to make the result coherent. This stage relies heavily on large tables of rules to make its
decisions. For more information, consult [2] and [6].
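As a purely illustrative sketch (this is not Systran's implementation, which relies on very large dictionaries and rule tables), the three stages can be pictured as follows; the tiny dictionary and the single reordering rule are invented for the example.

```python
# Toy illustration of an analysis -> transfer -> synthesis pipeline.
# This is NOT Systran's implementation; the dictionary and the single
# reordering rule are invented purely for illustration.

FR_EN = {"la": "the", "rue": "street", "du": "of-the", "nord": "north"}

def analyse(source):
    """Analysis: normalise case and tokenise the source sentence."""
    return source.lower().split()

def transfer(tokens):
    """Transfer: dictionary lookup, leaving unknown words untranslated."""
    return [FR_EN.get(tok, tok) for tok in tokens]

def synthesise(tokens):
    """Synthesis: apply a crude reordering rule ('X of-the Y' -> 'Y X')
    and tidy up the surface form."""
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok == "of-the" and 0 < i < len(out) - 1:
            out[i - 1], out[i + 1] = out[i + 1], out[i - 1]
            out[i] = ""
    return " ".join(t for t in out if t).capitalize()

if __name__ == "__main__":
    # "La rue du Nord" -> roughly "The north street" under this toy rule set.
    print(synthesise(transfer(analyse("La rue du Nord"))))
```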
3. Experimental setup
3.1. Manual assessment of translation quality
Assessing the quality of the output produced by a machine translation (MT) system offers a challenging problem
to researchers. Organisations such as DARPA and NIST have established the necessary resources and framework
in which to experiment with, and evaluate, MT systems as part of managed competitions, similar to the TREC
(see, e.g. [7]) and CLEF (see, e.g. [4]) campaigns. For manual evaluation1, three dimensions upon which to base
judgment include translation adequacy, fluency and informativeness. Translation quality is normally assessed
across an entire document when measuring fluency and informativeness, but adequacy is assessed between
smaller units (e.g. paragraphs or sentences) which provide a tighter and more direct semantic relationship.
To assess adequacy, a high quality reference translation and the output from an MT system are divided into
segments to evaluate how well the meaning is conveyed between the versions. Fluency measures how well the
translation conveys its content with regards to how the translation is presented and involves no comparison with
the reference translation. Informativeness measures how well an assessor has understood the content of a
translated document by asking them questions based on the translation and assessing the number answered
correctly.
Given titles from the ImageCLEF test collection in Chinese, Dutch, French, Spanish, German and Italian, we first passed these through the on-line version of Systran to translate them into English, the language of the ImageCLEF document collection. We then asked assessors to judge the adequacy of each translation, assuming the English translation would be the one submitted to a retrieval system for an ad hoc task. Translators who had previously been involved with creating the ImageCLEF test collection were chosen to assess translation quality because of their familiarity with the topics and the collection, with each assessor given topics in their native language.
Translators were asked to assess topic titles2 in the source language with the Systran English version and make a
judgment on how well the translation captured the meaning of the original (i.e. how adequate the translated
version would be for retrieval purposes). A five-point scale was used to assess translation quality, with a score of 5 representing a very good translation (i.e. the same or semantically-equivalent words and syntax) and a score of 1 representing a very bad translation (i.e. no translation, or the wrong words used altogether). Assessors were asked to take into account the “importance”
of translation errors in the scoring, e.g. for retrieval purposes, mis-translated proper nouns might be considered
worse than other parts-of-speech.
1. See, e.g. the TIDES translation pages: http://www.ldc.upenn.edu/Projects/TIDES/
2. In cases of multiple translations, we used the first translation.
3. We used mteval-v09.pl which can be downloaded from: http://www.nist.gov/speech/tests/mt
Table 1 shows an example topic title for each language and translation score for very good to good (5-4), okay (3)
and bad to very bad (2-1) to provide an idea of the degree of error for these adequacy scores. We find that
assessment varies according to each assessor, some being stricter than others, which suggests that further manual assessments may help to reduce subjectivity. In some cases, particularly for Spanish, the source language title contains a spelling mistake which obviously affects translation quality. Some assessors allowed for this in their rating, others did not, suggesting the need to manually check all topics for errors prior to evaluation.
| Source language | Adequacy rating | Source | Systran English | Reference English |
|---|---|---|---|---|
| Chinese (simplified) | 4-5 | 圣安德鲁斯风景的明信片 | Saint Andrews scenery postcard | Picture postcard views of St Andrews |
| Chinese (simplified) | 3 | 战争造成的破坏 | The war creates destruction | Damage due to war |
| Chinese (simplified) | 1-2 | 大亚茅斯海滩 | Asian Mao si beach | Great Yarmouth beach |
| Dutch | 4-5 | Mannen en vrouwen die vis verwerken | men and women who process fish | men and women processing fish |
| Dutch | 3 | Vissers gefotografeerd door Adamson | Fisherman photographed Adamson | Fishermen by the photographer Adamson |
| Dutch | 1-2 | Muzikanten en hun instrumenten | Muzikanten and their instruments | Musicians and their instruments |
| German | 4-5 | Baby im Kinderwagen | Baby in the buggy | A baby in a pram |
| German | 3 | Portät der schottischen Königin Mary | Portraet of the Scottish Queen Mary | Portraits of Mary Queen of Scots |
| German | 1-2 | Museumaustellungsstücke | Museumaustellungsstuecke | Museum exhibits |
| French | 4-5 | La rue du Nord St Andrews | The street of North St Andrews | North Street St Andrews |
| French | 3 | Bateaux sur Loch Lomond | Boats on Lomond log | Boats on Loch Lomond |
| French | 1-2 | Damage de guerre | Ramming of war | Damage due to war |
| Italian | 4-5 | Banda Scozzese in marcia | Scottish band in march | Scottish marching bands |
| Italian | 3 | Vestito tradizionale gallese | Dressed traditional Welshman | Welsh national dress |
| Italian | 1-2 | Il monte Ben Nevis | The mount Very Nevis | The mountain Ben Nevis |
| Spanish | 4-5 | El aforo de la iglesia | Chairs in a church | Seating inside a church |
| Spanish | 3 | Puentes en la carretera | Bridges in the highway | Road bridges |
| Spanish | 1-2 | las montañas de Ben Nevis | Mountains of Horseradish tree Nevis | The mountain Ben Nevis |

Table 1 Example adequacy ratings assigned manually
Table 1 highlights some of the errors produced by the MT system: (1) un-translated words, e.g. “Muzikanten and
their instruments”, (2) incorrect translation of proper nouns, e.g. “Bateaux sur Loch Lomond” translated as
“Boats on Lomond Log” and “Il monte Ben Nevis” translated as “the mount Very Nevis”, and (3) mis-
translations, e.g. “damage de guerre” translated as “ramming of war”. The limited context of the topic titles also
produces errors where Systran produces the wrong meaning of a word, e.g. “Scottish blowing chapels” where
kapelle is mis-translated as chapel, rather than the correct word band. However, Systran does seem to be able to
handle different entry formats for diacritics (accents above characters), which play an important part in selecting the correct translation of a word, e.g. in the query “Casas de te’ en la costa” (tea rooms by the seaside), the word te’ is translated correctly as té (tea) rather than te (you).
3.2. Automatic assessment of translation quality
Although most accurate (and most subjective), manual evaluation is time-consuming and expensive, therefore
automatic approaches to assess translation quality have also been proposed, such as the NIST mteval3 tool. This
approach divides documents into segments and computes co-occurrence statistics based on the overlap of word
n-grams between a reference translation produced manually and an MT version. This method has been shown to
correlate well with adequacy, fluency and informativeness because n-grams capture both lexical overlap and
syntactic structure [3].
In the latest version of mteval, two metrics are used to compute translation quality: IBM’s BLEU and NIST’s
own score. Both measures are based on n-gram co-occurrence, although a modified version of NIST’s score has
been shown to be the preferred measure. These scores assume that the reference translation is of high quality,
and that documents assessed are from the same genre. Both measures are also influenced by changes in literal
form, such that translations with the same meaning but using different words score lower than those that appear
exactly the same. This is justified in assuming the manual reference translation is the “best” translation possible
and the MT version should be as similar to this as possible. For n-gram scoring, the NIST formula is:
Score = \sum_{n=1}^{N} \left\{ \frac{\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{\text{all } w_1 \ldots w_n \text{ in sys output}} (1)} \right\} \cdot \exp\left\{ \beta \log^2 \left[ \min\left( \frac{L_{sys}}{L_{ref}}, 1 \right) \right] \right\}    (1)

where
β is chosen to make the brevity penalty factor = 0.5 when the number of words in the system output is 2/3 of the average number of words in the reference translation,
N is the n-gram length,
L_{ref} is the average number of words in a reference translation, averaged over all reference translations,
L_{sys} is the number of words in the translation being scored, and
Info(w_1 \ldots w_n) = \log_2 \left( \frac{\text{number of occurrences of } w_1 \ldots w_{n-1}}{\text{number of occurrences of } w_1 \ldots w_n} \right).
The NIST formula uses info(w1…wn) to weight the “importance” of n-grams based on their length, i.e. that
longer n-grams are less likely than shorter ones, and reduces the effects of segment length on the translation
score. The information weight is computed from n-gram counts across the set of reference translations. The
brevity penalty factor is used to minimise the impact on the score of small variations in the length of a translation.
The mteval tool enables control of the n-gram length and maximises matches by normalising case, keeping
numerical information as single words, tokenising punctuation into separate words, and concatenating adjacent
non-ASCII words into single words.
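To make the scoring concrete, the following is a minimal single-reference sketch of equation (1) in Python. It is not the official mteval implementation (which handles multiple references, its own tokenisation and segment-level bookkeeping); the clipping of n-gram matches to reference counts is our own simplification.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def make_info(references, max_n):
    """Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)), with counts
    taken over the reference corpus (here, the English topic titles)."""
    counts = Counter()
    total_words = sum(len(ref) for ref in references)
    for ref in references:
        for n in range(1, max_n + 1):
            counts.update(ngrams(ref, n))
    def info(gram):
        prefix_count = counts[gram[:-1]] if len(gram) > 1 else total_words
        return math.log2(prefix_count / counts[gram]) if counts[gram] else 0.0
    return info

def nist_score(sys_tokens, ref_tokens, all_references, max_n=1):
    """NIST-style score (equation 1): info-weighted n-gram precision times a
    brevity penalty. With max_n=1 it reduces to weighted word overlap."""
    if not sys_tokens or not ref_tokens:
        return 0.0
    info = make_info(all_references, max_n)
    score = 0.0
    for n in range(1, max_n + 1):
        sys_grams = ngrams(sys_tokens, n)
        ref_grams = Counter(ngrams(ref_tokens, n))
        matched = 0.0
        for gram in sys_grams:
            if ref_grams[gram] > 0:       # count each reference n-gram once
                matched += info(gram)
                ref_grams[gram] -= 1
        if sys_grams:
            score += matched / len(sys_grams)
    # Brevity penalty: beta makes the factor 0.5 when |sys| = (2/3)|ref|.
    beta = math.log(0.5) / math.log(2.0 / 3.0) ** 2
    ratio = min(len(sys_tokens) / len(ref_tokens), 1.0)
    return score * math.exp(beta * math.log(ratio) ** 2)
```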
In our experiments, we make the assumption that the English topic title is the reference translation, rather than
ask the translators to produce an English version from an original. For example, the English topic title “North
Street St Andrews” was translated into French as “La rue du Nord St Andrews”. We assume that if this were
translated into English again, the “best” translation would be “North Street St Andrews”. Given that the
translators used in the manual assessment were those who created the non-English translations from the English
titles in the first place, we feel this assumption can be justified.
Because manual assessment is based on translation adequacy for retrieval, the Systran version “The street of
North St Andrews” (a literal interpretation of the French version) is given a high adequacy rating even though it
differs in syntax from the reference translation “North Street St Andrews”. The result is that the NIST score for a
larger n-gram length would be low and not correlate with the score given manually (see Table 1 for more
examples). Therefore to minimise this we compute the NIST score for an n-gram length of 1 word, reducing the
measure to simply counting word overlap. In this case, the weighting function has the effect of reducing the
importance of those words occurring frequently, e.g. function words. Table 2 shows example translations and
their corresponding NIST score for Chinese translations. To use mteval, we created a reference containing the
English versions of the topic titles where each title represents a segment within a document, and a test file
containing the Systran versions in the same format.
| NIST score | Reference translation | Test translation |
|---|---|---|
| 8.1294 | Mountain scenery | Mountain scenery |
| 3.3147 | People dancing | dances people |
| 1.727 | Picture postcards by the valentine photographic company | the Tanzania photography company photographs scenery postcard |

Table 2 Example translations and corresponding NIST score (for Chinese)
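As a usage sketch, the titles in Table 2 can be scored with the nist_score function above. The tokenisation here is naive lower-cased whitespace splitting rather than mteval's own normalisation, so the absolute values will not match Table 2 exactly.

```python
# Score two of the Chinese Systran outputs from Table 2 against the English
# reference titles, using the nist_score sketch above with n = 1.
reference_titles = ["Mountain scenery", "People dancing"]
systran_titles = ["Mountain scenery", "dances people"]

refs = [title.lower().split() for title in reference_titles]
for ref_tokens, systran in zip(refs, systran_titles):
    score = nist_score(systran.lower().split(), ref_tokens, refs, max_n=1)
    print(f"{systran!r}: NIST-style score = {score:.3f}")
```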
3.3. The GLASS retrieval system
At Sheffield, we have implemented our own version of a probabilistic retrieval system called GLASS, based on
the “best match” BM25 weighting operator (see, e.g. [5]). Captions were indexed using all 8 fields (which include a title, description, photographer, location and a set of manually assigned index categories), with the default settings of case normalisation, stopword removal and word stemming.
To improve document ranking using BM25, we used an approach where documents containing all query terms
were ranked higher than any other. We first identified documents containing all query terms, computed the
BM25 score and ranked these highest, followed by all other documents containing at least one query term, again
ranked by their BM25 score. The top 1000 images and captions returned for each topic title formed our entry to
ImageCLEF. Evaluation was carried out using the set of relevant images for each topic (qrels) which forms part
of the ImageCLEF test collection and the NIST information retrieval evaluation program, trec_eval4. We
evaluate retrieval effectiveness using average precision for each topic and, across topics, mean average precision (MAP).
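GLASS itself is not shown in code here, but the following sketch illustrates the two-tier ranking described above: a standard BM25 score (with typical default parameters, not necessarily those used in GLASS), documents containing all query terms ranked above those containing only some, and uninterpolated average precision as computed by trec_eval. Query and document tokens are assumed to be already stemmed and stopped.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, df, num_docs, avg_len, k1=1.2, b=0.75):
    """BM25 weighting (see Robertson et al. [5]); k1 and b are common
    default values, not necessarily those used in GLASS."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if tf[term] == 0:
            continue
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = (tf[term] * (k1 + 1)) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

def rank(query_terms, docs):
    """Two-tier ranking: documents containing ALL query terms first (ordered
    by BM25), then documents containing at least one query term."""
    df = Counter(t for terms in docs.values() for t in set(terms))
    num_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / num_docs
    scored = []
    for doc_id, terms in docs.items():
        matched = set(query_terms) & set(terms)
        if not matched:
            continue                      # ignore documents with no query term
        tier = 0 if matched == set(query_terms) else 1
        scored.append((tier, -bm25(query_terms, terms, df, num_docs, avg_len), doc_id))
    return [doc_id for _, _, doc_id in sorted(scored)]

def average_precision(ranking, relevant):
    """Uninterpolated average precision for one topic; MAP is the mean of
    this value over all topics."""
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```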
4. Results
4.1. Translation quality
Figure 1 shows a stacked bar chart of manual assessment scores obtained across each language for each topic.
Each bar represents a topic and a maximum bar height of 30 would represent each assessor rating the translation
as very good. As expected, the quality of translation is dependent on the topic title, although the majority of topics receive an overall rating of no less than 50-66% of the maximum possible value. The 6 topics with the
highest overall manual rating (over 25) are topics 3 (Picture postcard views of St Andrews), 22 (Ruined castles in England), 43 (British windmills), 45 (Harvesting), 47 (People dancing) and 49 (Musicians and their instruments). The 2 lowest scoring topics (an overall score < 15) are topics 34 (Dogs rounding-up sheep) and 48 (Museum exhibits). Some translations of these topics include:
| Language | Dogs rounding up sheep | Museum exhibits | Ruined castles in England |
|---|---|---|---|
| English | Dogs rounding up sheep | Museum exhibits | Ruined castles in England |
| Italian | Dogs that assemble sheep | Exposures in museums | Ruins of castles in England |
| German | Dogs with sheep hats | Museumaustellungssteucke | Castle ruins in England |
| Dutch | Dogs which sheep bejeendrijven | Museumstukken | Ruin of castles in United Kingdom |
| French | Dogs gathering of the preois | Exposure of objects in museum | Castles in ruins in England |
| Spanish | Dogs urging on ewes | Objects of museum | Castles in ruins in England |
| Chinese | Catches up with the sheep the dog | no translation | Become the ruins the English castle |
Figure 1 Manual assessment scores for each ImageCLEF topic (stacked bar chart: one bar per topic, 1-50, stacking the ratings for Chinese, Dutch, German, French, Italian and Spanish; maximum possible height 30)
4. We used a version of trec_eval supplied by UMASS.
Chinese appears to exhibit the greatest variation of scores, and from Table 3 has one of the lowest average rating
scores (Dutch being the lowest). The Chinese Systran translations are on average the shortest and 14% of the
topics get a rating of very bad (3rd highest), and 28% a rating of very good (the lowest). From Table 3, Italian
has the highest average manual rating, followed closely by German and Spanish, suggesting these are strong bilingual pairings for Systran. French has the highest number of topics rated very poor, followed by Chinese and Italian, which is perhaps surprising as French-English is claimed to be one of Systran’s strongest translations. Upon inspection, many of these low scores are from words which have not been translated. Italian has the highest number of topics rated very good, followed by German and then French. Spanish has the fewest topics given a very poor rating.
| Language | Avg manual score | Avg NIST score | Manual-NIST correlation (Spearman's rho) | Translation length, words (min / max / mean / SD) | % topics with manual score of 1 (very bad) | % topics with manual score of 5 (very good) | % topics with NIST score = 0 |
|---|---|---|---|---|---|---|---|
| Chinese | 3.34 | 1.68 | 0.268* | 0 / 14 / 3.76 / 2.65 | 14% | 28% | 38% |
| Dutch | 3.32 | 3.27 | 0.426* | 1 / 13 / 4.32 / 2.30 | 8% | 30% | 12% |
| German | 3.64 | 3.67 | 0.492* | 0 / 9 / 3.96 / 1.85 | 8% | 44% | 10% |
| French | 3.38 | 3.67 | 0.647* | 2 / 10 / 4.78 / 1.96 | 24% | 40% | 8% |
| Italian | 3.65 | 2.87 | 0.184 | 1 / 11 / 5.12 / 2.05 | 12% | 50% | 18% |
| Spanish | 3.64 | 3.24 | 0.295* | 1 / 8 / 4.38 / 1.52 | 6% | 34% | 10% |

*correlation significant at p<0.01
Table 3 A summary of manual and automatic topic assessment for each source language
Figure 2 shows a stacked bar chart of the automatic ratings of each topic (the Y axes between the manual and
automatic graphs are not comparable) and immediately we see a much larger degree of variation across topics.
From Table 3, Chinese also has the lowest average NIST score (1.68), which can be explained by the large
proportion of topics with a zero score (38%). From Table 3, German and French have the highest average NIST
score, followed by Dutch and Spanish.
Figure 2 Automatic NIST scores for each ImageCLEF topic (stacked bar chart: one bar per topic, 1-50, stacking the NIST scores for Chinese, Dutch, German, French, Italian and Spanish)
Table 4 shows the translations with a zero NIST score where the reference and Systran translations have no
words which overlap. In many cases, however, this is simply because different words are used to express the
same concept, or lexical variations of the word (such as plurals) are used instead. For information retrieval, this
is important because if a simple word co-occurrence model is used with no lexical expansion, the queries may
not match documents (although in some cases the lexical variations will recover these). This highlights one of
the limitations of using mteval for assessing translation quality in CLIR, particularly when the queries are short.
| Language | Reference translation | Systran version | Manual score |
|---|---|---|---|
| Chinese | Woodland scenes | Forest scenery | 5 |
| Chinese | Scottish marching bands | no translation | 1 |
| Chinese | Tea rooms by the seaside | Seashore teahouse | 5 |
| Chinese | Portraits of Mary Queen of Scots | no translation | 1 |
| Chinese | Boats on Loch Lomond | In Luo river Mongolia lake ships | 2 |
| Chinese | Culross abbey | Karohs overhaul Daoist temple | 3 |
| Chinese | Road bridges | Highway bridge | 5 |
| Chinese | Ruined castles in England | Becomes the ruins the English castle | 4 |
| Chinese | Portraits of Robert Burns | no translation | 4 |
| Chinese | Glasgow before 1920 | no translation | 1 |
| Chinese | Male portraits | Men’s portrait | 5 |
| Chinese | The mountain Ben Nevis | Nepali Uygur peak | 2 |
| Chinese | Churches with tall spires | Has the high apex the churches | 4 |
| Chinese | Men holding tennis racquets | no translation | 1 |
| Chinese | A coat of arms | 纹章 | 1 |
| Chinese | British windmills | England’s windmill | 4 |
| Chinese | Waterfalls in Wales | Well’s waterfall | 2 |
| Chinese | Harvesting | Harvests | 5 |
| Chinese | Museum exhibits | no translation | 1 |
| French | Woodland scenes | Scenes of forests | 1 |
| French | Waterfalls in Wales | Water falls to the country of Scales | 1 |
| French | Harvesting | Harvest | 5 |
| French | Mountain scenery | Panorama mountaineer | 3 |
| German | Glasgow before 1920 | No translation | 1 |
| German | Male portraits | Portraets of men | 1 |
| German | Harvesting | Harvests | 5 |
| German | Welsh national dress | Walisi tract | 1 |
| German | Museum exhibits | Museumaustellungsstuecke | 1 |
| Italian | Woodland scenes | Scene of a forest | 5 |
| Italian | Tea rooms by the seaside | It knows it from te’ on lungomare | 1 |
| Italian | Wartime aviation | Air in time of war | 4 |
| Italian | People using spinning machines | Persons who use a filatoio | 5 |
| Italian | British windmills | English flour mills | 2 |
| Italian | Harvesting | Harvesters | 5 |
| Italian | Welsh national dress | Dressed traditional Welshman | 3 |
| Italian | People dancing | Persons who dance | 5 |
| Italian | Museum exhibits | Exposures in museums | 4 |
| Spanish | Woodland scenes | A forest | 5 |
| Spanish | Wartime aviation | Aviators in time military | 2 |
| Spanish | Male portraits | Picture of a man | 4 |
| Spanish | Museum exhibits | Objects of museum | 2 |
| Spanish | Mountain scenery | Vista of mountains | 1 |
| Dutch | Woodland scenes | bunch faces | 1 |
| Dutch | Road bridges | Viaducts | 4 |
| Dutch | Men cutting peat | Trurfstekers | 1 |
| Dutch | Harvesting | harvest | 2 |
| Dutch | Museum exhibits | Museumstukken | 1 |
| Dutch | Mountain scenery | Mount landscapes | 2 |

Table 4 Translations with a NIST score of 0
These differences also contribute to the lack of correlation between the manual and automatic assessments
(shown in Table 3). For Chinese, Systran sometimes produces no translation (given a manual score of 1), and there appear to be more cases where the translation has gone seriously wrong. For Dutch, erroneous translations are also caused by the incorrect translation of compounds (which also occurs in German).
The most highly correlated scores are between the assessments for French (using Spearman’s rho) suggesting
that topics which receive a high manual assessment also receive a high automatic score, thereby confirming the
use of an automatic evaluation tool to assess translation quality for CLIR (particularly for French).
The correlation between manual and automatic results is not consistent across languages, however, where, for
example, the correlation for Italian is lowest and not significant (at p<0.01). From Table 4, many Italian
translations are rated highly by manual assessment and the kinds of translations suggest that the problem derives
from the inability of mteval to determine semantic equivalents between translations.
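The correlations in Table 3 (and later Table 5) are Spearman rank correlations; a sketch using scipy is shown below, with placeholder per-topic scores rather than our actual data.

```python
from scipy.stats import spearmanr

# Placeholder per-topic values for one source language (not our actual data):
# a summed manual adequacy rating and a NIST score per topic.
manual_scores = [25, 12, 30, 18, 9, 27, 22, 15]
nist_scores = [3.1, 0.0, 4.2, 2.5, 0.0, 3.8, 2.9, 1.1]

rho, p_value = spearmanr(manual_scores, nist_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```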
4.2. Retrieval performance
Figure 3 shows a graph of recall versus precision across all topics and for each language using the strict
intersection5 set of ImageCLEF relevance judgments. The graph follows a typical pattern showing that as the
number of relevant documents found increases (recall), the precision decreases as relevant documents appear in
lower rank positions. As with other results from CLIR experiments, the monolingual results are higher than those
for translated queries, showing that these do not retrieve as well. Chinese has the lowest precision-recall curve,
and is noticeably lower than the rest of the languages which seem to bunch together and follow a similar shape.
The French curve is the highest of the translated languages, which matches Table 3, where French has (jointly) the highest average NIST score, the fewest topics with a zero NIST score, and a high proportion of topics with a high manual assessment rating.
Figure 3 Precision-recall graph for the Sheffield entry (precision against recall for the monolingual run and the German, Spanish, French, Italian, Dutch and Chinese runs)
Figure 4 provides a breakdown of average precision for each topic and the stacked bar chart shows average
precision for monolingual retrieval and mean average precision across all languages excluding English. Some
languages will perform better or worse on each topic (depending on the quality of translation), but the graph provides an overall indication for each topic, making analysis clearer. Across all languages (excluding English) and topics, the mean average precision is 0.420 (with a standard deviation of 0.23), which is on average 75% of monolingual performance (Table 5 shows the breakdown across languages).
Topics which perform poorly include 4 (seating inside a church), 5 (woodland scenes), 29 (wartime aviation), 41
(a coat of arms) and 48 (museum exhibits). These exhibit average NIST scores of 2.63, 0.64, 2.80, 3.71 and 3.83
respectively, and manual ratings of 3, 3.7, 4.17, 3.5 and 1.83 respectively. In some cases, the translation quality
is high, but the retrieval low, e.g. topic 29, because relevance assessment for cross language image retrieval is based upon both the image and the caption. There are cases when images are not relevant even though they contain query terms in the caption, e.g. the image is too small or too dark, the object of interest is obscured or in the background, or the caption contains words which do not describe the image contents (e.g. matches on fields such as the photographer, or notes which provide background meta-information).
5. Strict intersection is the smallest set of relevant documents, including only those which both assessors marked as relevant (not including those judged as partially relevant).
Figure 4 Monolingual average precision and MAP across systems (excluding English) for each topic (stacked bars per topic number showing the monolingual and non-English values)
Table 5 summarises retrieval performance for each language, and also shows the correlation between
manual/automatic assessment of translation quality and average precision for each language. We find that French
has the highest MAP score (78% monolingual), followed by German (75% monolingual) and Spanish (73%
monolingual). On average, MAP and translation quality are correlated (using Spearman’s rho at p<0.01) for both the manual and automatic assessments, which suggests that a higher quality of translation does give better
retrieval performance in general, particularly for Chinese, German and French (manual assessments) and Spanish,
French and Dutch (automatic assessments).
| Language | Mean Average Precision (MAP) | MAP-manual correlation | MAP-NIST correlation | % of monolingual |
|---|---|---|---|---|
| Chinese | 0.285 | 0.472* | 0.384* | 51% |
| Dutch | 0.390 | 0.412* | 0.426* | 69% |
| German | 0.423 | 0.503* | 0.324* | 75% |
| French | 0.438 | 0.460* | 0.456* | 78% |
| Italian | 0.405 | 0.394* | 0.378* | 72% |
| Spanish | 0.408 | -0.061 | 0.462* | 73% |
| Monolingual | 0.562 | - | - | - |

*correlation significant at p<0.01
Table 5 A summary of retrieval performance and its correlation with translation quality
We might expect MAP to correlate well with the NIST score for the GLASS system because both are based on
word co-occurrences, but it is interesting to note that retrieval effectiveness is correlated just as highly with the
manual assessments, even though correlation between the manual and automatic assessments is not always itself
high. This is useful as it shows that, for this CLIR task, the quality of translation has a significant impact on retrieval, so that retrieval effectiveness can, in general, be used as an indicator of translation quality. Remaining factors may
be due to relevance assessments, the IR system, pseudo relevance feedback or use of other retrieval-enhancing
methods.
5. Conclusions and future work
We have shown that cross language image retrieval for the ImageCLEF ad hoc task is possible with little or no
knowledge of CLIR and without the need for linguistic resources. Using Systran as a translation “black-box” requires little effort, but at the price of having no control over translation and no means of recovering when translation goes wrong. In particular, Systran provides only one translation, which may not be correct; CLIR would benefit if several alternatives were output. There are many cases when proper names are mistranslated,
words with diacritics not interpreted properly, and words translated incorrectly because of the limited degree of
context. Because the task of CLIR does not necessarily require syntactic correctness, we find Systran can be used
successfully for translation between a wide range of language pairs where essentially we make use of only the
large dictionaries maintained by Systran.
We evaluated the quality of translation using both manual assessments, and an automatic tool used extensively in
MT evaluation. We find that translation quality varies between source languages for Systran, based on both the manual and automatic scores, which are correlated (sometimes highly) for most languages. There are, however, limitations to the automatic tool whose removal would improve its correlation with query translation quality in CLIR evaluation, such as resolving literal equivalents for semantically similar terms, reducing words to their stems, removing function words, and perhaps using a different weighting scheme for query terms (e.g. weighting proper names highly). We aim to experiment further with semantic equivalents using WordNet, and also to assess whether the correlation between the manual and automatic scores can be improved by using longer n-gram lengths.
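As an illustration of the kind of normalisation we have in mind, the sketch below treats two words as equivalent if they share a stem or a WordNet synset before counting overlap; the stopword list is illustrative only and this is an outline of the idea rather than a finished method.

```python
# Requires the NLTK WordNet data: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "of", "a", "in", "by", "and", "on"}   # illustrative list
stemmer = PorterStemmer()

def equivalent(w1, w2):
    """Two words are treated as equivalent if their stems match or if they
    share at least one WordNet synset."""
    if stemmer.stem(w1) == stemmer.stem(w2):
        return True
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def soft_overlap(reference, translation):
    """Fraction of content words in the reference matched (softly) by the
    translation, ignoring function words."""
    ref = [w for w in reference.lower().split() if w not in STOPWORDS]
    hyp = [w for w in translation.lower().split() if w not in STOPWORDS]
    matched = sum(1 for r in ref if any(equivalent(r, h) for h in hyp))
    return matched / len(ref) if ref else 0.0

# "Harvesting" and "Harvests" both stem to "harvest", so the soft overlap is
# 1.0 even though exact word matching gives this pair a NIST score of 0.
print(soft_overlap("Harvesting", "Harvests"))
```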
Using a probabilistic retrieval system, we obtain a mean average precision score which is 75% of the
monolingual score. Although Chinese retrieval is lowest at 51%, this would still provide multi-lingual access to
the ImageCLEF test collection, albeit needing improvement. Also, given that the task is not purely text, but also
involves images, this score may be improved using content-based methods of retrieval. We aim to experiment
with pseudo relevance feedback, and in particular improve performance using query expansion based on
EuroWordnet, a European version of Wordnet.
As a retrieval task, we have shown that translation quality does affect retrieval performance because of the
correlation between manual assessments and retrieval performance, implying that in general, higher translation
quality results in higher retrieval performance. We have also shown that for some languages, the manual
assessments correlate well with the automatic assessment suggesting this method could be used to measure
translation quality given a CLIR test collection.
6. Acknowledgments
We would like to thank members of the Natural Language Processing group and Department of Information
Studies for their time and effort in producing manual assessments. Thanks also to Hideo Joho for help and
support with the GLASS system, and in particular his modified BM25 ranking algorithm, and thanks to NTU for
providing Chinese versions of the ImageCLEF titles. This work was carried out within the Eurovision project at
Sheffield University, funded by the EPSRC (Eurovision: GR/R56778/01).
7. References
[1] P. Clough and M. Sanderson. The CLEF 2003 cross language image retrieval task. In Proceedings of CLEF2003, 2003.
[2] Heisoft. How does Systran work? http://www.heisoft.de/volltext/systran/dok2/howworke.htm (site visited July 2003)
[3] National Institute of Standards and Technology (NIST). Automatic Evaluation of Machine Translation Quality Using N-
gram Co-Occurrence Statistics. 2002. http://www.nist.gov/speech/tests/mt/resources/scoring.htm
[4] C. Peters and M. Braschler. Cross-Language System Evaluation: The CLEF Campaigns. In Journal of the American
Society for Information Science and Technology, 52(12), 1067-1072, 2001.
[5] S. Robertson, S. Walker and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In NIST Special Publication 500-242: TREC-7, pp. 253-264, Gaithersburg, MD, 1998.
[6] Systran. The SYSTRAN linguistics platform: A software solution to manage multilingual corporate knowledge. White paper. 2002. http://www.systransoft.com/Technology/SLP.pdf
[7] E.M. Voorhees and D. Harman. Overview of TREC 2001, In Proceedings of TREC2001, NIST, 2001.