European Ad Hoc Retrieval Experiments with
     Hummingbird SearchServer™ at CLEF 2005
                                        Stephen Tomlinson
                                          Hummingbird
                                     Ottawa, Ontario, Canada
                               stephen.tomlinson@hummingbird.com
                                  http://www.hummingbird.com/

                                         August 21, 2005


                                               Abstract
      Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgar-
      ian, French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language
      Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50
      natural language queries, and the goal was to find all of the relevant documents (with
      high precision) in a particular document set. We conducted diagnostic experiments
      with different techniques for matching word variations and handling stopwords. We
      found that the experimental stemmers significantly increased mean average precision
      for the 4 languages. Analysis of individual topics found that the algorithmic Bulgar-
      ian and Hungarian stemmers encountered some unanticipated stopword collisions. A
      comparison to an experimental 4-gram technique suggested that Hungarian stemming
      would further benefit from decompounding. A blind feedback technique which sig-
      nificantly increased mean average precision for some languages was also significantly
      detrimental to the rank of the first relevant retrieved for one language.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval

General Terms
Measurement, Performance, Experimentation

Keywords
Bulgarian Retrieval, Hungarian Retrieval, First Relevant Score, Per-Topic Analysis


1    Introduction
Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.
   SearchServer works in Unicode internally [3] and supports most of the world’s major character
sets and languages. The major conferences in text retrieval experimentation (CLEF [2],
NTCIR [6] and TREC [11]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.

   ¹ SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.

                   Table 1: Sizes of CLEF 2005 Ad-Hoc Track Test Collections
    Language      Text Size (uncompressed)   Documents   Topics   Rel/Topic
    Portuguese    591,987,753 bytes            210,734       50   58 (lo 2, med 44, hi 239)
    French        508,863,606 bytes            177,452       50   51 (lo 1, med 35, hi 185)
    Bulgarian     216,432,023 bytes             69,195       49   16 (lo 1, med 10, hi 69)
    Hungarian     106,631,823 bytes             49,530       50   19 (lo 1, med 13, hi 87)

   This (draft) paper describes experimental work with SearchServer for the task of finding rel-
evant documents for natural language queries in 4 European languages (Bulgarian, French, Hun-
garian and Portuguese) using the CLEF 2005 Ad-Hoc Track test collections.


2      Methodology
2.1     Data
The CLEF 2005 Ad-Hoc Track document sets consisted of tagged (SGML-formatted) news articles
in 4 different languages: Bulgarian, French, Hungarian and Portuguese. Table 1 gives the sizes.
    The CLEF organizers created 50 natural language “topics” (numbered 251-300) and translated
them into many languages. One topic was discarded for Bulgarian because it had no relevant
documents. Table 1 gives the final number of topics for each language and their average number
of relevant documents (along with the lowest, median and highest number of relevant documents
of any topic). For more information on the CLEF test collections, see the track overview paper.

2.2     Indexing
Our indexing approach was mostly the same as last year [15]. Accents were not indexed except
for the combining breve in Bulgarian. The apostrophe was treated as a word separator for the 4
investigated languages. The custom text reader, cTREC, was updated to maintain support for
the CLEF guidelines of only indexing specifically tagged fields.
    Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these
experiments, the stop word list for Portuguese was based on the Porter list [7], and the lists for
Bulgarian and Hungarian were based on Savoy’s [9]. We used our own list for French.
    Unlike previous years, this year we added AL=“0-9” to the stopfiles to specify that the digits
0-9 were to be treated as alphabet characters (e.g. so that “G7” would be indexed as 1 term instead
of 2).
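    The effect of these tokenization choices can be illustrated with a minimal Python sketch. It
is not SearchServer's tokenizer: the function name `tokenize` and the flag `digits_are_letters`
are hypothetical, and the regular expressions are only an approximation of the behaviour
described above (apostrophe as a word separator, digits optionally treated as alphabet
characters, conversion to upper-case).

```python
import re

def tokenize(text, digits_are_letters=True):
    """Split text into upper-cased index terms (illustrative only).

    Any character that is neither a letter nor a digit, including the
    apostrophe, acts as a word separator. When digits_are_letters is
    False, runs of digits are split off as separate terms.
    """
    if digits_are_letters:
        pattern = r"[^\W_]+"            # runs of letters and digits together
    else:
        pattern = r"[^\W\d_]+|\d+"      # letter runs and digit runs separately
    return [term.upper() for term in re.findall(pattern, text)]

print(tokenize("G7"))                            # one term
print(tokenize("G7", digits_are_letters=False))  # two terms
print(tokenize("l'Europe"))                      # apostrophe separates terms
```

With the digits treated as alphabet characters, “G7” stays a single term; otherwise it splits
into “G” and “7”.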
    By default, the SearchServer index supports both exact matching (after some Unicode-based
normalizations, such as decompositions and conversion to upper-case) and morphological matching
(e.g. inflections, derivations and compounds, depending on the linguistic component used).
    For many languages (including French and Portuguese), SearchServer provides the option of
finding inflections based on lexical stemming (i.e. stemming based on a dictionary or lexicon for
the language). For example, in English, “baby”, “babied”, “babies”, “baby’s” and “babying” all have
“baby” as a stem. Specifying an inflected search for any of these terms will match all of the others.
The lexical stemming of the post-6.0 experimental development version of SearchServer used for
the experiments in this paper was based on internal stemming component 3.7.0.15. We treat each
linguistic component as a black box in this paper.
    Lexical stemming in SearchServer typically does “inflectional” stemming which generally retains
the part of speech (e.g. a plural of a noun is typically stemmed to the singular form). It typically
does not do “derivational” stemming which would often change the part of speech or the meaning
more substantially (e.g. “performer” is not stemmed to “perform”).
   Lexical stemming in SearchServer includes compound-splitting (decompounding) for compound
words in particular languages (such as Dutch, Finnish, German and Swedish). For example, in
German, “babykost” (baby food) has “baby” and “kost” as stems.
   Lexical stemmers can produce more than one stem, even for non-compound words. For ex-
ample, in English, “axes” has both “axe” and “axis” as stems (different meanings), and in French,
“important” has both “important” (adjective) and “importer” (verb) as stems (different parts of
speech). SearchServer records all the stem mappings at index-time to support maximum recall
and does so in a way to allow searching to weight some inflections higher than others.
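    As a rough illustration of recording stem mappings at index-time, the following Python
sketch uses a tiny hand-written lexicon. The lexicon, the function names and the data layout
are all assumptions made for the example; the actual linguistic component is treated as a black
box in this paper, and the sketch does not model the weighting of inflections.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> set of stems.
TOY_LEXICON = {
    "baby": {"baby"}, "babies": {"baby"}, "babied": {"baby"},
    "babying": {"baby"},
    "axe": {"axe"}, "axis": {"axis"}, "axes": {"axe", "axis"},
    "babykost": {"baby", "kost"},   # compound split into two stems
}

def stems(word):
    """Return all stems of a word; unknown words stem to themselves."""
    key = word.lower()
    return TOY_LEXICON.get(key, {key})

def build_stem_index(docs):
    """Map every stem of every word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            for stem in stems(word):
                index[stem].add(doc_id)
    return index

def inflected_search(index, term):
    """Return documents sharing at least one stem with the query term."""
    hits = set()
    for stem in stems(term):
        hits |= index.get(stem, set())
    return hits

docs = {1: ["babies"], 2: ["axis"], 3: ["axe"], 4: ["performer"]}
index = build_stem_index(docs)
print(inflected_search(index, "baby"))     # matches the document with "babies"
print(inflected_search(index, "axes"))     # matches both "axis" and "axe"
print(inflected_search(index, "perform"))  # no derivational match
```

Because “axes” maps to two stems, a search for it reaches documents containing either “axe” or
“axis”, which favours recall.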

2.3    Searching
We experimented with the SearchServer CONTAINS predicate. Our test application specified
SearchSQL to perform a boolean-OR of the query words. For example, for Bulgarian topic 279
whose Title was “Референдуми в Швейцария” (Swiss referendums), a corresponding SearchSQL
query would be:

SELECT RELEVANCE('2:3') AS REL, DOCNO
FROM CLEF05BG
WHERE FT_TEXT CONTAINS 'Референдуми'|'в'|'Швейцария'
ORDER BY REL DESC;

(Note that “в” is a stopword for Bulgarian so its inclusion in the query wouldn’t actually add
any matches.)
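    A test application could assemble such a statement from a topic title along the following
lines. This Python sketch is only an illustration: the helper name `build_searchsql` is
hypothetical, straight single quotes are assumed for SQL string literals, and splitting the
title on whitespace and apostrophes is an assumption chosen to mirror the indexing rules rather
than a description of our actual test application.

```python
import re

def build_searchsql(table, title, relevance_method="2:3"):
    """Build a boolean-OR SearchSQL query from the words of a topic title."""
    # Assumption: words are separated by whitespace or apostrophes.
    words = [w for w in re.split(r"[\s'’]+", title) if w]
    or_terms = "|".join("'" + w + "'" for w in words)
    return (
        f"SELECT RELEVANCE('{relevance_method}') AS REL, DOCNO\n"
        f"FROM {table}\n"
        f"WHERE FT_TEXT CONTAINS {or_terms}\n"
        "ORDER BY REL DESC;"
    )

print(build_searchsql("CLEF05BG", "Референдуми в Швейцария"))
```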
    Most aspects of the SearchServer relevance value calculation are the same as described last
year [15]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a
manner similar to Okapi [8] and dampens the inverse document frequency using an approximation
of the logarithm. These calculations are based on the stems of the terms (roughly speaking)
when doing morphological searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’
was previously specified). The SearchServer RELEVANCE_METHOD setting was set to ‘2:3’
and RELEVANCE_DLEN_IMP was set to 750 for all experiments in this paper.

2.4    Diagnostic Runs
For the diagnostic runs listed in Table 2, the run names consist of a language code (“BG” for
Bulgarian, “FR” for French, “HU” for Hungarian and “PT” for Portuguese) followed by one of the
following labels:

   • “lex”: (FR and PT only): The run used SearchServer lexical stemming. The /inflect option
     (SET TERM_GENERATOR ‘word!ftelp/inflect’) was specified.
   • “lexnos”: Same as “lex” except that /nostop was additionally specified which prevents query
     terms from being discarded if all of their stems are stopwords (note that stopwords themselves
     were still not found because they were not indexed).
   • “lexall”: Same as “lex” except that a separate index was used which did not stop any words
     from being indexed (specifying /nostop would make no difference with this index).
   • “lexsing”: Same as “lex” except that /single was additionally specified (so that just one
     stemming interpretation was used at search time).
   • “neu” (BG and HU only): Same as “lex” except that the experimental Neuchatel stemmer
     was used [9].
   • “neunos”: Same as “lexnos” except that the Neuchatel stemmer was used.
   • “neuall”: Same as “lexall” except that the Neuchatel stemmer was used.
   • “neuposs” (HU only): Same as “neu” except that the call to the remove_possessive function
     was skipped. (Prof. Savoy suggested to us that it was unclear if removing possessive pronouns
     was a good idea, which we interpreted as uncertainty about the remove_possessive function.)
   • “sn” (FR and PT only): Same as “lex” except that the Porter (Snowball) stemmer [7] was
     used.
   • “snru” (BG only): Same as “neu” except that the Porter (Snowball) stemmer for Russian
     was used.
   • “4gram”: Same as “lexall” except that the run used a different index which primarily consisted
     of the 4-grams of terms, e.g. the word ‘search’ would produce index terms of ‘sear’, ‘earc’
     and ‘arch’. No stemming was done; searching used the IS_ABOUT predicate (instead of
     the CONTAINS predicate) with morphological options disabled to search for the 4-grams of
     the query terms.
   • “none”: The run disabled morphological searching. (The run used the same index as “lex” for
     FR and PT and the same index as “neu” for HU and BG, but SET TERM_GENERATOR
     ‘’ was specified so that variations from stemming were not matched.)

               Table 2: Mean Scores of Diagnostic Title-only runs
Run          FRS     Success@1       Success@5       Success@10     MRR     MAP
BG-neuall    0.782   15/49 (31%)     38/49 (78%)     41/49 (84%)    0.500   0.255
BG-neunos    0.781   16/49 (33%)     38/49 (78%)     41/49 (84%)    0.507   0.263
BG-4gram     0.758   20/49 (41%)     32/49 (65%)     40/49 (82%)    0.525   0.264
BG-snru      0.757   17/49 (35%)     34/49 (69%)     40/49 (82%)    0.499   0.242
BG-neu       0.749   15/49 (31%)     35/49 (71%)     39/49 (80%)    0.476   0.259
BG-none      0.685   14/49 (29%)     30/49 (61%)     35/49 (71%)    0.440   0.195
FR-sn        0.820   27/50 (54%)     40/50 (80%)     43/50 (86%)    0.645   0.318
FR-lex       0.810   25/50 (50%)     39/50 (78%)     42/50 (84%)    0.618   0.302
FR-lexnos    0.810   25/50 (50%)     39/50 (78%)     42/50 (84%)    0.618   0.302
FR-lexall    0.810   25/50 (50%)     39/50 (78%)     43/50 (86%)    0.618   0.301
FR-4gram     0.809   24/50 (48%)     41/50 (82%)     43/50 (86%)    0.617   0.279
FR-lexsing   0.802   25/50 (50%)     39/50 (78%)     42/50 (84%)    0.615   0.299
FR-none      0.778   20/50 (40%)     38/50 (76%)     43/50 (86%)    0.549   0.232
HU-4gram     0.834   24/50 (48%)     39/50 (78%)     45/50 (90%)    0.619   0.341
HU-neunos    0.789   26/50 (52%)     36/50 (72%)     42/50 (84%)    0.625   0.287
HU-neuall    0.788   25/50 (50%)     37/50 (74%)     41/50 (82%)    0.614   0.280
HU-neu       0.788   25/50 (50%)     37/50 (74%)     42/50 (84%)    0.613   0.274
HU-neuposs   0.769   24/50 (48%)     36/50 (72%)     41/50 (82%)    0.588   0.271
HU-none      0.671   17/50 (34%)     30/50 (60%)     37/50 (74%)    0.464   0.184
PT-sn        0.892   30/50 (60%)     43/50 (86%)     47/50 (94%)    0.712   0.269
PT-lexall    0.865   30/50 (60%)     42/50 (84%)     46/50 (92%)    0.707   0.300
PT-lex       0.856   31/50 (62%)     42/50 (84%)     45/50 (90%)    0.714   0.300
PT-lexnos    0.856   31/50 (62%)     42/50 (84%)     45/50 (90%)    0.714   0.300
PT-lexsing   0.843   30/50 (60%)     40/50 (80%)     44/50 (88%)    0.699   0.290
PT-none      0.821   28/50 (56%)     39/50 (78%)     43/50 (86%)    0.662   0.246
PT-4gram     0.815   27/50 (54%)     41/50 (82%)     41/50 (82%)    0.662   0.231

   Note that all diagnostic runs just used the Title field of the topic.
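
   The 4-gram index terms used by the “4gram” runs can be sketched in a few lines of Python.
The function name `char_ngrams` is hypothetical, and keeping words shorter than 4 characters
as whole terms is an assumption; the description above only states that the index “primarily”
consisted of 4-grams.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word, in order."""
    if len(word) <= n:
        return [word]   # assumption: short words are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("search"))   # ['sear', 'earc', 'arch']
```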

2.5    Evaluation Measures
Traditionally in ad hoc retrieval experiments, the primary evaluation measure is “average preci-
sion”. For a topic, it is the average of the precision after each relevant document is retrieved (using
zero as the precision for relevant documents which are not retrieved). By convention, it is based on
the first 1000 retrieved documents for the topic. The score ranges from 0.0 (no relevants found) to
1.0 (all relevants found at the top of the list). Average precision takes into account both precision
and recall, and it is very good for detecting retrieval differences because even small differences in
the ranks of relevant documents affect the score. “Mean Average Precision” (MAP) is the mean of
the average precision scores over all of the topics (i.e. all topics are weighted equally).
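    The definition can be made concrete with a short Python sketch. The function names are
ours for illustration, and the ranked list is assumed to contain no duplicate documents.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average of the precision after each relevant document retrieved.

    Relevant documents that are not retrieved within the cutoff
    contribute a precision of zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("topic has no relevant documents")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(topics):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    scores = [average_precision(ranked, rel) for ranked, rel in topics]
    return sum(scores) / len(scores)

# Relevant documents at ranks 1 and 3, and one never retrieved:
# (1/1 + 2/3 + 0) / 3
print(average_precision(["a", "x", "b"], {"a", "b", "c"}))
```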
    If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal
Rank” (RR). For a topic, it is $1/r$ where r is the rank of the first row for which a desired page is
found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of
the reciprocal ranks over all the topics.
    An experimental measure introduced in this paper (along with the companion web retrieval
paper [12]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the
rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is
$1.08^{1-r}$ where r is the rank of the first row for which a desired page is found, or zero if a desired
page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of
1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank
3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS
is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53
and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of
potential result list reading the system saved the user to get to the first relevant, assuming that
users are less and less likely to continue reading as they get deeper into the result list.
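    The comparison of the two first-relevant measures can be checked numerically. In this
Python sketch the function names are illustrative, and `None` is used (as an assumption of the
sketch) to represent a topic for which no relevant document was retrieved.

```python
def reciprocal_rank(rank):
    """RR: 1/r for first relevant at rank r, zero if none was found."""
    return 0.0 if rank is None else 1.0 / rank

def first_relevant_score(rank, base=1.08):
    """FRS: base ** (1 - r) for first relevant at rank r, zero if none."""
    return 0.0 if rank is None else base ** (1 - rank)

for r in (1, 2, 3, 10):
    print(r, round(first_relevant_score(r), 2), round(reciprocal_rank(r), 2))

# Ranks at which FRS exceeds RR.
higher = [r for r in range(1, 200)
          if first_relevant_score(r) > reciprocal_rank(r)]
print(higher[0], higher[-1])   # 2 52
```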
    “Success@n” is the percentage of topics for which at least one relevant document was returned
in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval
differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s
impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and
Success@10.
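    Given the rank of the first relevant document for each topic, Success@n reduces to a
simple count. The Python sketch below is illustrative only; the function name and the use of
`None` for topics with no relevant document retrieved are assumptions.

```python
def success_at_n(first_relevant_ranks, n):
    """Fraction of topics whose first relevant document is in the top n."""
    found = sum(1 for r in first_relevant_ranks if r is not None and r <= n)
    return found / len(first_relevant_ranks)

ranks = [1, 3, None, 12]          # four hypothetical topics
print(success_at_n(ranks, 1))     # 0.25
print(success_at_n(ranks, 5))     # 0.5
print(success_at_n(ranks, 10))    # 0.5
```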

2.6    Statistical Significance Tables
For tables comparing 2 diagnostic runs (such as Table 3), the columns are as follows:
    • “Expt” specifies the experiment. The language code is given, followed by the labels of the
      2 runs being compared. The difference is the first run minus the second run. For example,
      “FR lex-none” specifies the difference of subtracting the scores of the French ‘none’ run from
      the French ‘lex’ run (of Table 2).

    • “∆MAP” is the difference of the mean average precision scores of the two runs being com-
      pared (and “∆FRS” is the difference of the (mean) FRS scores).
    • “95% Conf” is an approximate 95% confidence interval for the difference (calculated from
      plus/minus twice the standard error of the mean difference). If zero is not in the interval,
      the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of
      neutral impact (on average), though if the average difference is small (e.g. <0.020) it may
      still be too minor to be considered “significant” in the magnitude sense.
    • “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
      compared to the second run. These numbers should always add to the number of topics (49
      for Bulgarian, 50 for the others).
    • “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the
      topic number in brackets (the topic numbers range from 251 to 300). The first difference
      is the largest one of any topic (based on the absolute value). The third difference is the
      largest difference in the other direction (so the first and third differences give the range of
      differences observed in this experiment). The middle difference is the largest of the remaining
      differences (based on the absolute value).
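
    The columns described above could be computed from two sets of per-topic scores roughly
as follows. This Python sketch is not the code used to produce the tables: the function name
`compare_runs` and its return layout are hypothetical, the sample standard deviation (n − 1
denominator) is an assumption, and the handling of the case where no difference lies in the
opposite direction is a simplification.

```python
import math
import statistics

def compare_runs(scores_a, scores_b):
    """Summarise per-topic score differences (run A minus run B)."""
    diffs = {t: scores_a[t] - scores_b[t] for t in scores_a}
    values = list(diffs.values())
    delta = statistics.mean(values)
    # Assumption: sample standard deviation for the standard error.
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    conf = (delta - 2 * std_err, delta + 2 * std_err)

    higher = sum(1 for v in values if v > 0)
    lower = sum(1 for v in values if v < 0)
    tied = len(values) - higher - lower

    by_size = sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    first = by_size[0]                       # largest absolute difference
    opposite = [kv for kv in by_size if kv[1] * first[1] < 0]
    third = opposite[0] if opposite else None  # largest in other direction
    used = {first[0]} | ({third[0]} if third else set())
    rest = [kv for kv in by_size if kv[0] not in used]
    extremes = [first] + rest[:1] + ([third] if third else [])

    return {
        "delta": delta,
        "conf": conf,
        "significant": not (conf[0] <= 0.0 <= conf[1]),
        "vs": (higher, lower, tied),
        "extremes": extremes,
    }

run_a = {251: 0.9, 252: 0.2, 253: 0.5, 254: 0.4}   # hypothetical scores
run_b = {251: 0.1, 252: 0.5, 253: 0.5, 254: 0.3}
result = compare_runs(run_a, run_b)
print(result["vs"], [topic for topic, _ in result["extremes"]])
```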


3     Results of Morphological Experiments
In the per-topic analysis, the official topic translations were used as much as possible. Online
translation services were consulted at times ([5] was sometimes helpful for Hungarian, and we
found the Russian-to-English translations at [1] often worked for Bulgarian). Prof. Savoy also
assisted with some Bulgarian words. But any translation errors are the responsibility of the
author.

3.1     Impact of Stemming
Table 3 isolates the impact of stemming on the average precision measure (e.g. “FR lex-none” is the
difference of the “FR-lex” and “FR-none” runs of Table 2). For each of the 4 languages, the increase
in mean average precision was statistically significant (i.e. zero was not in the approximate 95%
confidence interval). In FRS, there was higher variance, and only the increase for Hungarian was
statistically significant. Note that for some queries, it was still better to only match the original
query form (not variations from stemming); SearchServer allows this option to be controlled for
each query term at search-time.
    Table 3 shows that topic 279 (Swiss referendums) was substantially affected by stemming for
all 4 languages, so we examine it for each language:

    • HU-279 (Svájci népszavazások): Without Hungarian stemming, no document contained both
      of the query terms. No relevant document contained the query word ‘népszavazások’. Only
      some of the relevant documents even contained ‘Svájci’ (and lots of non-relevants also did).
      With stemming, average precision was 87 points higher from extra matches such as ‘svájciak’,
      ‘Svájc’, ‘Svájcban’, ‘Svájcot’, ‘Svájcról’, ‘népszavazáson’, ‘népszavazás’, ‘népszavazást’ and
      ‘népszavazással’.
    • BG-279 (Референдуми в Швейцария): With Bulgarian stemming, average precision was 58
      points higher from extra matches for ‘referendums’ such as референдум and референдума.
    • PT-279 (Referendos suíços): The query word ‘suíços’ was common in the relevant documents,
      but many relevants just used ‘referendo’ and not the query word ‘referendos’. Average
      precision was 35 points higher with Portuguese stemming; extra matches included ‘referendo’,
      ‘suíço’, ‘suíça’ and ‘suíças’.
    • FR-279 (Référendums en Suisse): This French topic scored lower with stemming (the rank
      of the first relevant fell from 1 to 13, and average precision fell from 0.10 to 0.01). It
      appears that the relevant documents were more likely to use the plural ‘Référendums’ than
      the singular ‘Référendum’, and the latter was a more common word which generated lots of
      matches when stemming.

           Table 3: Impact of Stemming on Average Precision and First Relevant Score
  Expt              ∆MAP        95% Conf        vs.            3 Extreme Diffs (Topic)
  HU-neu-none       0.090     ( 0.038, 0.143)   32-11-7     0.87 (279), 0.77 (294), −0.12 (265)
  FR-lex-none       0.070     ( 0.028, 0.112)   29-16-5     0.53 (297), 0.45 (284), −0.12 (275)
  BG-neu-none       0.064     ( 0.005, 0.123)   29-15-5     0.90 (271), 0.58 (279), −0.50 (258)
  PT-lex-none       0.054     ( 0.027, 0.080)   34-13-3     0.35 (279), 0.30 (286), −0.09 (296)
                    ∆FRS
  HU-neu-none       0.117     ( 0.024, 0.209)   19-10-21    1.00 (271), 0.98 (294), −0.83 (262)
  BG-neu-none       0.064    (−0.042, 0.170)    16-17-16    0.96 (294), 0.86 (269), −0.87 (273)
  PT-lex-none       0.035    (−0.017, 0.087)     12-7-31    0.69 (263), 0.60 (254), −0.54 (282)
  FR-lex-none       0.033    (−0.032, 0.097)     15-8-27    0.73 (276), 0.64 (284), −0.60 (279)


         Table 4: Impact of /nostop Option on Average Precision and First Relevant Score
   Expt             ∆MAP        95% Conf          vs.           3 Extreme Diffs (Topic)
   HU-nos-neu        0.013   (−0.005, 0.031)    3-1-46     0.40 (292), 0.13 (265), −0.03 (282)
   BG-nos-neu        0.005   (−0.003, 0.012)    2-2-45     0.17 (273), 0.06 (267), −0.01 (257)
   FR-nos-lex        0.000        n/a           0-0-50      0.00 (276), 0.00 (252), 0.00 (300)
   PT-nos-lex        0.000        n/a           0-0-50      0.00 (276), 0.00 (252), 0.00 (300)
                    ∆FRS
   BG-nos-neu       0.031    (−0.010, 0.072)    3-1-45     0.80 (273), 0.57 (267), −0.05 (257)
   HU-nos-neu       0.001    (−0.014, 0.015)    1-1-48     0.26 (292), 0.00 (253), −0.23 (282)


3.2      Impact of Experimental /nostop Option
Table 4 isolates the impact of using the SearchServer /nostop option. The option had no effect on
the 50 French and Portuguese topics, and it affected only a few of the Bulgarian and Hungarian
topics. The /nostop option prevents query terms from being discarded if all of their stems are
stopwords (note that stopwords themselves are still not found because they are not indexed). The
default is to not use /nostop because past experiments otherwise found a lot of spurious matches
in some languages (such as Finnish and Korean). We investigate some of the topics flagged in
Table 4:

   • HU-265 (A Deutsche Bank szerzeményei (Deutsche Bank Takeovers)): The query word
     ‘Bank’ stemmed to ‘ban’ (in) which was a stopword, so by default, the word ‘Bank’ was
     not matched in the documents. With the /nostop option, ‘Bank’ was matched and average
     precision was 13 points higher. (Incidentally, this issue is presumably why Table 3 shows
     that stemming scored 12 points lower on HU-265; without stemming, ‘Bank’ was found in
     the documents.) Perhaps this issue would not have arisen with a lexical stemmer which
     would preserve the meaning more closely.
   • HU-292 (Német városok újjáépítése (Rebuilding German Cities)): The query word ‘Német’
     (German) stemmed to ‘nem’ (not) which was a stopword and so this useful word was dropped
     from the query by default. With the /nostop option, average precision was 40 points higher.
   • HU-282 (Elítéltekkel szembeni durva bánásmód (Prison Abuse)): In this topic, the default
     scored higher. Using /nostop changed the rank of the first relevant from 3 to 7. The
     stopword list contained ‘szemben’ (in front of), and the query word ‘szembeni’ presumably
     is a related noise word, and discarding it was useful. The /nostop option kept ‘szembeni’,
     which only occurred in 319 documents, so it had a high enough weighting from inverse
     document frequency to hurt precision.
   • BG-273 (Разширяването на НАТО (NATO Expansion)): НАТО (NATO) stemmed to НА
     (on) which was a stopword, so the default behaviour removed a key word from the query.
     With /nostop, the first relevant score was 80 points higher.
   • BG-267 (Най-добрите чуждоезикови филми (Best Foreign Language Films)): The query
     word филми (films) stemmed to филм (film) which surprisingly was a stopword, so the
     default behaviour discarded a key query term. Our supplier [9] has confirmed that this was
     an error in the Bulgarian stopword list.
   • BG-257 (Етническото прочистване на Балканите (Ethnic Cleansing in the Balkans)): The
     query word Балканите (Balkans) stemmed to балкан (Balkan mountain) which surprisingly
     was a stopword. Even though it turned out that precision was a little higher without the
     Balkans term in this case, in general this appears to be another error in the stopword list.

    In 3 of the topics we examined, the default behaviour dropped useful terms, possibly because
the stemmers for Bulgarian and Hungarian were algorithmic instead of lexical (a lexical
stemmer typically does not change the meaning of a word, except when words are ambiguous). It
appears that for algorithmic stemmers it may be better to use the /nostop option by default.
    In another 2 cases, it appears the stoplist was in error, which illustrates the usefulness of the
CLEF judged test collections: they enable an analyst who does not understand a language to find
issues in a resource for the language and make inferences about its quality.
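    The default term-dropping behaviour and the effect of /nostop, as described in this section,
can be modelled with a small Python sketch. It is only a model of the described behaviour, not
SearchServer code: the function name, the toy stem table and the toy stopword list are all
hypothetical.

```python
# Hypothetical toy data modelled on HU-292 and HU-265.
TOY_STEMS = {"német": {"nem"}, "bank": {"ban"}}
TOY_STOPWORDS = {"nem", "ban", "a"}

def toy_stem(term):
    """Return the stems of a term; unknown terms stem to themselves."""
    key = term.lower()
    return TOY_STEMS.get(key, {key})

def filter_query_terms(terms, stopwords, nostop=False):
    """Drop query terms whose stems are all stopwords, unless nostop is set."""
    kept = []
    for term in terms:
        if not nostop and all(s in stopwords for s in toy_stem(term)):
            continue   # default: the term is discarded from the query
        kept.append(term)
    return kept

query = ["Német", "városok", "újjáépítése"]
print(filter_query_terms(query, TOY_STOPWORDS))               # 'Német' dropped
print(filter_query_terms(query, TOY_STOPWORDS, nostop=True))  # all terms kept
```

A literal stopword kept by the `nostop` setting would still match nothing in this model, since
stopwords are absent from the index.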

3.3    Impact of Indexing All Words
Table 5 isolates the impact of indexing all words (i.e. of not using a stopword list). None of
the mean differences were statistically significant, but Bulgarian and Hungarian had some large
per-topic differences in average precision which we investigate:

   • HU-292 (Német városok újjáépítése (Rebuilding German Cities)): We saw earlier that this
     topic benefitted from the /nostop option (average precision up 40 points), but when index-
     ing all words, average precision fell back (33 points). The reason was that the common
     word ‘nem’ (not) was now indexed, so ‘Német’ (German), which stems to ‘nem’ with the
     algorithmic stemmer, had a much lower inverse document frequency than before, and this
     useful word received less weight. (Even if it had received more weight, there would have
     been potential confusion with all the indexed occurrences of ‘nem’.)
   • BG-271 (Бракове между хомосексуални (Gay Marriages)): The stopword между (between)
     was not in the 2 relevant documents. When it was indexed, its inclusion caused some non-
     relevants to be preferred, and average precision dropped 55 points.
   • BG-295 (Пране на пари (Money Laundering)): This topic scored higher when indexing all
     words. Surprisingly, the word пари (money) was a stopword, presumably another error (the
     Bulgarian stoplist apparently needs a review). It seems fine that на (on) was a stopword.
    Table 5: Impact of Indexing All Words on Average Precision and First Relevant Score
 Expt          ∆MAP         95% Conf         vs.             3 Extreme Diffs (Topic)
 PT-all-nos      −0.000      (−0.003, 0.002)     18-17-15      0.03 (280), −0.01 (259), −0.02 (282)
 FR-all-nos      −0.001      (−0.005, 0.003)      24-17-9       −0.07 (262), 0.01 (290), 0.01 (289)
 HU-all-nos      −0.006      (−0.021, 0.008)       7-7-36      −0.33 (292), −0.05 (265), 0.05 (274)
 BG-all-nos      −0.008      (−0.034, 0.018)     16-17-16      −0.55 (271), −0.14 (268), 0.20 (295)
                 ∆FRS
 PT-all-nos      0.009       (−0.007, 0.025)       5-1-44       0.38 (282), 0.06 (263), −0.07 (291)
 BG-all-nos      0.001       (−0.008, 0.010)       3-4-42      0.13 (263), −0.07 (268), −0.07 (271)
 FR-all-nos      −0.000      (−0.009, 0.008)       4-4-42      0.10 (286), −0.09 (258), −0.09 (288)
 HU-all-nos      −0.000      (−0.010, 0.009)       1-3-46      0.16 (282), −0.04 (299), −0.14 (292)


            Table 6: 4-grams vs. Stems in Average Precision and First Relevant Score
 Expt           ∆MAP          95% Conf          vs.             3 Extreme Diffs (Topic)
 HU-4gr-all       0.060       ( 0.018, 0.103)      32-17-1      0.46 (255), 0.33 (292), −0.30 (283)
 BG-4gr-all       0.009      (−0.028, 0.046)       25-24-0      0.50 (258), 0.25 (254), −0.33 (285)
 FR-4gr-all      −0.021      (−0.048, 0.005)       18-31-1      0.25 (291), 0.22 (263), −0.20 (273)
 PT-4gr-all      −0.068     (−0.104,−0.032)        14-35-1      −0.43 (259), −0.28 (286), 0.22 (297)
                 ∆FRS
 HU-4gr-all       0.046      (−0.036, 0.128)      15-15-20      1.00 (286), 0.93 (261), −0.81 (251)
 FR-4gr-all      −0.001      (−0.041, 0.039)      13-15-22      0.60 (279), 0.26 (281), −0.40 (259)
 BG-4gr-all      −0.024      (−0.093, 0.045)      17-14-18       −0.82 (274), 0.56 (270), 0.59 (288)
 PT-4gr-all      −0.051      (−0.134, 0.032)       7-17-26      −1.00 (259), −0.83 (292), 0.96 (260)


    In practice, indexing all words may not be so troublesome because it is typically easy for users
to omit noise words from the query, and stemming issues can be worked around by disabling the
finding of word variants (SearchServer makes it optional at search-time).

3.4     Comparison to 4-grams
Compound words appear to be fairly common in Hungarian, but the algorithmic stemmer did not
perform decompounding, a technique we have found to be useful for languages such as Finnish [15].
However, [4] has found that using 4-grams as index terms works well in ad hoc ranking experiments
for many European languages, including compound-word languages. Table 6 compares our 4-gram
runs to the stemming runs which indexed all words (because we did not use stopwords with our 4-
gram index). As anticipated, there was a statistically significant increase in mean average precision
for Hungarian, though there was a decrease for Portuguese which was also statistically significant.
We look at the largest per-topic differences for Hungarian:

   • HU-255 (Internetfüggők (Internet Junkies)): Average precision was 46 points higher with
     4-grams for this topic (a compound word). The stemmer found the 3 relevant documents
     which contained ‘internetfüggő’ or the original query word ‘internetfüggők’. 4-grams matched
     other variants such as ‘Internetfüggőség’ (Internet dependence), ‘internetfüggőséggel’ and
     ‘internetfüggőségben’ and found all 6 relevant documents. 4-grams also matched other po-
     tentially helpful words such as ‘internet’, ‘internetezők’, ‘internetezés’, ‘komputerfüggőséget’
     and ‘függővé’. But 4-grams also produced unwanted matches, such as ‘intervallum’ (inter-
     val) and ‘Szinte’ (as good as); these both came from the 4-gram ‘inte’. If the stemmer had
     just additionally matched ‘Internetfüggőség’, all 6 relevants would have been found, but we’re
     still investigating if the -seg suffix is one that a Hungarian stemmer should generally remove
     or not.
   • HU-292 (Német városok újjáépítése (Rebuilding German Cities)): On this topic, 4-grams
     still just found 1 of the 2 relevant documents, but it moved it from rank 3 to 1 (compared to
     the stemming run). While 4-grams additionally matched ‘újjáépítik’, the bigger advantage
     was probably that the 4-gram method did not match ‘nem’ which we know from earlier was
     a troublesome match for the stemming run.
   • HU-283 (James Bond-filmek (James Bond Films)): On this topic, the 4-gram run scored 30
     points lower in average precision than the stemming run. The 4-gram run favored documents
     with the ‘filmek’ pattern (which corresponded to three 4-grams (‘film’, ‘ilme’ and ‘lmek’)
     and so it received roughly 3 times the weight compared to the stemming run). However,
     the relevant documents tended not to use ‘filmek’; instead they tended to use other variants
     matched by the stemmer such as ‘film’, ‘filmet’, ‘filmnél’, ‘filmben’ and ‘filmhez’.
   • HU-286 (Futballsérülések (Football Injuries)): This topic had no matches in the stemming
     run, but a relevant document was ranked first in the 4-gram run. 4-gram matches in the
     relevant documents included ‘futballista’, ‘futballkapus’ (goalkeeper), ‘futballválogatott’,
     ‘vállsérülést’, ‘vállsérüléssel’, ‘vállsérülés’, ‘sérülés’ (injury), ‘sérült’ and ‘sérültet’. This
     might be a case for which decompounding would be helpful.
   • HU-261 (Jövendőmondás (Fortune-telling)): The stemming run only matched the one
     document which contained ‘jövendőmondást’ and ‘jövendőmondás’ and it was judged
     non-relevant, so it scored 0 on this topic. The 4-gram run returned 1 of the 3 relevant
     documents at rank 2 (the others weren’t ranked in the top 100). Matches in the relevant
     document included
     ‘jövendölők’ and ‘jövendőmondók’. The latter of these perhaps could have been matched with
     additional stemming rules, but the former would require a stemmer to do decompounding
     (or, if the user had decompounded the query, the latter would require index-time decom-
     pounding to match).
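
    The 4-gram behaviour described in these topics can be illustrated with a minimal sketch.
It assumes plain overlapping character 4-grams over lowercased words; the function name
char_ngrams is hypothetical, and the experimental technique may have treated short words and
word boundaries differently.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a lowercased word.

    Assumption for illustration only: words no longer than n are kept whole.
    """
    w = word.lower()
    if len(w) <= n:
        return [w]
    return [w[i:i + n] for i in range(len(w) - n + 1)]


# 'filmek' yields three 4-grams, so a document containing 'filmek'
# matches three query terms instead of one.
print(char_ngrams("filmek"))

# 4-grams shared between the HU-255 query word and some document words.
query_grams = set(char_ngrams("Internetfüggők"))
for doc_word in ["Internetfüggőség", "intervallum", "Szinte", "futball"]:
    shared = sorted(query_grams & set(char_ngrams(doc_word)))
    print(doc_word, shared)
```

    The shared 4-gram ‘inte’ accounts for both the helpful match on ‘Internetfüggőség’ and the
unwanted matches on ‘intervallum’ and ‘Szinte’.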

    SearchServer can find character sequences inside European words without n-gramming if the
user specifies wildcards, so for precise searches it’s unclear if n-gram indexes would add value.
N-gram approaches typically produce larger indexes and their queries can be slower for common
word-searching cases. We’re not aware of them being used in practice for European language
retrieval, except perhaps by web search engines for URL indexing.

3.5    Comparison to Alternate Stemmers
Table 7 compares alternate stemming approaches to the approach we used in our submitted runs.
Unfortunately, we have run out of time to examine more topics in detail for this draft paper, but we
note in particular that it seems not to matter very much on average whether the remove_possessive
function of the Hungarian stemmer is called or not.
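    As a reading aid for the comparison tables, the following sketch computes the kind of
per-topic summary they report. It assumes that the columns are the mean per-topic difference,
an approximate 95% confidence interval for that mean, and the numbers of topics on which the
experimental run scored higher, lower and tied. The normal approximation used for the interval
is an assumption and may differ from the method used for the tables; the function name and the
scores are made up for illustration.

```python
import math
from statistics import mean, stdev


def summarize_differences(experimental, baseline):
    """Summarize paired per-topic scores of two runs.

    Returns (mean difference, (low, high) interval, (higher, lower, tied)).
    The interval is a normal approximation (mean +/- 1.96 standard errors),
    which is an assumption made for this sketch.
    """
    diffs = [e - b for e, b in zip(experimental, baseline)]
    n = len(diffs)
    d = mean(diffs)
    half_width = 1.96 * stdev(diffs) / math.sqrt(n)
    higher = sum(1 for x in diffs if x > 0)
    lower = sum(1 for x in diffs if x < 0)
    tied = n - higher - lower
    return d, (d - half_width, d + half_width), (higher, lower, tied)


# Made-up average precision scores for four topics (not from the paper).
experimental_run = [0.5, 0.2, 0.3, 0.4]
baseline_run = [0.4, 0.2, 0.5, 0.1]
print(summarize_differences(experimental_run, baseline_run))
```

    Under this reading, a difference is statistically significant at the 5% level when the
interval excludes zero.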

3.6    Impact of /single Option
Table 8 isolates the impact of using the SearchServer /single option. This option only makes
a difference for the SearchServer lexical stemmers which can produce more than one stem for a
term. Like last year [15], our method for including all stems without overweighting some of the
terms apparently was effective. Even in the high-variance first relevant score measure, the bigger
differences favored including all stems.

  Table 7: Alternate Stemming vs. Baseline in Average Precision and First Relevant Score
Expt           ∆MAP       95% Conf            vs.             3 Extreme Diffs (Topic)
FR-sn-lex      0.017       ( 0.001, 0.032)   20-16-14    0.29 (291), 0.15 (287), −0.08 (278)
HU-poss-neu    −0.003     (−0.017, 0.012)    18-9-23     −0.27 (268), 0.11 (258), 0.13 (262)
BG-snru-neu    −0.017     (−0.064, 0.029)    19-25-5    −0.64 (259), −0.44 (271), 0.50 (258)
PT-sn-lex      −0.031    (−0.060,−0.001)     21-23-6    −0.41 (279), −0.28 (286), 0.21 (274)
               ∆FRS
PT-sn-lex      0.036      (−0.024, 0.096)     10-8-32   0.96 (260), 0.49 (300), −0.59 (292)
FR-sn-lex      0.010      (−0.005, 0.025)      7-7-36    0.19 (252), 0.16 (299), −0.12 (251)
BG-snru-neu    0.008      (−0.070, 0.086)    14-13-22   0.87 (273), 0.84 (270), −0.79 (280)
HU-poss-neu    −0.019     (−0.078, 0.040)      4-5-41   −0.95 (265), −0.68 (270), 0.69 (262)


       Table 8: Impact of /single Option on Average Precision and First Relevant Score
Expt           ∆MAP          95% Conf          vs.            3 Extreme Diffs (Topic)
FR-sing-lex     −0.002     (−0.011, 0.007)   8-7-35     −0.15 (297), −0.10 (284), 0.11 (263)
PT-sing-lex     −0.010    (−0.018,−0.002)    8-11-31    −0.10 (292), −0.10 (275), 0.02 (298)
                ∆FRS
FR-sing-lex     −0.009    (−0.025, 0.008)     1-2-47    −0.40 (259), −0.06 (284), 0.03 (299)
PT-sing-lex     −0.013    (−0.037, 0.011)     1-3-46    −0.59 (292), −0.07 (275), 0.06 (267)


                             Table 9: Mean Scores of Submitted Runs
    Run             FRS        Success@1      Success@5       Success@10          MRR       MAP
    humBG05t        0.749     15/49 (31%)      35/49 (71%)       39/49 (80%)      0.476     0.259
    humBG05td       0.815     18/49 (37%)      39/49 (80%)       42/49 (86%)      0.537     0.275
    humBG05tde      0.752     21/49 (43%)      35/49 (71%)       38/49 (78%)      0.549     0.298
    humFR05t        0.810     25/50 (50%)      39/50 (78%)       42/50 (84%)      0.618     0.302
    humFR05td       0.825     30/50 (60%)      39/50 (78%)       41/50 (82%)      0.686     0.369
    humFR05tde      0.822     31/50 (62%)      40/50 (80%)       41/50 (82%)      0.697     0.401
    humHU05t        0.788     25/50 (50%)      37/50 (74%)       42/50 (84%)      0.613     0.274
    humHU05td       0.838     23/50 (46%)      41/50 (82%)       43/50 (86%)      0.614     0.306
    humHU05tde      0.835     22/50 (44%)      38/50 (76%)       45/50 (90%)      0.602     0.331
    humPT05t        0.856     31/50 (62%)      42/50 (84%)       45/50 (90%)      0.714     0.300
    humPT05td       0.939     35/50 (70%)      48/50 (96%)       49/50 (98%)      0.805     0.357
    humPT05tde      0.925     35/50 (70%)      47/50 (94%)       48/50 (96%)      0.799     0.386


4     Submitted Runs
Table 9 lists the mean scores of the runs submitted for assessment in May 2005. In the identifiers
(e.g. “humFR05tde”), ‘t’ and ‘d’ indicate that the Title and Description fields of the topic were used
(respectively), and ‘e’ indicates that query expansion from blind feedback on the first 2 rows was
used (see the 2003 paper [14] for more details). From the Description fields for Bulgarian, French
and Portuguese, instruction words such as “find”, “relevant” and “document” were automatically
removed (based on looking at some older topic lists, not this year’s topics; this step was skipped
for Hungarian because we lacked an older topic list).
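    The removal of instruction words can be sketched as a simple filter. This is an
illustration only: the word list below is a hypothetical English stand-in (the actual lists
were built per language from older topic lists), and the function name and the sample
Description are made up.

```python
import re

# Hypothetical instruction-word list; "documents" is added for illustration.
INSTRUCTION_WORDS = {"find", "relevant", "document", "documents"}


def strip_instruction_words(description):
    """Lowercase the Description, split it into words and drop instruction words."""
    tokens = re.findall(r"\w+", description.lower())
    return [t for t in tokens if t not in INSTRUCTION_WORDS]


# Made-up Description text, not taken from the CLEF topics.
print(strip_instruction_words("Find documents about rebuilding German cities."))
```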
    The submitted French and Portuguese Title-only runs (i.e. “humFR05t” and “humPT05t” of
Table 9) correspond to the “lex” diagnostic runs (i.e. “FR-lex” and “PT-lex” of Table 2) except that
the submitted runs used an older experimental version of SearchServer (though there don’t appear
to have been any differences that affected the runs). The submitted Bulgarian and Hungarian
Title-only runs (i.e. “humBG05t” and “humHU05t” of Table 9) correspond to the “neu” diagnostic
runs (i.e. “BG-neu” and “HU-neu” of Table 2).
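    The Success@n and MRR columns of Table 9 can be computed from the rank of the first
relevant document for each topic. The sketch below uses the usual definitions of these
measures (the fraction of topics with a relevant document in the first n rows, and the mean
of the reciprocal rank of the first relevant document, counting 0 when none was retrieved).
The function names and the ranks are made up for illustration.

```python
def success_at(first_relevant_ranks, n):
    """Fraction of topics whose first relevant document is in the first n rows."""
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= n)
    return hits / len(first_relevant_ranks)


def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank of the first relevant document (0 if none was retrieved)."""
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)


# Made-up ranks for five topics; None means no relevant document was retrieved.
ranks = [1, 3, None, 12, 2]
for n in (1, 5, 10):
    print("Success@%d = %.2f" % (n, success_at(ranks, n)))
print("MRR = %.3f" % mean_reciprocal_rank(ranks))
```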

4.1       Impact of Adding the Description Field
Table 10 isolates the impact of adding the Description field to the query. Though adding the
Description tended to increase the scores on average (and in some cases this result was statistically
significant), one should keep in mind that the Description often repeated the Title words, which
hence received twice the weight in the combined query. We would expect to see more variance if
the Title were replaced by the Description instead of being augmented by it.
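    The doubling effect can be seen by counting term occurrences in a combined query. This
sketch only counts raw occurrences; how SearchServer turns occurrences into weights is not
modelled here, and the function name and the Description text are made up for illustration.

```python
from collections import Counter


def combined_query_terms(title, description):
    """Count how often each lowercased word occurs in Title plus Description."""
    return Counter(title.lower().split()) + Counter(description.lower().split())


# Made-up Description that repeats the Title words.
counts = combined_query_terms("James Bond films", "Find documents about James Bond films")
print(counts)
```

    In this made-up case each Title word occurs twice in the combined query, while the words
that appear only in the Description occur once.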

4.2       Impact of Blind Feedback
Table 11 isolates the impact of the blind feedback technique (based on using the first 2 returned
rows to expand the query). While mean average precision increased for all 4 languages (and the
increase was statistically significant for French, Portuguese and Hungarian), the first relevant
score decreased for all 4 languages (and the decrease was statistically significant for Bulgarian,
the one language for which the increase in mean average precision was not).
    The blind feedback technique presumably works best if relevant documents appear in the first
2 rows, in which case first relevant score cannot be improved. If the first 2 rows do not contain
relevant documents, then using those rows to expand the query may hurt the query and push
down the first relevant even further.
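    A generic blind feedback step can be sketched as follows. This is not SearchServer’s
actual technique (see [14] for that); it is a toy version that appends the most frequent new
terms from the first 2 returned rows, and the function name, parameters and documents are
hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_rows=2, extra_terms=2):
    """Append the most frequent new terms found in the first `top_rows` documents.

    `ranked_docs` is a list of documents in rank order, each a list of terms.
    """
    seen = set(query_terms)
    counts = Counter(
        term
        for doc in ranked_docs[:top_rows]
        for term in doc
        if term not in seen
    )
    return list(query_terms) + [term for term, _ in counts.most_common(extra_terms)]


# Made-up ranked result list for a made-up query.
ranked = [
    ["football", "goalkeeper", "shoulder", "injury", "injury"],
    ["injury", "goalkeeper", "match"],
    ["cooking", "recipe"],
]
print(expand_query(["football", "injuries"], ranked))
```

    If the first 2 rows are on topic, the added terms are likely to be helpful; if they are
not, the added terms come from non-relevant documents and can pull the query off topic.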
    This result may explain in part why blind feedback techniques are not known to be used
in practice even though they have been popular with experimenters for several years in ad hoc
evaluations (which typically focus on mean average precision).

        Table 10: Impact of Description on Average Precision and First Relevant Score
   Expt       ∆MAP     95% Conf           vs.        3 Extreme Diffs (Topic)
   FR-td-t    0.068    ( 0.030, 0.105)    35-14-1    0.61 (256), 0.33 (281), −0.18 (277)
   PT-td-t    0.057    ( 0.008, 0.107)    31-18-1    −0.45 (258), 0.34 (299), 0.34 (264)
   HU-td-t    0.031    (−0.002, 0.065)    33-17-0    0.33 (286), 0.31 (290), −0.23 (274)
   BG-td-t    0.016    (−0.034, 0.066)    29-19-1    −0.68 (271), −0.38 (277), 0.30 (294)
              ∆FRS
   PT-td-t    0.083    ( 0.005, 0.160)    18-10-22   1.00 (272), 0.86 (288), −0.50 (258)
   BG-td-t    0.065    (−0.010, 0.141)    23-15-11   0.80 (273), 0.70 (286), −0.53 (278)
   HU-td-t    0.049    (−0.026, 0.125)    16-16-18   1.00 (286), 0.93 (261), −0.59 (282)
   FR-td-t    0.014    (−0.033, 0.062)    17-10-23   0.74 (282), −0.36 (257), −0.54 (292)


      Table 11: Impact of Blind Feedback on Average Precision and First Relevant Score
   Expt        ∆MAP     95% Conf           vs.        3 Extreme Diffs (Topic)
   FR-tde-td   0.031    ( 0.015, 0.047)    34-16-0    0.17 (273), 0.16 (290), −0.07 (268)
   PT-tde-td   0.029    ( 0.005, 0.053)    34-16-0    0.30 (290), 0.20 (275), −0.24 (274)
   HU-tde-td   0.025    ( 0.003, 0.047)    31-17-2    0.29 (254), 0.18 (290), −0.18 (279)
   BG-tde-td   0.023    (−0.002, 0.048)    29-18-2    0.50 (272), 0.14 (254), −0.10 (277)
               ∆FRS
   FR-tde-td   −0.002   (−0.041, 0.036)    10-6-34    −0.58 (282), −0.34 (272), 0.42 (252)
   HU-tde-td   −0.003   (−0.037, 0.032)    7-6-37     −0.39 (298), −0.37 (300), 0.38 (269)
   PT-tde-td   −0.014   (−0.038, 0.010)    6-7-37     −0.50 (258), −0.16 (277), 0.07 (269)
   BG-tde-td   −0.062   (−0.109,−0.016)    9-16-24    −0.63 (277), −0.50 (299), 0.13 (296)


References
[1] AltaVista’s Babel Fish Translation Service. http://babelfish.altavista.com/tr
[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/
[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. Sixteenth International
   Unicode Conference, 2000.
[4] Paul McNamee and James Mayfield. JHU/APL Experiments in Tokenization and Non-Word
   Translation. Working Notes for the CLEF 2003 Workshop, 2003.
[5] MTA SZTAKI: English-Hungarian, Hungarian-English Online Dictionary.
   http://dict.sztaki.hu/english-hungarian
[6] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page.
   http://research.nii.ac.jp/~ntcadm/index-en.html
[7] M. F. Porter. Snowball: A language for stemming algorithms. October 2001.
   http://snowball.tartarus.org/texts/introduction.html
[8] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu and M. Gatford. Okapi at
   TREC-3. Proceedings of TREC-3, 1995.
[9] Jacques Savoy. CLEF and Multilingual information retrieval resource page.
   http://www.unine.ch/info/clef/
[10] Börkur Sigurbjörnsson, Jaap Kamps and Maarten de Rijke. Overview of WebCLEF 2005. To
   appear in Working Notes for the CLEF 2005 Workshop, 2005.
[11] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/
[12] Stephen Tomlinson. European Web Retrieval Experiments with Hummingbird
   SearchServerTM at CLEF 2005. To appear in Working Notes for the CLEF 2005 Work-
   shop, 2005.
[13] Stephen Tomlinson. Experiments in 8 European Languages with Hummingbird
   SearchServerTM at CLEF 2002. Proceedings of CLEF 2002, 2003.
[14] Stephen Tomlinson. Lexical and Algorithmic Stemming Compared for 9 European Languages
   with Hummingbird SearchServerTM at CLEF 2003. Working Notes for the CLEF 2003 Work-
   shop, 2003.
[15] Stephen Tomlinson. Finnish, Portuguese and Russian Retrieval with Hummingbird
   SearchServerTM at CLEF 2004. Working Notes for the CLEF 2004 Workshop, 2004.