Finnish, Portuguese and Russian Retrieval with
     Hummingbird SearchServerTM at CLEF 2004
                                        Stephen Tomlinson
                                          Hummingbird
                                     Ottawa, Ontario, Canada
                               stephen.tomlinson@hummingbird.com
                                  http://www.hummingbird.com/

                                         August 16, 2004

                                               Abstract
          Hummingbird participated in the Finnish, Portuguese, Russian and French mono-
      lingual information retrieval tasks of the Cross-Language Evaluation Forum (CLEF)
      2004: for the natural language queries, find all the relevant documents (with high
      precision) in the CLEF 2004 document sets. SearchServer’s experimental lexical stem-
      mers significantly increased mean average precision for each of the 4 languages. For
      Finnish, mean average precision was significantly higher with SearchServer’s experi-
      mental decompounding option enabled. For each language, the submitted SearchServer
      run returned a relevant document in the first row for more than half of the short (Title-
      only) queries. At least one relevant document was returned in the first ten rows for
      75-90% of the short queries.


1     Introduction
Hummingbird SearchServer1 is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.
    SearchServer works in Unicode internally [3] and supports most of the world’s major character
sets and languages. The major conferences in text retrieval evaluation (CLEF [1], NTCIR [4] and
TREC [8]) have provided opportunities to objectively evaluate SearchServer’s support for more
than a dozen languages.
    This (draft) paper describes experimental work with SearchServer for the task of finding rele-
vant documents for natural language queries in 4 European languages (Finnish, Portuguese, Rus-
sian and French) using the CLEF 2004 test collections. Portuguese is new to CLEF this year, and
the experimental SearchServer version has some enhancements which substantially affect Finnish
and Russian, so we focus on these 3 languages.


2     Methodology
2.1    Data
The CLEF 2004 document sets consisted of tagged (SGML-formatted) news articles (mostly from
1995) in 4 different languages: Finnish, Portuguese, Russian and French. Table 1 gives the sizes.
   The CLEF organizers created 50 natural language “topics” (numbered 201-250) and translated
them into many languages. Each topic contained a “Title” (subject of the topic), “Description”
   1 SearchServerTM , SearchSQLTM and Intuitive SearchingTM are trademarks of Hummingbird Ltd. All other

copyrights, trademarks and tradenames are the property of their respective owners.
                           Table 1: Sizes of CLEF 2004 Test Collections
        Language          Text Size (uncompressed)     Documents Topics           Rel/Topic
        French           255,334,872 bytes (244 MB)          90,261        49         19
        Portuguese       185,739,565 bytes (177 MB)          55,070        46         15
        Finnish          143,902,109 bytes (137 MB)          55,344        45          9
        Russian           68,802,653 bytes (66 MB)           16,716        34          4


(a one-sentence specification of the information need) and “Narrative” (more detailed guidelines
for what a relevant document should or should not contain). The participants were asked to
use the Title and Description fields for at least one automatic submission per task this year to
facilitate comparison of results. Some topics were discarded for some languages because no relevant
documents existed for them. Table 1 gives the final number of topics for each language and their
average number of relevant documents. For more information on the CLEF test collections, see
the CLEF web site [1].

2.2    Indexing
Our indexing approach was the mostly the same as last year [10]. Accents were not indexed except
for the combining breve in Russian. The apostrophe was treated as a word separator for the 4
investigated languages. Our custom text reader, cTREC, was updated to maintain support for
the CLEF guidelines of only indexing specifically tagged fields (the new Portuguese collection
necessitated a minor update).
    Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these
experiments, our stop word lists for Portuguese and Russian were based on the Porter lists [5],
and this year we based on our Finnish list on Savoy’s [7]. We used our own list for French.
    By default, the SearchServer index supports both exact matching (after some Unicode-based
normalizations, such as decompositions and conversion to upper-case) and morphological matching
(e.g. inflections, derivations and compounds, depending on the linguistic component used).
    For many languages (including the 4 European languages investigated in CLEF 2004), Search-
Server includes the option of finding inflections based on lexical stemming (i.e. stemming based
on a dictionary or lexicon for the language). For example, in English, “baby”, “babied”, “babies”,
“baby’s” and “babying” all have “baby” as a stem. Specifying an inflected search for any of these
terms will match all of the others. The lexical stemming of the post-5.x experimental development
version of SearchServer used for the experiments in this paper was based on internal stemming
component 3.6.3.4 for the submitted runs and 3.7.0.15 for the diagnostic runs. We treat each
linguistic component as a black box in this paper.
    SearchServer typically does “inflectional” stemming which generally retains the part of speech
(e.g. a plural of a noun is typically stemmed to the singular form). It typically does not do “deriva-
tional” stemming which would often change the part of speech or the meaning more substantially
(e.g. “performer” is not stemmed to “perform”).
    SearchServer’s lexical stemming includes compound-splitting (decompounding) for compound
words in Finnish (and also some other languages not investigated this year, such as German, Dutch
and Swedish). For example, in German, “babykost” (baby food) has “baby” and “kost” as stems.
    SearchServer’s lexical stemming also supports some spelling variations. In English, British and
American spellings have the same stems, e.g. “labour” stems to “labor”, “hospitalisation” stems to
“hospitalization” and “plough” stems to “plow”.
    Lexical stemmers can produce more than one stem, even for non-compound words. For ex-
ample, in English, “axes” has both “axe” and “axis” as stems (different meanings), and in French,
“important” has both “important” (adjective) and “importer” (verb) as stems (different parts of
speech). SearchServer records all the stem mappings at index-time to support maximum recall
and does so in a way to allow searching to weight some inflections higher than others.
2.3    Searching
Unlike previous years, this year we experimented with SearchServer’s CONTAINS predicate (in-
stead of the IS_ABOUT predicate) though it should not make a difference to the ranking. Our
test application specified SearchSQL to perform a boolean-OR of the query words. For example,
for Russian topic 250 whose Title was “Бешенство у людей” (Rabies in Humans), a corresponding
SearchSQL query would be:

SELECT RELEVANCE(’2:3’) AS REL, DOCNO
FROM CLEF04RU
WHERE FT_TEXT CONTAINS ’Бешенство’|’у’|’людей’
ORDER BY REL DESC;

(Note that “у” is a stopword for Russian so its inclusion in the query won’t actually add any
matches.)
    Most aspects of SearchServer’s relevance value calculation are the same as described last year
[10]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a
manner similar to Okapi [6] and dampens the inverse document frequency using an approximation
of the logarithm. These calculations are based on the stems of the terms when doing morphological
searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’ was previously specified).
    An experimental new default is that SearchServer only includes morphological matches from
compound words if all of its stems (from a particular stemming interpretation) are in the same or
consecutive words. For example, in German, a morphological search for the compound “babykost”
(baby food) will no longer match “baby” or “kost” by themselves, but it will match “babykost” and
“baby kost” (and if SET PHRASE_DISTANCE 1 is specified, it will also match the hyphenated
“baby-kost”). Words (and compounds) still match inside compounds (and larger compounds), e.g.
a search for “kost” still matches “babykost”. To restore the old behaviour of matching if just one
stem is in common, one can specify the /decompound option (e.g. SET TERM_GENERATOR
‘word!ftelp/inflect/decompound’). See Section 3.3.1 for several more decompounding examples.
    This year’s experimental SearchServer version contains an enhancement for handling multiple
stemming interpretations. For each document, only the interpretation that produces the highest
score for the document is used in the relevance calculation (but all interpretations are still used for
matching and search term highlighting). Sometimes this enhancement causes the original query
form of the word to get more weight than some of its inflections (and it never gets less weight).
This approach overcomes the previous issue of terms with multiple stemming interpretations being
over-weighted; it used to be better for CLEF experiments to workaround by using the /single or
/noalt options, but Section 3.5 verifies that this is no longer the case.
    SearchServer’s RELEVANCE_METHOD setting can be used to optionally square the impor-
tance of the inverse document frequency (by choosing a RELEVANCE_METHOD of ‘2:4’ instead
of ‘2:3’). The importance of document length to the ranking is controlled by SearchServer’s
RELEVANCE_DLEN_IMP setting (scale of 0 to 1000). For all experiments in this paper, REL-
EVANCE_METHOD was set to ‘2:3’ and RELEVANCE_DLEN_IMP was set to 750.

2.4    Diagnostic Runs
For the diagnostic runs listed in Table 2, the run names consist of a language code (“FI” for Finnish,
“FR” for French, “PT” for Portuguese and “RU” for Russian) followed by one of the following labels:
   • “lex”: The run used SearchServer’s lexical stemming with decompounding enabled, i.e. SET
     TERM_GENERATOR ‘word!ftelp/inflect/decompound’. (Of the investigated languages,
     decompounding only makes a difference for Finnish.)
   • “compound” (Finnish only): Same as “lex” except that /decompound was not specified.
   • “single”: Same as “lex” except that /single was additionally specified (so that just one stem-
     ming interpretation was used).
                           Table 2: Scores of Diagnostic Title-only runs
            Run              AvgP       Robust@1         Robust@5        Robust@10
            FI-lex             0.561     32/45 (71%)       36/45 (80%)       38/45 (84%)
            FI-chain           0.553     30/45 (67%)       36/45 (80%)       39/45 (87%)
            FI-single          0.550     32/45 (71%)       35/45 (78%)       37/45 (82%)
            FI-compound        0.469     28/45 (62%)       30/45 (67%)       33/45 (73%)
            FI-alg             0.424     26/45 (58%)       30/45 (67%)       34/45 (76%)
            FI-none            0.328     19/45 (42%)       26/45 (58%)       27/45 (60%)
            RU-lex             0.430     19/34 (56%)       27/34 (79%)       27/34 (79%)
            RU-chain           0.405     18/34 (53%)       26/34 (76%)       26/34 (76%)
            RU-single          0.396     17/34 (50%)       26/34 (76%)       27/34 (79%)
            RU-alg             0.410     18/34 (53%)       26/34 (76%)       26/34 (76%)
            RU-none            0.220      9/34 (26%)       20/34 (59%)       22/34 (65%)
            PT-lex             0.405     24/46 (52%)       33/46 (72%)       35/46 (76%)
            PT-chain           0.411     24/46 (52%)       34/46 (74%)       35/46 (76%)
            PT-single          0.388     22/46 (48%)       31/46 (67%)       36/46 (78%)
            PT-alg             0.387     25/46 (54%)       33/46 (72%)       34/46 (74%)
            PT-none            0.327     18/46 (39%)       26/46 (57%)       31/46 (67%)
            FR-lex             0.422     25/49 (51%)       39/49 (80%)       44/49 (90%)
            FR-chain           0.418     26/49 (53%)       38/49 (78%)       42/49 (86%)
            FR-single          0.423     26/49 (53%)       39/49 (80%)       44/49 (90%)
            FR-alg             0.417     26/49 (53%)       38/49 (78%)       43/49 (88%)
            FR-none            0.361     22/49 (45%)       39/49 (80%)       42/49 (86%)


   • “alg”: The run used a different index based on the coarser algorithmic Porter “Snowball”
     stemmer [5] for the language. Decompounding is not available with this stemmer and the
     /single option is redundant.

   • “chain”: The run used a different index based on applying the SearchServer stemmer (as
     “lex”) and then the algorithmic stemmer.

   • “none”: The run disabled morphological searching, i.e. SET TERM_GENERATOR ‘’.

   Note that all diagnostic runs just used the Title field of the topic.

2.5    Evaluation Measures
The primary evaluation measure in this paper is “mean average precision” based on the first 1000
retrieved documents for each topic (denoted “AvgP” in Tables 2 and 9). “Average precision” for a
topic is the average of the precision after each relevant document is retrieved (using zero as the
precision for relevant documents which are not retrieved). The score ranges from 0.0 (no relevants
found) to 1.0 (all relevants found at the top of the list). For a set of topics, all topics are weighted
equally by the mean. Average precision takes into account both precision and recall, and it is
very good for detecting retrieval differences because even small differences in the ranks of relevant
documents affect the score.
    A more experimental measure is “robustness at 10 documents” (denoted “Robust@10”) which
is the percentage of topics for which at least one relevant document was returned in the first 10
rows (this was one of the measures investigated in the TREC Robust Retrieval track last year
[11]). This measure hides a lot of retrieval differences (particularly in recall), but it may be an
indicator of a user’s impression of a method’s robustness across topics. We also list the Robust@1
and Robust@5 variants.
                      Table 3: Impact of Lexical Stemming on Average Precision
    Expt              AvgDiff     95% Conf          vs.         3 Extreme Diffs (Topic)
    FI lex-none        0.233      ( 0.146, 0.326)       31-9-5      1.00 (224), 0.96 (210), −0.24 (208)
    RU lex-none        0.209      ( 0.108, 0.325)      22-1-11      1.00 (250), 1.00 (203), −0.04 (228)
    PT lex-none        0.078      ( 0.037, 0.125)      25-8-13      0.61 (213), 0.53 (229), −0.08 (248)
    FR lex-none        0.061      ( 0.030, 0.096)      23-20-6      0.42 (229), 0.40 (235), −0.07 (216)


2.6     Statistical Significance Tables
For tables comparing 2 diagnostic runs (such as Table 3), the columns are as follows:
    • “Expt” specifies the experiment. The language code is given, followed by the labels of the
      2 runs being compared. The difference is the first run minus the second run. For example,
      “FI lex-none” specifies the difference of subtracting the scores of the Finnish ‘none’ run from
      the Finnish ‘lex’ run (of Table 2).
    • “AvgDiff” is the difference of the mean scores of the two runs being compared (the table
      heading says which evaluation measure is being compared).
    • “95% Conf” is an approximate 95% confidence interval for the difference calculated using
      Efron’s bootstrap percentile method2 [2] (using 100,000 iterations). If zero is not in the
      interval, the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely
      to be of neutral impact, though if the average difference is small (e.g. <0.020) it may still
      be too minor to be considered “significant” in the magnitude sense.
    • “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
      compared to the second run. These numbers should always add to the number of topics (45
      for Finnish, 49 for French, 46 for Portuguese, 34 for Russian).
    • “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the
      topic number in brackets (the topic numbers range from 201 to 250). The first difference
      is the largest one of any topic (based on the absolute value). The third difference is the
      largest difference in the other direction (so the first and third differences give the range of
      differences observed in this experiment). The middle difference is the largest of the remaining
      differences (based on the absolute value).


3     Results of Morphological Experiments
This section looks at the differences between the runs of Table 2 in more detail.

3.1     Impact of Lexical Stemming
Table 3 isolates the impact of SearchServer’s lexical stemming on the average precision measure
(e.g. “FI lex-none” is the difference of the “FI-lex” and “FI-none” runs of Table 2). For each of the
4 languages, the increase in mean average precision was statistically significant (i.e. zero is not in
the approximate 95% confidence interval). Note that for some queries, it is still better to only
match the original query form (not inflections); SearchServer allows this option to be controlled
for each query term at search-time.
    Table 4 isolates the impact of SearchServer’s lexical stemming on the Robust@10 measure.
For each language, at least one relevant was found in the first 10 rows more often with inflections
enabled than disabled; the increases were statistically significant for Finnish and Russian.
   2 See [9] for some comparisons of confidence intervals from the bootstrap percentile, Wilcoxon signed rank and

standard error methods for both average precision and Precision@10.
             Table 4: Impact of Lexical Stemming on Robustness at 10 Documents
  Expt            AvgDiff      95% Conf         vs.         3 Extreme Diffs (Topic)
  FI lex-none       0.244     ( 0.133, 0.378)    11-0-34     1.00 (224), 1.00 (215), 0.00 (250)
  RU lex-none       0.147     ( 0.029, 0.265)    5-0-29      1.00 (221), 1.00 (244), 0.00 (226)
  PT lex-none       0.087    (−0.001, 0.196)     5-1-40     1.00 (243), 1.00 (234), −1.00 (235)
  FR lex-none       0.041    (−0.041, 0.123)     3-1-45     1.00 (241), 1.00 (239), −1.00 (222)


                Table 5: Lexical vs. Algorithmic Stemming on Average Precision
   Expt          AvgDiff      95% Conf           vs.          3 Extreme Diffs (Topic)
   FI lex-alg     0.137      ( 0.064, 0.219)     26-12-7    0.98 (210), 0.86 (226), −0.15 (219)
   RU lex-alg     0.019     (−0.003, 0.050)      7-8-19     0.40 (227), 0.20 (202), −0.07 (224)
   PT lex-alg     0.018     (−0.014, 0.055)     16-14-16    0.53 (229), 0.32 (217), −0.27 (204)
   FR lex-alg     0.005     (−0.003, 0.013)     18-14-17    −0.09 (203), 0.06 (209), 0.08 (231)


3.2     Comparison with Algorithmic Stemming
Table 5 contains the results of a diagnostic experiment comparing average precision when the only
difference is the stemmer used: the experimental SearchServer lexical stemmer or Porter’s algo-
rithmic stemmer. Positive differences indicate that the SearchServer stemmer led to a higher score
and negative differences indicate that the algorithmic stemmer led to a higher score. Using Search-
Server’s stemmer scored higher on average for each language and this increase was statistically
significant for Finnish.
    In this section, we look at the Portuguese and Russian topics with the largest differences.
Finnish is examined in more detail in the subsequent decompounding section. French was inves-
tigated in last year’s paper [10].

3.2.1    Portuguese Stemming
Topic PT-229: Table 5 shows that the largest difference between the stemming approaches for
Portuguese was on topic 229 (Construção de Barragens (Dam Building)) in which average pre-
cision was 53 points higher with SearchServer’s stemmer. The main reason was that, unlike the
algorithmic stemmer, the SearchServer stemmer matched “Barragem”, an inflection used in many
relevant documents. SearchServer additionally matched “construções” which may also have been
helpful.
    Topic PT-217: The next largest difference for Portuguese was on topic 217 (Sida em África
(AIDS in Africa)) for which Table 5 shows that average precision was 32 points higher with
SearchServer’s stemmer. The main reason was that, unlike SearchServer, the algorithmic stem-
mer matched “sido”, a common word unrelated to AIDS, which decreased precision substantially.
SearchServer additionally matched “africanos” which may also have been helpful.
    Topic PT-204: The largest negative difference was on topic 204 (Vı́timas de Avalanches (Vic-
tims of Avalanches)) for which using the algorithmic stemmer scored 27 points higher. Both
stemmers matched “Avalanche” but the algorithmic stemmer additionally matched “avalancha”
which was the only variant used in 3 of the relevant documents. We should investigate this case
further.

3.2.2    Russian Stemming
Topic RU-227: Table 5 shows that the largest difference between the stemming approaches for
Russian was on topic 227 (Алтайская амазонка (Altai Ice Maiden)) for which average precision
             Table 6: Decompounding Experiments (Finnish) on Average Precision
 Expt              AvgDiff      95% Conf           vs.            3 Extreme Diffs (Topic)
 FI lex-cmpd        0.092     ( 0.034, 0.162)    17-9-19     0.98 (210), 0.72 (226), −0.18 (219)
 FI cmpd-none       0.141     ( 0.075, 0.214)    27-10-8     1.00 (224), 0.81 (204), −0.22 (208)
 FI cmpd-alg        0.045    (−0.001, 0.094)    19-14-12     0.57 (204), 0.50 (216), −0.40 (205)


was 40 points higher with SearchServer’s stemmer. SearchServer internally produced 2 stems
for “Алтайская” (Altai), itself and “Алтайскай”. The words which had “Алтайская” as a stem
(such as “Алтайской”, “Алтайские”, “Алтайскую” and “алтайских”) were less common in the
documents than the words which shared the “Алтайскай” stem (the same words plus more such
as “Алтайского”, “Алтайском” and “Алтайскому”), so SearchServer’s experimental new scoring
scheme for alternative stems gives the former group a higher weight from inverse document fre-
quency than the latter group. In this case, it turned out just 1 relevant document was matched
by either stemmer and it just used the original word “Алтайская”. The algorithmic stemmer pro-
duced just one stem for these words, so its weighting did not have a preference for the query form
and some documents with the second group of terms ended up ranking higher. The algorithmic
stemmer additionally matched “Алтайске” which was not helpful in this case. This topic illustrates
a benefit from SearchServer’s experimental new handling of multiple stemming interpretations.
    Topic RU-202: The next largest difference was on topic 202 (Арест Ника Леесон (Nick Leeson’s
Arrest)) for which the score was 20 points higher with SearchServer’s stemmer. The 3 relevant
documents used different spellings for “Leeson” (“Лисон”, “Лизона”, “Лизон” and “Лисона”) which
did not match the query form of “Леесон” with either stemmer. And inflections of “Арест” (Arrest)
did not appear in the relevant documents. So the matches just came from variants of “Nick”.
Both stemmers matched the forms used in the relevant documents (“Ника” and “Ник”). But the
algorithmic stemmer additionally matched other terms such as “Никому” and “никого” which
lowered precision substantially in this case.

3.3     Impact of Decompounding (Finnish)
The first row of Table 6 (“FI lex-cmpd”) isolates the impact of SearchServer’s experimental new
“/decompound” option for Finnish (decompounding is not new to SearchServer for Finnish, but
an option to control its impact separately from inflectional stemming at search-time is). This
option allows words to match if they share any stem of query compound words. Without the
/decompound option, the (experimental new) default is to require all the stems of a compound
word to be in the same or consecutive words to be considered a match. Table 6 shows that mean
average precision was 9 points higher with /decompound set, and this difference was statistically
significant.
    The second row of Table 6 (“FI cmpd-none”) shows that even without the /decompound option,
use of SearchServer’s stemming for Finnish scored 14 points higher than not using stemming. (Note
that the first two rows of Table 6 add up to the 23 point gain from lexical stemming shown in
Table 3.)
    The third row of Table 6 (“FI cmpd-alg”) compares SearchServer’s stemming without the
/decompound option to algorithmic stemming (which does not even decompound at index-time)
and shows that using SearchServer’s stemmer scored 4.5 points higher, though this difference did
not quite pass the statistical significance test. (SearchServer’s stemming with the /decompound
option is compared to algorithmic stemming in Table 5 in which the difference is the sum of the
differences of rows 1 and 3 of Table 6.)
    We look at some Finnish topics in more detail to understand these results better.
      Table 7: Impact of Adding Algorithmic to Lexical Stemming on Average Precision
 Expt           AvgDiff     95% Conf           vs.           3 Extreme Diffs (Topic)
 FI chain-lex     −0.008    (−0.037, 0.019)    13-16-16    −0.38 (203), −0.30 (215), 0.31 (205)
 RU chain-lex     −0.025    (−0.062, 0.000)      5-8-21    −0.50 (226), −0.20 (202), 0.02 (232)
 PT chain-lex      0.006    (−0.007, 0.023)     9-14-23     0.27 (204), 0.08 (205), −0.15 (232)
 FR chain-lex     −0.004    (−0.010, 0.003)     8-18-23    0.09 (203), −0.06 (220), −0.06 (209)


3.3.1   Finnish Decompounding
Topic FI-210: Table 6 shows that the largest impact of Finnish decompounding was on topic
210 (Nobel rauhanpalkintoehdokkaat (Nobel Peace Prize Candidates)) for which using Search-
Server’s stemmer with the /decompound option scored 98 points higher than not using /decom-
pound (and also 98 points higher than using the algorithmic stemmer according to Table 5). This
topic had just 1 relevant document, and the only match for the non-decompounding approaches
was the word “Nobel” which occurred in lots of documents, so the relevant document did not
stand out among them. With SearchServer’s decompounding, many more words in the relevant
document matched such as “rauhan”, “rauhanpalkituksi”, “rauhanpalkinnon”, “rauhanvälittäjänä”,
“ehdokasta” and “ehdokkaina” because these words shared at least one (but not all) the stems of
the query compound “rauhanpalkintoehdokkaat”, and the relevant document was ranked first.
     Topic FI-226: Table 6 shows that the next largest impact of Finnish decompounding was on
topic 226 (Sukupuolenvaihdosleikkaukset (Sex-change Operations)) for which using SearchServer’s
stemmer with the /decompound option scored 72 points higher than not using /decompound
(and also 86 points higher than using the algorithmic stemmer according to Table 5). The al-
gorithmic stemmer just found the one of the 13 relevant documents which contained the query
word “Sukupuolenvaihdosleikkaukset”. SearchServer without /decompound matched that docu-
ment plus 3 other relevants, two which contained “sukupuolen vaihdosleikkaukseen” (an example of
a consecutive-word match) and one which contained “Sukupuolenvaihdosleikkausta”. SearchServer
with /decompound matched all 13 relevant documents; the key additional matches appeared to
be “Sukupuolen-vaihdos”, “sukupuolenvaihtoleikkaukset”, “sukupuolenvaihdot”, “Sukupuolenvaih-
doshan”, “sukupuolenkorjausleikkausten” and “sukupuolenvahvistusleikkaus”, though other match-
ing words may also have been helpful such as “leikkaussali”, “sukupuoli” and “vaihdos”.
     Topic FI-219: Table 6 shows that the largest negative impact of Finnish decompounding
was on topic 219 (EU:n komissaariehdokkaat (EU Commissioner Candidates)) for which using
SearchServer’s stemmer with the /decompound option scored 18 points lower than not using /de-
compound (and also 15 points lower than using the algorithmic stemmer according to Table 5).
Without the /decompound option, SearchServer found a lot of precise matches in relevant doc-
uments such as “komissaariehdokasta”, “komissaariehdokkaalle”, “komissaariehdokkaista”, “komis-
saariehdokkaalta”, “komissaariehdokkaiden” and “komissaariehdokkaan”. Furthermore, in some
relevant documents it found matches in larger compounds (which the algorithmic stemmer could
not) such as “naiskomissaariehdokasta” and “tanskalaiseltakomissaariehdokkaalta”. With /decom-
pound set, SearchServer would also find all these matches, but precision was substantially hurt in
this case by additionally matching terms in non-relevant documents such as “jäsenehdokkaiden”,
“jäsenehdokkaita”, “ykkösehdokkaista”, “tutkimuskomissaari” and “henkilöstökomissaari”. This
topic shows why a user may prefer to have /decompound not set; in cases where the user does not
need the component words to occur together, the user can either manually separate the terms or
set the /decompound option. But for automatic ad hoc searches for topics, it is better on average
to use the /decompound option.
                  Table 8: Impact of Using All Lexical Stems on Average Precision
    Expt           AvgDiff      95% Conf          vs.            3 Extreme Diffs (Topic)
    FI lex-sing    0.010     (−0.002, 0.032)     7-10-28      0.44 (215), 0.05 (236), −0.02 (233)
    RU lex-sing    0.033      ( 0.002, 0.082)     9-3-22      0.67 (203), 0.31 (210), −0.01 (233)
    PT lex-sing     0.017    (−0.004, 0.049)     6-11-29      0.60 (213), 0.16 (236), −0.12 (248)
    FR lex-sing    −0.001    (−0.004, 0.003)      2-9-38     0.06 (235), −0.03 (248), −0.03 (215)


3.4        Adding Algorithmic to Lexical Stemming
Table 7 shows the impact of applying the algorithmic stemmer to the result of SearchServer’s
stemmer (this is possible because SearchServer’s stemmer returns real words; the other order would
not work because the algorithmic stemmer often truncates to a non-word). This approach would
still produce all the matches of SearchServer’s stemming and may sometimes produce additional
matches from algorithmic stemming. However, there was a decrease in mean average precision for
Russian which was borderline significant. The other differences were not statistically significant.
While algorithmic stemming may occasionally add a helpful match, it can also add poor matches
that hurt precision. In a future experiment, perhaps it would be better to treat algorithmic stems
as alternative stemming interpretations (instead of replacing the lexical stem) so that lexical
inflections are likely to get higher weight when the algorithmic stem is too common.

3.5        Impact of Using All Lexical Stems
Table 8 shows the impact of using all stemming interpretations from SearchServer’s lexical stem-
ming instead of arbitrarily just using the first one. The increase in mean average precision was
statistically significant for Russian. On the individual topics, there were some large increases, but
(reassuringly) no correspondingly large decreases. In past years, mean average precision was typ-
ically lower when including all the stems because of over-weighting issues, so this result suggests
that the enhancement for handling multiple stemming interpretations has succeeded at addressing
this issue.
    Topic RU-203: The largest difference for Russian was on topic 203 (Партизанская война в
Восточном Тиморе (East Timor Guerrillas)) for which the score was 67 points higher when using
all stemming interpretations. The query word “Восточном” (Eastern) had 2 stems, “Восточнома”
and “восточный”. The inflections in the relevant document (namely “Восточного”, “Восточный”,
“восточных” and “восточной”) only shared the latter stem.


4      Submitted Runs
In the identifiers of the runs submitted for assessment in May 2004 (e.g. “humFI04tde”), the first
3 letters “hum” indicate a Hummingbird submission, the next 2 letters are the language code, and
the number “04” indicates CLEF 2004. “t”, “d” and “n” indicate that the Title, Description and
Narrative field of the topic were used (respectively). “e” indicates that query expansion from blind
feedback on the first 2 rows was used (see last year’s paper [10] for more details). The submitted
runs all used inflections from SearchServer’s lexical stemming (including decompounding where
applicable). The scores of the submitted runs are listed in Table 9.
    The submitted Title-only runs (e.g. “humFI04t” of Table 9) correspond to the “lex” diagnostic
runs (e.g. “FI-lex” of Table 2) except that the submitted runs used an older experimental version
of SearchServer (including an older version of the lexical stemming component) so the scores are
not exactly the same.
                             Table 9: Scores of Submitted Runs
           Run              AvgP      Robust@1        Robust@5        Robust@10
           humFI04t         0.556    34/45 (76%)     35/45 (78%)      37/45 (82%)
           humFI04td        0.593    31/45 (69%)     37/45 (82%)      38/45 (84%)
           humFI04tde       0.637    32/45 (71%)     40/45 (89%)      42/45 (93%)
           humRU04t         0.430    19/34 (56%)     27/34 (79%)      27/34 (79%)
           humRU04td        0.409    17/34 (50%)     26/34 (76%)      27/34 (79%)
           humRU04tde       0.443    17/34 (50%)     26/34 (76%)      27/34 (79%)
           humPT04t         0.405    24/46 (52%)     33/46 (72%)      35/46 (76%)
           humPT04td        0.453    23/46 (50%)     32/46 (70%)      34/46 (74%)
           humPT04tde       0.475    23/46 (50%)     32/46 (70%)      35/46 (76%)
           humFR04t         0.421    25/49 (51%)     39/49 (80%)      44/49 (90%)
           humFR04td        0.458    26/49 (53%)     43/49 (88%)      44/49 (90%)
           humFR04tde       0.493    26/49 (53%)     43/49 (88%)      43/49 (88%)


References
[1] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/

[2] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. 1993. Chapman &
   Hall/CRC.

[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. In Sixteenth Interna-
   tional Unicode Conference, Amsterdam, The Netherlands, March 2000.

[4] NTCIR       (NII-NACSIS      Test   Collection   for    IR     Systems)    Home      Page.
   http://research.nii.ac.jp/∼ntcadm/index-en.html

[5] M. F. Porter. Snowball:        A language for stemming       algorithms.   October    2001.
   http://snowball.tartarus.org/texts/introduction.html

[6] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford.
   (City University.) Okapi at TREC-3. In D. K. Harman, editor, Overview of the
   Third Text REtrieval Conference (TREC-3). NIST Special Publication 500-226.
   http://trec.nist.gov/pubs/trec3/t3_proceedings.html

[7] Jacques Savoy. CLEF and Multilingual information retrieval. http://www.unine.ch/info/clef/
[8] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/
[9] Stephen Tomlinson. Experiments in 8 European Languages with Hummingbird SearchServerTM
   at CLEF 2002. In Carol Peters, editor, Working Notes for the CLEF 2002 Workshop.
   http://clef.iei.pi.cnr.it:2002/workshop2002/WN/26.pdf
[10] Stephen Tomlinson. Lexical and Algorithmic Stemming Compared for 9 European Languages
   with Hummingbird SearchServerTM at CLEF 2003. In Carol Peters, editor, Working Notes for
   the CLEF 2003 Workshop. http://clef.iei.pi.cnr.it/2003/WN_web/19.pdf
[11] Ellen M. Voorhees. Overview of the TREC 2003 Robust Retrieval                       Track.
   In Proceedings of the Twelfth Text REtrieval Conference (TREC                         2003).
   http://trec.nist.gov/pubs/trec12/t12_proceedings.html