=Paper=
{{Paper
|id=Vol-1171/CLEF2005wn-adhoc-Tomlinson2005
|storemode=property
|title=European Ad Hoc Retrieval Experiments with Hummingbird SearchServerTM at CLEF 2005
|pdfUrl=https://ceur-ws.org/Vol-1171/CLEF2005wn-adhoc-Tomlinson2005.pdf
|volume=Vol-1171
|dblpUrl=https://dblp.org/rec/conf/clef/Tomlinson05c
}}
==European Ad Hoc Retrieval Experiments with Hummingbird SearchServerTM at CLEF 2005==
European Ad Hoc Retrieval Experiments with
Hummingbird SearchServer™ at CLEF 2005
Stephen Tomlinson
Hummingbird
Ottawa, Ontario, Canada
stephen.tomlinson@hummingbird.com
http://www.hummingbird.com/
August 21, 2005
Abstract
Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgarian,
French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language
Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50
natural language queries, and the goal was to find all of the relevant documents (with
high precision) in a particular document set. We conducted diagnostic experiments
with different techniques for matching word variations and handling stopwords. We
found that the experimental stemmers significantly increased mean average precision
for the 4 languages. Analysis of individual topics found that the algorithmic Bulgarian
and Hungarian stemmers encountered some unanticipated stopword collisions. A
comparison to an experimental 4-gram technique suggested that Hungarian stemming
would further benefit from decompounding. A blind feedback technique which
significantly increased mean average precision for some languages was also significantly
detrimental to the rank of the first relevant retrieved for one language.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information
Search and Retrieval
General Terms
Measurement, Performance, Experimentation
Keywords
Bulgarian Retrieval, Hungarian Retrieval, First Relevant Score, Per-Topic Analysis
1 Introduction
Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.
SearchServer works in Unicode internally [3] and supports most of the world’s major character
sets and languages. The major conferences in text retrieval experimentation (CLEF [2],
NTCIR [6] and TREC [11]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.
¹ SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.
Table 1: Sizes of CLEF 2005 Ad-Hoc Track Test Collections
Language    Text Size (uncompressed)  Documents  Topics  Rel/Topic
Portuguese  591,987,753 bytes         210,734    50      58 (lo 2, med 44, hi 239)
French      508,863,606 bytes         177,452    50      51 (lo 1, med 35, hi 185)
Bulgarian   216,432,023 bytes         69,195     49      16 (lo 1, med 10, hi 69)
Hungarian   106,631,823 bytes         49,530     50      19 (lo 1, med 13, hi 87)
This (draft) paper describes experimental work with SearchServer for the task of finding relevant
documents for natural language queries in 4 European languages (Bulgarian, French, Hungarian
and Portuguese) using the CLEF 2005 Ad-Hoc Track test collections.
2 Methodology
2.1 Data
The CLEF 2005 Ad-Hoc Track document sets consisted of tagged (SGML-formatted) news articles
in 4 different languages: Bulgarian, French, Hungarian and Portuguese. Table 1 gives the sizes.
The CLEF organizers created 50 natural language “topics” (numbered 251-300) and translated
them into many languages. One topic was discarded for Bulgarian because it had no relevant
documents. Table 1 gives the final number of topics for each language and their average number
of relevant documents (along with the lowest, median and highest number of relevant documents
of any topic). For more information on the CLEF test collections, see the track overview paper.
2.2 Indexing
Our indexing approach was mostly the same as last year [15]. Accents were not indexed except
for the combining breve in Bulgarian. The apostrophe was treated as a word separator for the 4
investigated languages. The custom text reader, cTREC, was updated to maintain support for
the CLEF guidelines of only indexing specifically tagged fields.
Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these
experiments, the stop word list for Portuguese was based on the Porter list [7], and the lists for
Bulgarian and Hungarian were based on Savoy’s [9]. We used our own list for French.
Unlike previous years, this year we added AL=“0-9” to the stopfiles to specify that the digits
0-9 were to be treated as alphabet characters (e.g. so that “G7” would be indexed as 1 term instead
of 2).
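As a rough illustration of the effect of treating digits as alphabet characters, the following Python sketch splits text into terms both ways. It is not the SearchServer text reader; the function name `tokenize` and the regular expressions are assumptions made only for this illustration.

```python
import re

def tokenize(text, digits_are_letters):
    """Split text into terms; a hypothetical stand-in for the real text reader."""
    # With digits counted as alphabet characters, a run like "G7" stays together.
    # Otherwise letter runs and digit runs become separate terms.
    pattern = r"[^\W_]+" if digits_are_letters else r"[^\W\d_]+|\d+"
    return re.findall(pattern, text)

print(tokenize("G7 summit", digits_are_letters=False))  # ['G', '7', 'summit']
print(tokenize("G7 summit", digits_are_letters=True))   # ['G7', 'summit']
```

In this sketch the apostrophe also acts as a word separator, because it is not matched by either pattern.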
By default, the SearchServer index supports both exact matching (after some Unicode-based
normalizations, such as decompositions and conversion to upper-case) and morphological matching
(e.g. inflections, derivations and compounds, depending on the linguistic component used).
For many languages (including French and Portuguese), SearchServer provides the option of
finding inflections based on lexical stemming (i.e. stemming based on a dictionary or lexicon for
the language). For example, in English, “baby”, “babied”, “babies”, “baby’s” and “babying” all have
“baby” as a stem. Specifying an inflected search for any of these terms will match all of the others.
The lexical stemming of the post-6.0 experimental development version of SearchServer used for
the experiments in this paper was based on internal stemming component 3.7.0.15. We treat each
linguistic component as a black box in this paper.
Lexical stemming in SearchServer typically does “inflectional” stemming which generally retains
the part of speech (e.g. a plural of a noun is typically stemmed to the singular form). It typically
does not do “derivational” stemming which would often change the part of speech or the meaning
more substantially (e.g. “performer” is not stemmed to “perform”).
Lexical stemming in SearchServer includes compound-splitting (decompounding) for compound
words in particular languages (such as Dutch, Finnish, German and Swedish). For example, in
German, “babykost” (baby food) has “baby” and “kost” as stems.
Lexical stemmers can produce more than one stem, even for non-compound words. For example,
in English, “axes” has both “axe” and “axis” as stems (different meanings), and in French,
“important” has both “important” (adjective) and “importer” (verb) as stems (different parts of
speech). SearchServer records all the stem mappings at index-time to support maximum recall
and does so in a way to allow searching to weight some inflections higher than others.
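The idea of recording every stem mapping at index time can be sketched in a few lines of Python. This is a toy model under stated assumptions: the lexicon is hand-made from the examples in the text, the names `TOY_LEXICON`, `build_stem_index` and `inflected_search` are hypothetical, and the weighting of some inflections above others is not modelled.

```python
from collections import defaultdict

# Hand-made toy lexicon built from the examples in the text; a real lexical
# stemmer would look stems up in a full dictionary for the language.
TOY_LEXICON = {
    "axes": {"axe", "axis"},
    "axe": {"axe"},
    "axis": {"axis"},
    "babies": {"baby"},
    "baby": {"baby"},
}

def build_stem_index(docs):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            # Unknown words are treated as their own stem.
            for stem in TOY_LEXICON.get(word, {word}):
                index[stem].add(doc_id)
    return index

def inflected_search(index, term):
    """Return the documents that share at least one stem with the query term."""
    hits = set()
    for stem in TOY_LEXICON.get(term, {term}):
        hits |= index.get(stem, set())
    return hits

docs = {1: ["axe"], 2: ["axis"], 3: ["babies"]}
index = build_stem_index(docs)
print(sorted(inflected_search(index, "axes")))  # [1, 2]
print(sorted(inflected_search(index, "baby")))  # [3]
```

Because “axes” maps to two stems, a search for it reaches documents containing either “axe” or “axis”, which is the recall-oriented behaviour described above.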
2.3 Searching
We experimented with the SearchServer CONTAINS predicate. Our test application specified
SearchSQL to perform a boolean-OR of the query words. For example, for Bulgarian topic 279
whose Title was “Референдуми в Швейцария” (Swiss referendums), a corresponding SearchSQL
query would be:
SELECT RELEVANCE('2:3') AS REL, DOCNO
FROM CLEF05BG
WHERE FT_TEXT CONTAINS 'Референдуми'|'в'|'Швейцария'
ORDER BY REL DESC;
(Note that “в” is a stopword for Bulgarian so its inclusion in the query wouldn’t actually add
any matches.)
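A test application could assemble such a query string from the topic Title along the following lines. This Python sketch is only an illustration of the boolean-OR construction: the function name `build_searchsql` is hypothetical, it is not our actual test application, and it does not handle quoting or escaping of special characters inside a word.

```python
def build_searchsql(table, title_words, relevance_method="2:3"):
    """Build a boolean-OR CONTAINS query string in the shape shown above."""
    # Each query word becomes a quoted literal; "|" is the boolean-OR operator.
    # Assumption: the words contain no quote characters (no escaping is done).
    quoted = "|".join("'" + word + "'" for word in title_words)
    return (
        f"SELECT RELEVANCE('{relevance_method}') AS REL, DOCNO\n"
        f"FROM {table}\n"
        f"WHERE FT_TEXT CONTAINS {quoted}\n"
        f"ORDER BY REL DESC;"
    )

print(build_searchsql("CLEF05BG", "Референдуми в Швейцария".split()))
```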
Most aspects of the SearchServer relevance value calculation are the same as described last
year [15]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a
manner similar to Okapi [8] and dampens the inverse document frequency using an approximation
of the logarithm. These calculations are based on the stems of the terms (roughly speaking)
when doing morphological searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’
was previously specified). The SearchServer RELEVANCE_METHOD setting was set to ‘2:3’
and RELEVANCE_DLEN_IMP was set to 750 for all experiments in this paper.
2.4 Diagnostic Runs
For the diagnostic runs listed in Table 2, the run names consist of a language code (“BG” for
Bulgarian, “FR” for French, “HU” for Hungarian and “PT” for Portuguese) followed by one of the
following labels:
• “lex”: (FR and PT only): The run used SearchServer lexical stemming. The /inflect option
(SET TERM_GENERATOR ‘word!ftelp/inflect’) was specified.
• “lexnos”: Same as “lex” except that /nostop was additionally specified which prevents query
terms from being discarded if all of their stems are stopwords (note that stopwords themselves
were still not found because they were not indexed).
• “lexall”: Same as “lex” except that a separate index was used which did not stop any words
from being indexed (specifying /nostop would make no difference with this index).
• “lexsing”: Same as “lex” except that /single was additionally specified (so that just one
stemming interpretation was used at search time).
• “neu” (BG and HU only): Same as “lex” except that the experimental Neuchatel stemmer
was used [9].
• “neunos”: Same as “lexnos” except that the Neuchatel stemmer was used.
• “neuall”: Same as “lexall” except that the Neuchatel stemmer was used.
• “neuposs” (HU only): Same as “neu” except that the call to the remove_possessive function
was skipped. (Prof. Savoy suggested to us that it was unclear if removing possessive pronouns
was a good idea, which we interpreted as uncertainty about the remove_possessive function.)
• “sn” (FR and PT only): Same as “lex” except that the Porter (Snowball) stemmer [7] was
used.
• “snru” (BG only): Same as “neu” except that the Porter (Snowball) stemmer for Russian
was used.
• “4gram”: Same as “lexall” except that the run used a different index which primarily consisted
of the 4-grams of terms, e.g. the word ‘search’ would produce index terms of ‘sear’, ‘earc’
and ‘arch’. No stemming was done; searching used the IS_ABOUT predicate (instead of
the CONTAINS predicate) with morphological options disabled to search for the 4-grams of
the query terms.
• “none”: The run disabled morphological searching. (The run used the same index as “lex” for
FR and PT and the same index as “neu” for HU and BG, but SET TERM_GENERATOR
‘’ was specified so that variations from stemming were not matched.)
Note that all diagnostic runs just used the Title field of the topic.
Table 2: Mean Scores of Diagnostic Title-only runs
Run FRS Success@1 Success@5 Success@10 MRR MAP
BG-neuall 0.782 15/49 (31%) 38/49 (78%) 41/49 (84%) 0.500 0.255
BG-neunos 0.781 16/49 (33%) 38/49 (78%) 41/49 (84%) 0.507 0.263
BG-4gram 0.758 20/49 (41%) 32/49 (65%) 40/49 (82%) 0.525 0.264
BG-snru 0.757 17/49 (35%) 34/49 (69%) 40/49 (82%) 0.499 0.242
BG-neu 0.749 15/49 (31%) 35/49 (71%) 39/49 (80%) 0.476 0.259
BG-none 0.685 14/49 (29%) 30/49 (61%) 35/49 (71%) 0.440 0.195
FR-sn 0.820 27/50 (54%) 40/50 (80%) 43/50 (86%) 0.645 0.318
FR-lex 0.810 25/50 (50%) 39/50 (78%) 42/50 (84%) 0.618 0.302
FR-lexnos 0.810 25/50 (50%) 39/50 (78%) 42/50 (84%) 0.618 0.302
FR-lexall 0.810 25/50 (50%) 39/50 (78%) 43/50 (86%) 0.618 0.301
FR-4gram 0.809 24/50 (48%) 41/50 (82%) 43/50 (86%) 0.617 0.279
FR-lexsing 0.802 25/50 (50%) 39/50 (78%) 42/50 (84%) 0.615 0.299
FR-none 0.778 20/50 (40%) 38/50 (76%) 43/50 (86%) 0.549 0.232
HU-4gram 0.834 24/50 (48%) 39/50 (78%) 45/50 (90%) 0.619 0.341
HU-neunos 0.789 26/50 (52%) 36/50 (72%) 42/50 (84%) 0.625 0.287
HU-neuall 0.788 25/50 (50%) 37/50 (74%) 41/50 (82%) 0.614 0.280
HU-neu 0.788 25/50 (50%) 37/50 (74%) 42/50 (84%) 0.613 0.274
HU-neuposs 0.769 24/50 (48%) 36/50 (72%) 41/50 (82%) 0.588 0.271
HU-none 0.671 17/50 (34%) 30/50 (60%) 37/50 (74%) 0.464 0.184
PT-sn 0.892 30/50 (60%) 43/50 (86%) 47/50 (94%) 0.712 0.269
PT-lexall 0.865 30/50 (60%) 42/50 (84%) 46/50 (92%) 0.707 0.300
PT-lex 0.856 31/50 (62%) 42/50 (84%) 45/50 (90%) 0.714 0.300
PT-lexnos 0.856 31/50 (62%) 42/50 (84%) 45/50 (90%) 0.714 0.300
PT-lexsing 0.843 30/50 (60%) 40/50 (80%) 44/50 (88%) 0.699 0.290
PT-none 0.821 28/50 (56%) 39/50 (78%) 43/50 (86%) 0.662 0.246
PT-4gram 0.815 27/50 (54%) 41/50 (82%) 41/50 (82%) 0.662 0.231
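The 4-gram term generation described for the “4gram” runs can be sketched as follows. This is a minimal Python illustration, not the SearchServer indexing code; the function name `char_ngrams` and the handling of words shorter than 4 characters are assumptions.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word, in order."""
    if len(word) <= n:
        return [word]  # assumption: words of n characters or fewer are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("search"))  # ['sear', 'earc', 'arch']
```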
2.5 Evaluation Measures
Traditionally in ad hoc retrieval experiments, the primary evaluation measure is “average precision”.
For a topic, it is the average of the precision after each relevant document is retrieved (using
zero as the precision for relevant documents which are not retrieved). By convention, it is based on
the first 1000 retrieved documents for the topic. The score ranges from 0.0 (no relevants found) to
1.0 (all relevants found at the top of the list). Average precision takes into account both precision
and recall, and it is very good for detecting retrieval differences because even small differences in
the ranks of relevant documents affect the score. “Mean Average Precision” (MAP) is the mean of
the average precision scores over all of the topics (i.e. all topics are weighted equally).
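The definition above can be written out as a short Python sketch. The function names and the made-up document identifiers are ours for illustration; the sketch assumes each document appears at most once in a ranking and is not the official evaluation software.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision for one topic, based on the first `cutoff` retrieved rows."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("topic has no relevant documents")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision after this relevant document
    # Relevant documents that were never retrieved contribute zero precision,
    # so the sum is divided by the total number of relevant documents.
    return precision_sum / len(relevant)

def mean_average_precision(topics):
    """topics: list of (ranked_ids, relevant_ids) pairs; all topics weigh equally."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
print(round(ap, 3))  # 0.333
```

In the example, the relevant documents found at ranks 2 and 4 each contribute a precision of 0.5, and the unretrieved third relevant document contributes zero, giving (0.5 + 0.5 + 0) / 3.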
If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal
Rank” (RR). For a topic, it is 1/r where r is the rank of the first row for which a desired page is
found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of
the reciprocal ranks over all the topics.
An experimental measure introduced in this paper (along with the companion web retrieval
paper [12]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the
rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is
1.08^(1−r) where r is the rank of the first row for which a desired page is found, or zero if a desired
page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of
1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank
3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS
is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53
and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of
potential result list reading the system saved the user to get to the first relevant, assuming that
users are less and less likely to continue reading as they get deeper into the result list.
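The two first-relevant measures can be compared directly with a small Python sketch. The function names are ours, and using `None` to represent a topic for which no relevant document was retrieved is an assumption of this illustration.

```python
def reciprocal_rank(r):
    """1/r for a first relevant document at rank r; 0 if none was retrieved."""
    return 0.0 if r is None else 1.0 / r

def first_relevant_score(r):
    """1.08**(1 - r) for a first relevant document at rank r; 0 if none was retrieved."""
    return 0.0 if r is None else 1.08 ** (1 - r)

# Rank, FRS and RR for the ranks discussed in the text.
for r in (1, 2, 3, 10):
    print(r, round(first_relevant_score(r), 2), round(reciprocal_rank(r), 2))

# Last rank at which FRS still exceeds RR.
print([r for r in range(2, 101) if first_relevant_score(r) > reciprocal_rank(r)][-1])  # 52
```

The loop reproduces the values quoted above (0.93 vs. 0.50 at rank 2, 0.86 vs. 0.33 at rank 3, and 0.50 vs. 0.10 at rank 10).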
“Success@n” is the percentage of topics for which at least one relevant document was returned
in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval
differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s
impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and
Success@10.
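Success@n follows directly from the rank of the first relevant document for each topic. The Python sketch below uses made-up ranks for five topics; the function name and the use of `None` for topics with no relevant document retrieved are assumptions of the illustration.

```python
def success_at_n(first_relevant_ranks, n):
    """Fraction of topics whose first relevant document is ranked within the top n.

    first_relevant_ranks holds one entry per topic: the rank of the first
    relevant document, or None if no relevant document was retrieved.
    """
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= n)
    return hits / len(first_relevant_ranks)

ranks = [1, 4, None, 12, 2]  # made-up ranks for five topics
print(success_at_n(ranks, 1), success_at_n(ranks, 5), success_at_n(ranks, 10))
# 0.2 0.6 0.6
```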
2.6 Statistical Significance Tables
For tables comparing 2 diagnostic runs (such as Table 3), the columns are as follows:
• “Expt” specifies the experiment. The language code is given, followed by the labels of the
2 runs being compared. The difference is the first run minus the second run. For example,
“FR lex-none” specifies the difference of subtracting the scores of the French ‘none’ run from
the French ‘lex’ run (of Table 2).
• “∆MAP” is the difference of the mean average precision scores of the two runs being compared
(and “∆FRS” is the difference of the (mean) FRS scores).
• “95% Conf” is an approximate 95% confidence interval for the difference (calculated from
plus/minus twice the standard error of the mean difference). If zero is not in the interval,
the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of
neutral impact (on average), though if the average difference is small (e.g. <0.020) it may
still be too minor to be considered “significant” in the magnitude sense.
• “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
compared to the second run. These numbers should always add to the number of topics (49
for Bulgarian, 50 for the others).
• “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the
topic number in brackets (the topic numbers range from 251 to 300). The first difference
is the largest one of any topic (based on the absolute value). The third difference is the
largest difference in the other direction (so the first and third differences give the range of
differences observed in this experiment). The middle difference is the largest of the remaining
differences (based on the absolute value).
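The columns described above can be derived from two lists of per-topic scores as in the following Python sketch. It is an illustration under stated assumptions rather than the code used to produce the tables: the function name `compare_runs` is hypothetical, the standard error is taken from the sample standard deviation (n − 1), at least 3 topics are assumed, and the example scores are made up.

```python
import math
import statistics

def compare_runs(scores_a, scores_b, topic_ids):
    """Summarize per-topic score differences (first run minus second run)."""
    diffs = [(a - b, t) for a, b, t in zip(scores_a, scores_b, topic_ids)]
    values = [d for d, _ in diffs]
    mean_diff = statistics.mean(values)
    # Assumption: standard error from the sample standard deviation (n - 1).
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    conf = (mean_diff - 2 * std_err, mean_diff + 2 * std_err)

    higher = sum(1 for v in values if v > 0)
    lower = sum(1 for v in values if v < 0)
    tied = len(values) - higher - lower

    # First: largest absolute difference. Third: extreme in the other direction.
    # Middle: largest absolute difference among the remaining topics.
    first = max(diffs, key=lambda p: abs(p[0]))
    rest = [p for p in diffs if p is not first]
    if first[0] >= 0:
        third = min(rest, key=lambda p: p[0])
    else:
        third = max(rest, key=lambda p: p[0])
    rest = [p for p in rest if p is not third]
    middle = max(rest, key=lambda p: abs(p[0]))

    return {
        "mean_diff": mean_diff,
        "conf95": conf,
        "significant": not (conf[0] <= 0.0 <= conf[1]),
        "vs": f"{higher}-{lower}-{tied}",
        "extremes": [first, middle, third],
    }

row = compare_runs([0.5, 0.2, 0.9, 0.4], [0.1, 0.3, 0.9, 0.2], [251, 252, 253, 254])
print(row["vs"], [topic for _, topic in row["extremes"]], row["significant"])
```

With these made-up scores the first run is higher on 2 topics, lower on 1 and tied on 1, and the interval contains zero, so the difference would not be reported as statistically significant.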
3 Results of Morphological Experiments
In the per-topic analysis, the official topic translations were used as much as possible. Online
translation services were consulted at times ([5] was sometimes helpful for Hungarian, and we
found the Russian-to-English translations at [1] often worked for Bulgarian). Prof. Savoy also
assisted with some Bulgarian words. But any translation errors are the responsibility of the
author.
3.1 Impact of Stemming
Table 3 isolates the impact of stemming on the average precision measure (e.g. “FR lex-none” is the
difference of the “FR-lex” and “FR-none” runs of Table 2). For each of the 4 languages, the increase
in mean average precision was statistically significant (i.e. zero was not in the approximate 95%
confidence interval). In FRS, there was higher variance, and only the increase for Hungarian was
statistically significant. Note that for some queries, it was still better to only match the original
query form (not variations from stemming); SearchServer allows this option to be controlled for
each query term at search-time.
Table 3 shows that topic 279 (Swiss referendums) was substantially affected by stemming for
all 4 languages, so we examine it for each language:
• HU-279 (Svájci népszavazások): Without Hungarian stemming, no document contained both
of the query terms. No relevant document contained the query word ‘népszavazások’. Only
some of the relevant documents even contained ‘Svájci’ (and lots of non-relevants also did).
With stemming, average precision was 87 points higher from extra matches such as ‘svájciak’,
‘Svájc’, ‘Svájcban’, ‘Svájcot’, ‘Svájcról’, ‘népszavazáson’, ‘népszavazás’, ‘népszavazást’ and
‘népszavazással’.
• BG-279 (Референдуми в Швейцария): With Bulgarian stemming, average precision was 58
points higher from extra matches for ‘referendums’ such as референдум and референдума.
• PT-279 (Referendos suíços): The query word ‘suíços’ was common in the relevant documents,
but many relevants just used ‘referendo’ and not the query word ‘referendos’. Average
precision was 35 points higher with Portuguese stemming; extra matches included ‘referendo’,
‘suíço’, ‘suíça’ and ‘suíças’.
• FR-279 (Référendums en Suisse): This French topic scored lower with stemming (the rank
of the first relevant fell from 1 to 13, and average precision fell from 0.10 to 0.01). It
appears that the relevant documents were more likely to use the plural ‘Référendums’ than
the singular ‘Référendum’, and the latter was a more common word which generated lots of
matches when stemming.
Table 3: Impact of Stemming on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
HU-neu-none 0.090 ( 0.038, 0.143) 32-11-7 0.87 (279), 0.77 (294), −0.12 (265)
FR-lex-none 0.070 ( 0.028, 0.112) 29-16-5 0.53 (297), 0.45 (284), −0.12 (275)
BG-neu-none 0.064 ( 0.005, 0.123) 29-15-5 0.90 (271), 0.58 (279), −0.50 (258)
PT-lex-none 0.054 ( 0.027, 0.080) 34-13-3 0.35 (279), 0.30 (286), −0.09 (296)
Expt ∆FRS 95% Conf vs. 3 Extreme Diffs (Topic)
HU-neu-none 0.117 ( 0.024, 0.209) 19-10-21 1.00 (271), 0.98 (294), −0.83 (262)
BG-neu-none 0.064 (−0.042, 0.170) 16-17-16 0.96 (294), 0.86 (269), −0.87 (273)
PT-lex-none 0.035 (−0.017, 0.087) 12-7-31 0.69 (263), 0.60 (254), −0.54 (282)
FR-lex-none 0.033 (−0.032, 0.097) 15-8-27 0.73 (276), 0.64 (284), −0.60 (279)
Table 4: Impact of /nostop Option on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
HU-nos-neu 0.013 (−0.005, 0.031) 3-1-46 0.40 (292), 0.13 (265), −0.03 (282)
BG-nos-neu 0.005 (−0.003, 0.012) 2-2-45 0.17 (273), 0.06 (267), −0.01 (257)
FR-nos-lex 0.000 n/a 0-0-50 0.00 (276), 0.00 (252), 0.00 (300)
PT-nos-lex 0.000 n/a 0-0-50 0.00 (276), 0.00 (252), 0.00 (300)
Expt ∆FRS 95% Conf vs. 3 Extreme Diffs (Topic)
BG-nos-neu 0.031 (−0.010, 0.072) 3-1-45 0.80 (273), 0.57 (267), −0.05 (257)
HU-nos-neu 0.001 (−0.014, 0.015) 1-1-48 0.26 (292), 0.00 (253), −0.23 (282)
3.2 Impact of Experimental /nostop Option
Table 4 isolates the impact of using the SearchServer /nostop option. The option had no effect on
the 50 French and Portuguese topics, and it affected only a few of the Bulgarian and Hungarian
topics. The /nostop option prevents query terms from being discarded if all of their stems are
stopwords (note that stopwords themselves are still not found because they are not indexed). The
default is to not use /nostop because past experiments otherwise found a lot of spurious matches
in some languages (such as Finnish and Korean). We investigate some of the topics flagged in
Table 4:
• HU-265 (A Deutsche Bank szerzeményei (Deutsche Bank Takeovers)): The query word
‘Bank’ stemmed to ‘ban’ (in) which was a stopword, so by default, the word ‘Bank’ was
not matched in the documents. With the /nostop option, ‘Bank’ was matched and average
precision was 13 points higher. (Incidentally, this issue is presumably why Table 3 shows
that stemming scored 12 points lower on HU-265; without stemming, ‘Bank’ was found in
the documents.) Perhaps this issue would not have arisen with a lexical stemmer which
would preserve the meaning more closely.
• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): The query word ‘Német’
(German) stemmed to ‘nem’ (not) which was a stopword and so this useful word was dropped
from the query by default. With the /nostop option, average precision was 40 points higher.
• HU-282 (Elítéltekkel szembeni durva bánásmód (Prison Abuse)): In this topic, the default
scored higher. Using /nostop changed the rank of the first relevant from 3 to 7. The
stopword list contained ‘szemben’ (in front of), and the query word ‘szembeni’ presumably
is a related noise word, and discarding it was useful. The /nostop option kept ‘szembeni’,
which only occurred in 319 documents, so it had a high enough weighting from inverse
document frequency to hurt precision.
• BG-273 (Разширяването на НАТО (NATO Expansion)): НАТО (NATO) stemmed to НА
(on) which was a stopword, so the default behaviour removed a key word from the query.
With /nostop, the first relevant score was 80 points higher.
• BG-267 (Най-добрите чуждоезикови филми (Best Foreign Language Films)): The query
word филми (films) stemmed to филм (film) which surprisingly was a stopword, so the
default behaviour discarded a key query term. Our supplier [9] has confirmed that this was
an error in the Bulgarian stopword list.
• BG-257 (Етническото прочистване на Балканите (Ethnic Cleansing in the Balkans)): The
query word Балканите (Balkans) stemmed to балкан (Balkan mountain) which surprisingly
was a stopword. Even though it turned out that precision was a little higher without the
Balkans term in this case, in general this appears to be another error in the stopword list.
In the topics we examined, in 3 cases the default behaviour of dropping useful terms may have
been from the stemmers for Bulgarian and Hungarian being algorithmic instead of lexical (a lexical
stemmer typically does not change the meaning of a word, except when words are ambiguous). It
appears for algorithmic stemmers it may be better to use the /nostop option by default.
In another 2 cases, it appears the stoplist was in error, which illustrates the usefulness of the
CLEF judged test collections: they enable an analyst who does not understand a language to find
issues in a resource for the language and make inferences about its quality.
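The query-side effect of the /nostop option in this section can be modelled with a toy Python sketch. The stem table below is a lookup built from the Hungarian examples above and stands in for the algorithmic stemmer; the names `TOY_STEMS`, `STOPWORDS` and `query_stems` are hypothetical, and the sketch does not model the index itself.

```python
# Toy data taken from the Hungarian examples above; the stemmer is a lookup
# table standing in for the algorithmic stemmer, not a real implementation.
TOY_STEMS = {"bank": {"ban"}, "német": {"nem"}, "deutsche": {"deutsche"}}
STOPWORDS = {"ban", "nem"}

def query_stems(query_words, nostop=False):
    """Return the stems searched for each query word (empty set = term dropped)."""
    result = {}
    for word in query_words:
        stems = TOY_STEMS.get(word.lower(), {word.lower()})
        if not nostop and stems <= STOPWORDS:
            # Default behaviour: discard a term whose stems are all stopwords.
            stems = set()
        result[word] = stems
    return result

print(query_stems(["Deutsche", "Bank"]))               # 'Bank' is dropped
print(query_stems(["Deutsche", "Bank"], nostop=True))  # 'Bank' is kept via stem 'ban'
```

In the sketch, as in the experiments, keeping the term only helps because documents containing ‘Bank’ were indexed under its stem; occurrences of the stopword itself are still not indexed.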
3.3 Impact of Indexing All Words
Table 5 isolates the impact of indexing all words (i.e. of not using a stopword list). None of
the mean differences were statistically significant, but Bulgarian and Hungarian had some large
per-topic differences in average precision which we investigate:
• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): We saw earlier that this
topic benefitted from the /nostop option (average precision up 40 points), but when indexing
all words, average precision fell back (33 points). The reason was that the common
word ‘nem’ (not) was now indexed, so ‘Német’ (German), which stems to ‘nem’ with the
algorithmic stemmer, had a much lower inverse document frequency than before, and this
useful word received less weight. (Even if it had received more weight, there would have
been potential confusion with all the indexed occurrences of ‘nem’.)
• BG-271 (Бракове между хомосексуални (Gay Marriages)): The stopword между (between)
was not in the 2 relevant documents. When it was indexed, its inclusion caused some non-
relevants to be preferred, and average precision dropped 55 points.
• BG-295 (Пране на пари (Money Laundering)): This topic scored higher when indexing all
words. Surprisingly, the word пари (money) was a stopword, presumably another error (the
Bulgarian stoplist apparently needs a review). It seems fine that на (on) was a stopword.
Table 5: Impact of Indexing All Words on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
PT-all-nos −0.000 (−0.003, 0.002) 18-17-15 0.03 (280), −0.01 (259), −0.02 (282)
FR-all-nos −0.001 (−0.005, 0.003) 24-17-9 −0.07 (262), 0.01 (290), 0.01 (289)
HU-all-nos −0.006 (−0.021, 0.008) 7-7-36 −0.33 (292), −0.05 (265), 0.05 (274)
BG-all-nos −0.008 (−0.034, 0.018) 16-17-16 −0.55 (271), −0.14 (268), 0.20 (295)
Expt ∆FRS 95% Conf vs. 3 Extreme Diffs (Topic)
PT-all-nos 0.009 (−0.007, 0.025) 5-1-44 0.38 (282), 0.06 (263), −0.07 (291)
BG-all-nos 0.001 (−0.008, 0.010) 3-4-42 0.13 (263), −0.07 (268), −0.07 (271)
FR-all-nos −0.000 (−0.009, 0.008) 4-4-42 0.10 (286), −0.09 (258), −0.09 (288)
HU-all-nos −0.000 (−0.010, 0.009) 1-3-46 0.16 (282), −0.04 (299), −0.14 (292)
Table 6: 4-grams vs. Stems in Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
HU-4gr-all 0.060 ( 0.018, 0.103) 32-17-1 0.46 (255), 0.33 (292), −0.30 (283)
BG-4gr-all 0.009 (−0.028, 0.046) 25-24-0 0.50 (258), 0.25 (254), −0.33 (285)
FR-4gr-all −0.021 (−0.048, 0.005) 18-31-1 0.25 (291), 0.22 (263), −0.20 (273)
PT-4gr-all −0.068 (−0.104,−0.032) 14-35-1 −0.43 (259), −0.28 (286), 0.22 (297)
Expt ∆FRS 95% Conf vs. 3 Extreme Diffs (Topic)
HU-4gr-all 0.046 (−0.036, 0.128) 15-15-20 1.00 (286), 0.93 (261), −0.81 (251)
FR-4gr-all −0.001 (−0.041, 0.039) 13-15-22 0.60 (279), 0.26 (281), −0.40 (259)
BG-4gr-all −0.024 (−0.093, 0.045) 17-14-18 −0.82 (274), 0.56 (270), 0.59 (288)
PT-4gr-all −0.051 (−0.134, 0.032) 7-17-26 −1.00 (259), −0.83 (292), 0.96 (260)
In practice, indexing all words may not be so troublesome because it is typically easy for users
to omit noise words from the query, and stemming issues can be worked around by disabling the
finding of word variants (SearchServer makes it optional at search-time).
3.4 Comparison to 4-grams
Compound words appear to be fairly common in Hungarian, but the algorithmic stemmer did not
perform decompounding, a technique we have found to be useful for languages such as Finnish [15].
However, [4] has found that using 4-grams as index terms works well in ad hoc ranking experiments
for many European languages, including compound-word languages. Table 6 compares our 4-gram
runs to the stemming runs which indexed all words (because we did not use stopwords with our 4-
gram index). As anticipated, there was a statistically significant increase in mean average precision
for Hungarian, though there was a decrease for Portuguese which was also statistically significant.
We look at the largest per-topic differences for Hungarian:
• HU-255 (Internetfüggők (Internet Junkies)): Average precision was 46 points higher with
4-grams for this topic (a compound word). The stemmer found the 3 relevant documents
which contained ‘internetfüggő’ or the original query word ‘internetfüggők’. 4-grams matched
other variants such as ‘Internetfüggőség’ (Internet dependence), ‘internetfüggőséggel’ and
‘internetfüggőségben’ and found all 6 relevant documents. 4-grams also matched other
potentially helpful words such as ‘internet’, ‘internetezők’, ‘internetezés’, ‘komputerfüggőséget’
and ‘függővé’. But 4-grams also produced unwanted matches, such as ‘intervallum’ (interval)
and ‘Szinte’ (as good as); these both came from the 4-gram ‘inte’. If the stemmer had
just additionally matched ‘Internetfüggőség’, all 6 relevants would have been found, but we’re still
investigating if the -seg suffix is one that a Hungarian stemmer should generally remove or
not.
• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): On this topic, 4-grams
still just found 1 of the 2 relevant documents, but it moved it from rank 3 to 1 (compared to
the stemming run). While 4-grams additionally matched ‘újjáépítik’, the bigger advantage
was probably that the 4-gram method did not match ‘nem’ which we know from earlier was
a troublesome match for the stemming run.
• HU-283 (James Bond-filmek (James Bond Films)): On this topic, the 4-gram run scored 30
points lower in average precision than the stemming run. The 4-gram run favored documents
with the ‘filmek’ pattern (which corresponded to three 4-grams (‘film’, ‘ilme’ and ‘lmek’)
and so it received roughly 3 times the weight compared to the stemming run). However,
the relevant documents tended not to use ‘filmek’; instead they tended to use other variants
matched by the stemmer such as ‘film’, ‘filmet’, ‘filmnél’, ‘filmben’ and ‘filmhez’.
• HU-286 (Futballsérülések (Football Injuries)): This topic had no matches in the stemming
run, but a relevant document was ranked first in the 4-gram run. 4-gram matches in the
relevant documents included ‘futballista’, ‘futballkapus’ (goalkeeper), ‘futballválogatott’,
‘vállsérülést’, ‘vállsérüléssel’, ‘vállsérülés’, ‘sérülés’ (injury), ‘sérült’ and ‘sérültet’. This
might be a case for which decompounding would be helpful.
• HU-261 (Jövendőmondás (Fortune-telling)): The stemming run only matched the one doc-
ument which contained ‘jövendőmondást’ and ‘jövendőmondás’ and it was judged non-
relevant, so it scored 0 on this topic. The 4-gram returned 1 of the 3 relevant documents at
rank 2 (the others weren’t ranked in the top 100). Matches in the relevant document included
‘jövendölők’ and ‘jövendőmondók’. The latter of these perhaps could have been matched with
additional stemming rules, but the former would require a stemmer to do decompounding
(or, if the user had decompounded the query, the latter would require index-time decompounding
to match).
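The wanted and unwanted 4-gram matches discussed in these topics can be reproduced with a short Python sketch that intersects the 4-grams of a query word and a document word. The function names are ours, and lower-casing before n-gramming is an assumption standing in for the index normalizations.

```python
def char_ngrams(word, n=4):
    """Set of overlapping character n-grams of a lower-cased word."""
    word = word.lower()
    # Words shorter than n yield a single n-gram consisting of the whole word.
    return {word[i:i + n] for i in range(max(len(word) - n + 1, 1))}

def shared_ngrams(query_word, doc_word, n=4):
    """n-grams that a query word and a document word have in common."""
    return char_ngrams(query_word, n) & char_ngrams(doc_word, n)

query = "Internetfüggők"
for doc_word in ["Internetfüggőség", "Szinte", "futball"]:
    print(doc_word, sorted(shared_ngrams(query, doc_word)))

# 'filmek' corresponds to three 4-grams: 'film', 'ilme' and 'lmek'.
print(len(char_ngrams("filmek")))  # 3
```

The variant ‘Internetfüggőség’ shares many 4-grams with the query word, while ‘Szinte’ shares only ‘inte’, which shows how a single common 4-gram can produce an unwanted match.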
SearchServer can find character sequences inside European words without n-gramming if the
user specifies wildcards, so for precise searches it’s unclear if n-gram indexes would add value.
N-gram approaches typically produce larger indexes and their queries can be slower for common
word-searching cases. We’re not aware of them being used in practice for European language
retrieval, except perhaps by web search engines for url indexing.
3.5 Comparison to Alternate Stemmers
Table 7 compares alternate stemming approaches to the approach we used in our submitted runs.
Unfortunately, we have run out of time to examine more topics in detail for this draft paper, but we
note in particular that it seems not to matter very much on average whether the remove_possessive
function of the Hungarian stemmer is called or not.
3.6 Impact of /single Option
Table 8 isolates the impact of using the SearchServer /single option. This option only makes
a difference for the SearchServer lexical stemmers which can produce more than one stem for a
term. Like last year [15], our method for including all stems without overweighting some of the
terms apparently was effective. Even in the high-variance first relevant score measure, the bigger
differences favored including all stems.
Table 7: Alternate Stemming vs. Baseline in Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
FR-sn-lex 0.017 ( 0.001, 0.032) 20-16-14 0.29 (291), 0.15 (287), −0.08 (278)
HU-poss-neu −0.003 (−0.017, 0.012) 18-9-23 −0.27 (268), 0.11 (258), 0.13 (262)
BG-snru-neu −0.017 (−0.064, 0.029) 19-25-5 −0.64 (259), −0.44 (271), 0.50 (258)
PT-sn-lex −0.031 (−0.060,−0.001) 21-23-6 −0.41 (279), −0.28 (286), 0.21 (274)
∆FRS
PT-sn-lex 0.036 (−0.024, 0.096) 10-8-32 0.96 (260), 0.49 (300), −0.59 (292)
FR-sn-lex 0.010 (−0.005, 0.025) 7-7-36 0.19 (252), 0.16 (299), −0.12 (251)
BG-snru-neu 0.008 (−0.070, 0.086) 14-13-22 0.87 (273), 0.84 (270), −0.79 (280)
HU-poss-neu −0.019 (−0.078, 0.040) 4-5-41 −0.95 (265), −0.68 (270), 0.69 (262)
Table 8: Impact of /single Option on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
FR-sing-lex −0.002 (−0.011, 0.007) 8-7-35 −0.15 (297), −0.10 (284), 0.11 (263)
PT-sing-lex −0.010 (−0.018,−0.002) 8-11-31 −0.10 (292), −0.10 (275), 0.02 (298)
∆FRS
FR-sing-lex −0.009 (−0.025, 0.008) 1-2-47 −0.40 (259), −0.06 (284), 0.03 (299)
PT-sing-lex −0.013 (−0.037, 0.011) 1-3-46 −0.59 (292), −0.07 (275), 0.06 (267)
Table 9: Mean Scores of Submitted Runs
Run FRS Success@1 Success@5 Success@10 MRR MAP
humBG05t 0.749 15/49 (31%) 35/49 (71%) 39/49 (80%) 0.476 0.259
humBG05td 0.815 18/49 (37%) 39/49 (80%) 42/49 (86%) 0.537 0.275
humBG05tde 0.752 21/49 (43%) 35/49 (71%) 38/49 (78%) 0.549 0.298
humFR05t 0.810 25/50 (50%) 39/50 (78%) 42/50 (84%) 0.618 0.302
humFR05td 0.825 30/50 (60%) 39/50 (78%) 41/50 (82%) 0.686 0.369
humFR05tde 0.822 31/50 (62%) 40/50 (80%) 41/50 (82%) 0.697 0.401
humHU05t 0.788 25/50 (50%) 37/50 (74%) 42/50 (84%) 0.613 0.274
humHU05td 0.838 23/50 (46%) 41/50 (82%) 43/50 (86%) 0.614 0.306
humHU05tde 0.835 22/50 (44%) 38/50 (76%) 45/50 (90%) 0.602 0.331
humPT05t 0.856 31/50 (62%) 42/50 (84%) 45/50 (90%) 0.714 0.300
humPT05td 0.939 35/50 (70%) 48/50 (96%) 49/50 (98%) 0.805 0.357
humPT05tde 0.925 35/50 (70%) 47/50 (94%) 48/50 (96%) 0.799 0.386
4 Submitted Runs
Table 9 lists the mean scores of the runs submitted for assessment in May 2005. In the identifiers
(e.g. “humFR05tde”), ‘t’ and ‘d’ indicate that the Title and Description fields of the topic were used
(respectively), and ‘e’ indicates that query expansion from blind feedback on the first 2 rows was
used (see the 2003 paper [14] for more details). From the Description fields for Bulgarian, French
and Portuguese, instruction words such as “find”, “relevant” and “document” were automatically
removed (based on looking at some older topic lists, not this year’s topics; this step was skipped
for Hungarian because we lacked an older topic list).
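The instruction-word removal step can be pictured with the following minimal sketch. The word list is an English illustration built from the examples above (the plural “documents” is our own addition), and the function name strip_instruction_words is hypothetical; the actual lists were language-specific and are not reproduced here.

```python
# Illustrative English list; the real lists were per-language (assumption).
INSTRUCTION_WORDS = {"find", "relevant", "document", "documents"}

def strip_instruction_words(description, stop=INSTRUCTION_WORDS):
    """Drop instruction words from a topic Description, keeping other tokens."""
    kept = []
    for token in description.split():
        # Compare without trailing punctuation and without regard to case.
        bare = token.strip(".,;:!?").lower()
        if bare not in stop:
            kept.append(token)
    return " ".join(kept)

print(strip_instruction_words("Find documents about fortune-telling."))
```

Because the list is derived from older topics rather than the current ones, a step like this does not tune the query to this year's topic wording.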
The submitted French and Portuguese Title-only runs (i.e. “humFR05t” and “humPT05t” of
Table 9) correspond to the “lex” diagnostic runs (i.e. “FR-lex” and “PT-lex” of Table 2) except that
the submitted runs used an older experimental version of SearchServer (though there don’t appear
to have been any differences that affected the runs). The submitted Bulgarian and Hungarian
Title-only runs (i.e. “humBG05t” and “humHU05t” of Table 9) correspond to the “neu” diagnostic
runs (i.e. “BG-neu” and “HU-neu” of Table 2).
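The Success@n and MRR columns of Table 9 depend only on the rank of the first relevant document retrieved for each topic. The sketch below uses the standard definitions of these two measures; the function names and the example ranks are hypothetical, and the first relevant score (FRS) is not reproduced here because its definition is given elsewhere in the paper.

```python
def success_at(first_rel_ranks, n):
    """Fraction of topics whose first relevant document is at rank <= n.

    first_rel_ranks: one entry per topic; None means no relevant document
    was retrieved for that topic.
    """
    hits = sum(1 for r in first_rel_ranks if r is not None and r <= n)
    return hits / len(first_rel_ranks)

def mean_reciprocal_rank(first_rel_ranks):
    """Mean of 1/rank of the first relevant document (0 if none retrieved)."""
    total = sum(1.0 / r for r in first_rel_ranks if r is not None)
    return total / len(first_rel_ranks)

# Hypothetical ranks of the first relevant document for four topics.
ranks = [1, 2, None, 5]
print(success_at(ranks, 1), success_at(ranks, 5), mean_reciprocal_rank(ranks))
```

Both measures ignore everything below the first relevant document, which is why they can move in a different direction from mean average precision.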
4.1 Impact of Adding the Description Field
Table 10 isolates the impact of adding the Description field to the query. Though adding the
Description tended to increase the scores on average (and in some cases this result was statistically
significant), one should keep in mind that the Description often repeated the Title words, which
hence received twice the weight in the combined query. We would expect to see more variance if
the Title were replaced by the Description instead of being augmented by it.
4.2 Impact of Blind Feedback
Table 11 isolates the impact of the blind feedback technique (based on using the first 2 returned
rows to expand the query). While mean average precision increased for all 4 languages (and the
increase was statistically significant for 3 of them: French, Portuguese and Hungarian), the first
relevant score decreased for all 4 languages (and the decrease was statistically significant for the
remaining one, Bulgarian).
The blind feedback technique presumably works best if relevant documents appear in the first
2 rows, in which case first relevant score cannot be improved. If the first 2 rows do not contain
relevant documents, then using those rows to expand the query may hurt the query and push
down the first relevant even further.
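The mechanism can be illustrated with a minimal sketch of blind feedback, assuming a simple scheme in which the most frequent new terms from the first 2 returned rows are appended to the query. This is not SearchServer's expansion method (see [14] for that); the function name expand_query, the term-selection rule and the example documents are all hypothetical.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_rows=2, max_new_terms=5):
    """Append the most frequent new terms from the top-ranked rows.

    ranked_docs: list of token lists, best match first (hypothetical input).
    """
    seen = set(query_terms)
    counts = Counter(
        term
        for doc in ranked_docs[:top_rows]
        for term in doc
        if term not in seen
    )
    new_terms = [term for term, _ in counts.most_common(max_new_terms)]
    return list(query_terms) + new_terms

# Hypothetical initial ranking for the query "wine".
docs = [
    ["wine", "harvest", "france", "harvest"],
    ["wine", "harvest", "bordeaux"],
    ["beer"],
]
print(expand_query(["wine"], docs, max_new_terms=1))
```

The expansion terms come only from the top rows, whether or not those rows are relevant, so a non-relevant top 2 feeds off-topic terms back into the query.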
This result may explain in part why blind feedback techniques are not known to be used
in practice even though they have been popular with experimenters for several years in ad hoc
evaluations (which typically focus on mean average precision).
Table 10: Impact of Description on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
FR-td-t 0.068 ( 0.030, 0.105) 35-14-1 0.61 (256), 0.33 (281), −0.18 (277)
PT-td-t 0.057 ( 0.008, 0.107) 31-18-1 −0.45 (258), 0.34 (299), 0.34 (264)
HU-td-t 0.031 (−0.002, 0.065) 33-17-0 0.33 (286), 0.31 (290), −0.23 (274)
BG-td-t 0.016 (−0.034, 0.066) 29-19-1 −0.68 (271), −0.38 (277), 0.30 (294)
∆FRS
PT-td-t 0.083 ( 0.005, 0.160) 18-10-22 1.00 (272), 0.86 (288), −0.50 (258)
BG-td-t 0.065 (−0.010, 0.141) 23-15-11 0.80 (273), 0.70 (286), −0.53 (278)
HU-td-t 0.049 (−0.026, 0.125) 16-16-18 1.00 (286), 0.93 (261), −0.59 (282)
FR-td-t 0.014 (−0.033, 0.062) 17-10-23 0.74 (282), −0.36 (257), −0.54 (292)
Table 11: Impact of Blind Feedback on Average Precision and First Relevant Score
Expt ∆MAP 95% Conf vs. 3 Extreme Diffs (Topic)
FR-tde-td 0.031 ( 0.015, 0.047) 34-16-0 0.17 (273), 0.16 (290), −0.07 (268)
PT-tde-td 0.029 ( 0.005, 0.053) 34-16-0 0.30 (290), 0.20 (275), −0.24 (274)
HU-tde-td 0.025 ( 0.003, 0.047) 31-17-2 0.29 (254), 0.18 (290), −0.18 (279)
BG-tde-td 0.023 (−0.002, 0.048) 29-18-2 0.50 (272), 0.14 (254), −0.10 (277)
∆FRS
FR-tde-td −0.002 (−0.041, 0.036) 10-6-34 −0.58 (282), −0.34 (272), 0.42 (252)
HU-tde-td −0.003 (−0.037, 0.032) 7-6-37 −0.39 (298), −0.37 (300), 0.38 (269)
PT-tde-td −0.014 (−0.038, 0.010) 6-7-37 −0.50 (258), −0.16 (277), 0.07 (269)
BG-tde-td −0.062 (−0.109,−0.016) 9-16-24 −0.63 (277), −0.50 (299), 0.13 (296)
References
[1] AltaVista’s Babel Fish Translation Service. http://babelfish.altavista.com/tr
[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/
[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. Sixteenth International
Unicode Conference, 2000.
[4] Paul McNamee and James Mayfield. JHU/APL Experiments in Tokenization and Non-Word
Translation. Working Notes for the CLEF 2003 Workshop, 2003.
[5] MTA SZTAKI: English-Hungarian, Hungarian-English Online Dictionary.
http://dict.sztaki.hu/english-hungarian
[6] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page.
http://research.nii.ac.jp/~ntcadm/index-en.html
[7] M. F. Porter. Snowball: A language for stemming algorithms. October 2001.
http://snowball.tartarus.org/texts/introduction.html
[8] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu and M. Gatford. Okapi at
TREC-3. Proceedings of TREC-3, 1995.
[9] Jacques Savoy. CLEF and Multilingual information retrieval resource page.
http://www.unine.ch/info/clef/
[10] Börkur Sigurbjörnsson, Jaap Kamps and Maarten de Rijke. Overview of WebCLEF 2005. To
appear in Working Notes for the CLEF 2005 Workshop, 2005.
[11] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/
[12] Stephen Tomlinson. European Web Retrieval Experiments with Hummingbird
SearchServerTM at CLEF 2005. To appear in Working Notes for the CLEF 2005 Workshop,
2005.
[13] Stephen Tomlinson. Experiments in 8 European Languages with Hummingbird
SearchServerTM at CLEF 2002. Proceedings of CLEF 2002, 2003.
[14] Stephen Tomlinson. Lexical and Algorithmic Stemming Compared for 9 European Languages
with Hummingbird SearchServerTM at CLEF 2003. Working Notes for the CLEF 2003 Workshop,
2003.
[15] Stephen Tomlinson. Finnish, Portuguese and Russian Retrieval with Hummingbird
SearchServerTM at CLEF 2004. Working Notes for the CLEF 2004 Workshop, 2004.