Hummingbird Ottawa, Ontario, Canada stephen.tomlinson@hummingbird.com http://www.hummingbird.com/

European Ad Hoc Retrieval Experiments with Hummingbird SearchServer™ at CLEF 2005

Stephen Tomlinson


Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgarian, French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50 natural language queries, and the goal was to find all of the relevant documents (with high precision) in a particular document set. We conducted diagnostic experiments with different techniques for matching word variations and handling stopwords. We found that the experimental stemmers significantly increased mean average precision for the 4 languages. Analysis of individual topics found that the algorithmic Bulgarian and Hungarian stemmers encountered some unanticipated stopword collisions. A comparison to an experimental 4-gram technique suggested that Hungarian stemming would further benefit from decompounding. A blind feedback technique which significantly increased mean average precision for some languages was also significantly detrimental to the rank of the first relevant retrieved for one language.

Keywords: Bulgarian Retrieval, Hungarian Retrieval, First Relevant Score, Per-Topic Analysis


Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications. The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.

SearchServer works in Unicode internally [3] and supports most of the world’s major character sets and languages. The major conferences in text retrieval experimentation (CLEF [2], NTCIR [6] and TREC [11]) have provided judged test collections for objective experimentation with SearchServer in more than a dozen languages.

¹ SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other copyrights, trademarks and tradenames are the property of their respective owners.

[Table 1: per-language statistics for Portuguese, French, Bulgarian and Hungarian; values not recoverable.]

This (draft) paper describes experimental work with SearchServer for the task of finding relevant documents for natural language queries in 4 European languages (Bulgarian, French, Hungarian and Portuguese) using the CLEF 2005 Ad-Hoc Track test collections.

2 Methodology

2.1 Data

The CLEF 2005 Ad-Hoc Track document sets consisted of tagged (SGML-formatted) news articles in 4 different languages: Bulgarian, French, Hungarian and Portuguese. Table 1 gives the sizes.

The CLEF organizers created 50 natural language “topics” (numbered 251-300) and translated them into many languages. One topic was discarded for Bulgarian because it had no relevant documents. Table 1 gives the final number of topics for each language and their average number of relevant documents (along with the lowest, median and highest number of relevant documents of any topic). For more information on the CLEF test collections, see the track overview paper.

2.2 Indexing

Our indexing approach was mostly the same as last year [15]. Accents were not indexed except for the combining breve in Bulgarian. The apostrophe was treated as a word separator for the 4 investigated languages. The custom text reader, cTREC, was updated to maintain support for the CLEF guidelines of only indexing specifically tagged fields.

Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these experiments, the stop word list for Portuguese was based on the Porter list [7], and the lists for Bulgarian and Hungarian were based on Savoy’s [9]. We used our own list for French.

Unlike previous years, this year we added AL=“0-9” to the stopfiles to specify that the digits 0-9 were to be treated as alphabet characters (e.g. so that “G7” would be indexed as 1 term instead of 2).
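The effect of treating digits as alphabet characters, together with the apostrophe acting as a word separator, can be illustrated with a small Python sketch. This is not the SearchServer tokenizer: the function name `tokenize`, the regular expressions and the `digits_are_letters` flag are assumptions made for illustration, and accent handling is left out.

```python
import re

# Hypothetical patterns, not SearchServer's actual rules.
WORD_WITH_DIGITS = re.compile(r"[^\W_]+")         # letters and digits form one run
WORD_SPLIT_DIGITS = re.compile(r"[^\W\d_]+|\d+")  # letter runs and digit runs are separate terms

def tokenize(text, digits_are_letters=True):
    """Split text into upper-cased terms.

    Any non-alphanumeric character (including the apostrophe) acts as a
    word separator. When digits_are_letters is False, a digit run is
    emitted as its own term, which mimics the behaviour without AL="0-9".
    """
    pattern = WORD_WITH_DIGITS if digits_are_letters else WORD_SPLIT_DIGITS
    return [token.upper() for token in pattern.findall(text)]
```

With this sketch, `tokenize("G7 summit")` yields a single term for “G7”, whereas `tokenize("G7 summit", digits_are_letters=False)` yields separate terms for “G” and “7”.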

By default, the SearchServer index supports both exact matching (after some Unicode-based normalizations, such as decompositions and conversion to upper-case) and morphological matching (e.g. inflections, derivations and compounds, depending on the linguistic component used).

For many languages (including French and Portuguese), SearchServer provides the option of finding inflections based on lexical stemming (i.e. stemming based on a dictionary or lexicon for the language). For example, in English, “baby”, “babied”, “babies”, “baby’s” and “babying” all have “baby” as a stem. Specifying an inflected search for any of these terms will match all of the others. The lexical stemming of the post-6.0 experimental development version of SearchServer used for the experiments in this paper was based on internal stemming component 3.7.0.15. We treat each linguistic component as a black box in this paper.

Lexical stemming in SearchServer typically does “inflectional” stemming which generally retains the part of speech (e.g. a plural of a noun is typically stemmed to the singular form). It typically does not do “derivational” stemming which would often change the part of speech or the meaning more substantially (e.g. “performer” is not stemmed to “perform”).

Lexical stemming in SearchServer includes compound-splitting (decompounding) for compound words in particular languages (such as Dutch, Finnish, German and Swedish). For example, in German, “babykost” (baby food) has “baby” and “kost” as stems.

Lexical stemmers can produce more than one stem, even for non-compound words. For example, in English, “axes” has both “axe” and “axis” as stems (different meanings), and in French, “important” has both “important” (adjective) and “importer” (verb) as stems (different parts of speech). SearchServer records all the stem mappings at index-time to support maximum recall and does so in a way to allow searching to weight some inflections higher than others.

2.3 Searching

We experimented with the SearchServer CONTAINS predicate. Our test application specified SearchSQL to perform a boolean-OR of the query words. For example, for Bulgarian topic 279 whose Title was “Референдуми в Швейцария” (Swiss referendums), a corresponding SearchSQL query would be:

    SELECT RELEVANCE('2:3') AS REL, DOCNO
    FROM CLEF05BG
    WHERE FT_TEXT CONTAINS 'Референдуми'|'в'|'Швейцария'
    ORDER BY REL DESC;

(Note that “в” is a stopword for Bulgarian so its inclusion in the query wouldn’t actually add any matches.)
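A test application could assemble such a boolean-OR query from a topic Title along the following lines. This Python sketch is only an illustration: the function name `build_or_query` is hypothetical, splitting the Title on whitespace is a simplification, and doubling embedded single quotes is an assumed escaping rule borrowed from standard SQL rather than something confirmed for SearchSQL.

```python
def build_or_query(title, table, relevance_method="2:3"):
    """Build a SearchSQL-style boolean-OR query from the words of a topic title."""
    words = title.split()
    # Assumption: single quotes inside a term are escaped by doubling them.
    terms = "|".join("'" + word.replace("'", "''") + "'" for word in words)
    return (
        f"SELECT RELEVANCE('{relevance_method}') AS REL, DOCNO "
        f"FROM {table} WHERE FT_TEXT CONTAINS {terms} ORDER BY REL DESC;"
    )

# Bulgarian topic 279 from the example in the text.
query = build_or_query("Референдуми в Швейцария", "CLEF05BG")
```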

Most aspects of the SearchServer relevance value calculation are the same as described last year [15]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a manner similar to Okapi [8] and dampens the inverse document frequency using an approximation of the logarithm. These calculations are based on the stems of the terms (roughly speaking) when doing morphological searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’ was previously specified). The SearchServer RELEVANCE_METHOD setting was set to ‘2:3’ and RELEVANCE_DLEN_IMP was set to 750 for all experiments in this paper.

2.4 Diagnostic Runs

For the diagnostic runs listed in Table 2, the run names consist of a language code (“BG” for Bulgarian, “FR” for French, “HU” for Hungarian and “PT” for Portuguese) followed by one of the following labels:

• “lex” (FR and PT only): The run used SearchServer lexical stemming. The /inflect option (SET TERM_GENERATOR ‘word!ftelp/inflect’) was specified.
• “lexnos”: Same as “lex” except that /nostop was additionally specified which prevents query terms from being discarded if all of their stems are stopwords (note that stopwords themselves were still not found because they were not indexed).
• “lexall”: Same as “lex” except that a separate index was used which did not stop any words from being indexed (specifying /nostop would make no difference with this index).
• “lexsing”: Same as “lex” except that /single was additionally specified (so that just one stemming interpretation was used at search time).
• “neu” (BG and HU only): Same as “lex” except that the experimental Neuchatel stemmer was used [9].
• “neunos”: Same as “lexnos” except that the Neuchatel stemmer was used.
• “neuall”: Same as “lexall” except that the Neuchatel stemmer was used.
• “neuposs” (HU only): Same as “neu” except that the call to the remove_possessive function was skipped. (Prof. Savoy suggested to us that it was unclear if removing possessive pronouns was a good idea, which we interpreted as uncertainty about the remove_possessive function.)
• “sn” (FR and PT only): Same as “lex” except that the Porter (Snowball) stemmer [7] was used.
• “snru” (BG only): Same as “neu” except that the Porter (Snowball) stemmer for Russian was used.
• “4gram”: Same as “lexall” except that the run used a different index which primarily consisted of the 4-grams of terms, e.g. the word ‘search’ would produce index terms of ‘sear’, ‘earc’ and ‘arch’. No stemming was done; searching used the IS_ABOUT predicate (instead of the CONTAINS predicate) with morphological options disabled to search for the 4-grams of the query terms.
• “none”: The run disabled morphological searching. (The run used the same index as “lex” for FR and PT and the same index as “neu” for HU and BG, but SET TERM_GENERATOR ‘’ was specified so that variations from stemming were not matched.)

[Table 2 (scores not recoverable) listed the runs BG-neuall, BG-neunos, BG-4gram, BG-snru, BG-neu, BG-none; FR-sn, FR-lex, FR-lexnos, FR-lexall, FR-4gram, FR-lexsing, FR-none; HU-4gram, HU-neunos, HU-neuall, HU-neu, HU-neuposs, HU-none; PT-sn, PT-lexall, PT-lex, PT-lexnos, PT-lexsing, PT-none, PT-4gram.]
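The 4-gram index terms described for the “4gram” runs can be sketched in Python as follows. The function name `char_ngrams` is hypothetical, and keeping words of 4 characters or fewer whole is an assumption (the text only says the index “primarily” consisted of 4-grams).

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word, in order.

    Assumption: words no longer than n characters are kept as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# The example from the text: 'search' yields 'sear', 'earc' and 'arch'.
search_grams = char_ngrams("search")
```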

Note that all diagnostic runs just used the Title field of the topic.

2.5 Evaluation Measures

Traditionally in ad hoc retrieval experiments, the primary evaluation measure is “average precision”. For a topic, it is the average of the precision after each relevant document is retrieved (using zero as the precision for relevant documents which are not retrieved). By convention, it is based on the first 1000 retrieved documents for the topic. The score ranges from 0.0 (no relevants found) to 1.0 (all relevants found at the top of the list). Average precision takes into account both precision and recall, and it is very good for detecting retrieval differences because even small differences in the ranks of relevant documents affect the score. “Mean Average Precision” (MAP) is the mean of the average precision scores over all of the topics (i.e. all topics are weighted equally).
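The definition above can be written out as a short Python sketch. The function names and the representation of a run (a ranked list of document identifiers plus a set of relevant identifiers) are assumptions for illustration, and each document is assumed to appear at most once in the ranked list.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision over the first `cutoff` retrieved documents.

    Relevant documents that are never retrieved contribute a precision of zero,
    because the sum is divided by the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision after this relevant document
    return precision_sum / len(relevant)

def mean_average_precision(per_topic_runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, topics weighted equally."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in per_topic_runs]
    return sum(scores) / len(scores)
```

For example, if a topic has 3 relevant documents and only 2 are retrieved, at ranks 1 and 3, the average precision is (1/1 + 2/3 + 0) / 3.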

If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal Rank” (RR). For a topic, it is $1/r$ where r is the rank of the first row for which a desired page is found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of the reciprocal ranks over all the topics.

An experimental measure introduced in this paper (along with the companion web retrieval paper [12]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is $1.08^{1-r}$ where r is the rank of the first row for which a desired page is found, or zero if a desired page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of 1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank 3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53 and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of potential result list reading the system saved the user to get to the first relevant, assuming that users are less and less likely to continue reading as they get deeper into the result list.
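The two first-relevant measures can be compared directly with a small Python sketch. The function names are hypothetical, and representing “no relevant document found” as `None` is an assumption of this illustration.

```python
def reciprocal_rank(first_relevant_rank):
    """RR = 1/r, or zero when no relevant document was found (rank is None)."""
    if first_relevant_rank is None:
        return 0.0
    return 1.0 / first_relevant_rank

def first_relevant_score(first_relevant_rank):
    """FRS = 1.08 ** (1 - r), or zero when no relevant document was found."""
    if first_relevant_rank is None:
        return 0.0
    return 1.08 ** (1 - first_relevant_rank)

# Side-by-side values for the ranks discussed in the text.
comparison = {r: (first_relevant_score(r), reciprocal_rank(r)) for r in (1, 2, 3, 10, 52, 53)}
```

Evaluating `comparison` reproduces the figures quoted above after rounding to two decimals, including the crossover between ranks 52 and 53.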

“Success@n” is the percentage of topics for which at least one relevant document was returned in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and Success@10.

2.6 Statistical Significance Tables

For tables comparing 2 diagnostic runs (such as Table 3), the columns are as follows:

• “Expt” specifies the experiment. The language code is given, followed by the labels of the 2 runs being compared. The difference is the first run minus the second run. For example, “FR lex-none” specifies the difference of subtracting the scores of the French ‘none’ run from the French ‘lex’ run (of Table 2).
• “ΔMAP” is the difference of the mean average precision scores of the two runs being compared (and “ΔFRS” is the difference of the (mean) FRS scores).
• “95% Conf” is an approximate 95% confidence interval for the difference (calculated from plus/minus twice the standard error of the mean difference). If zero is not in the interval, the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of neutral impact (on average), though if the average difference is small (e.g. <0.020) it may still be too minor to be considered “significant” in the magnitude sense.
• “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively) compared to the second run. These numbers should always add to the number of topics (49 for Bulgarian, 50 for the others).
• “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the topic number in brackets (the topic numbers range from 251 to 300). The first difference is the largest one of any topic (based on the absolute value). The third difference is the largest difference in the other direction (so the first and third differences give the range of differences observed in this experiment). The middle difference is the largest of the remaining differences (based on the absolute value).
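As a hedged illustration of how the mean difference, “95% Conf” and “vs.” columns could be computed from per-topic scores, consider the following Python sketch. The function name and the returned dictionary keys are hypothetical, and the use of the sample standard deviation (dividing by n-1) is an assumption; the text only specifies plus/minus twice the standard error of the mean difference.

```python
import math
from statistics import mean, stdev

def paired_difference_summary(scores_first, scores_second):
    """Summarize per-topic differences (first run minus second run).

    Both lists must be aligned by topic and contain at least 2 topics.
    """
    diffs = [a - b for a, b in zip(scores_first, scores_second)]
    mean_diff = mean(diffs)
    # Assumption: standard error uses the sample standard deviation (n - 1).
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    low, high = mean_diff - 2 * std_err, mean_diff + 2 * std_err
    return {
        "mean_diff": mean_diff,
        "conf_interval": (low, high),
        "significant": not (low <= 0.0 <= high),  # zero outside the interval
        "higher": sum(1 for d in diffs if d > 0),
        "lower": sum(1 for d in diffs if d < 0),
        "tied": sum(1 for d in diffs if d == 0),
    }
```

The three counts always add up to the number of topics compared, matching the consistency check described for the “vs.” column.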

3 Results of Morphological Experiments

In the per-topic analysis, the official topic translations were used as much as possible. Online translation services were consulted at times ([5] was sometimes helpful for Hungarian, and we found the Russian-to-English translations at [1] often worked for Bulgarian). Prof. Savoy also assisted with some Bulgarian words. But any translation errors are the responsibility of the author.

3.1 Impact of Stemming

• HU-279 (Svájci népszavazások): Without Hungarian stemming, no document contained both of the query terms. No relevant document contained the query word ‘népszavazások’. Only some of the relevant documents even contained ‘Svájci’ (and lots of non-relevants also did). With stemming, average precision was 87 points higher from extra matches such as ‘svájciak’, ‘Svájc’, ‘Svájcban’, ‘Svájcot’, ‘Svájcról’, ‘népszavazáson’, ‘népszavazás’, ‘népszavazást’ and ‘népszavazással’.
• BG-279 (Референдуми в Швейцария): With Bulgarian stemming, average precision was 58 points higher from extra matches for ‘referendums’ such as референдум and референдума.
• FR-279 (Référendums en Suisse): This French topic scored lower with stemming (the rank of the first relevant fell from 1 to 13, and average precision fell from 0.10 to 0.01). It appears that the relevant documents were more likely to use the plural ‘Référendums’ than the singular ‘Référendum’, and the latter was a more common word which generated lots of matches when stemming.

3.2 Impact of Experimental /nostop Option

• HU-265 (A Deutsche Bank szerzeményei (Deutsche Bank Takeovers)): The query word ‘Bank’ stemmed to ‘ban’ (in) which was a stopword, so by default, the word ‘Bank’ was not matched in the documents. With the /nostop option, ‘Bank’ was matched and average precision was 13 points higher. (Incidentally, this issue is presumably why Table 3 shows that stemming scored 12 points lower on HU-265; without stemming, ‘Bank’ was found in the documents.) Perhaps this issue would not have arisen with a lexical stemmer which would preserve the meaning more closely.
• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): The query word ‘Német’ (German) stemmed to ‘nem’ (not) which was a stopword and so this useful word was dropped from the query by default. With the /nostop option, average precision was 40 points higher.
• HU-282 (Elítéltekkel szembeni durva bánásmód (Prison Abuse)): In this topic, the default scored higher. Using /nostop changed the rank of the first relevant from 3 to 7. The stopword list contained ‘szemben’ (in front of), and the query word ‘szembeni’ presumably is a related noise word, and discarding it was useful. The /nostop option kept ‘szembeni’, which only occurred in 319 documents, so it had a high enough weighting from inverse document frequency to hurt precision.
• BG-273 (Разширяването на НАТО (NATO Expansion)): НАТО (NATO) stemmed to НА (on) which was a stopword, so the default behaviour removed a key word from the query. With /nostop, the first relevant score was 80 points higher.
• BG-267 (Най-добрите чуждоезикови филми (Best Foreign Language Films)): The query word филми (films) stemmed to филм (film) which surprisingly was a stopword, so the default behaviour discarded a key query term. Our supplier [9] has confirmed that this was an error in the Bulgarian stopword list.
• BG-257 (Етническото прочистване на Балканите (Ethnic Cleansing in the Balkans)): The query word Балканите (Balkans) stemmed to балкан (Balkan mountain) which surprisingly was a stopword. Even though it turned out that precision was a little higher without the Balkans term in this case, in general this appears to be another error in the stopword list.

In the topics we examined, in 3 cases the default behaviour of dropping useful terms may have resulted from the Bulgarian and Hungarian stemmers being algorithmic instead of lexical (a lexical stemmer typically does not change the meaning of a word, except when words are ambiguous). It appears that for algorithmic stemmers it may be better to use the /nostop option by default.
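The stopword collisions discussed in this section can be mimicked with a toy Python sketch. It is not the SearchServer implementation: the function names, the `nostop` flag and the lookup-table stemmer are hypothetical, and only the two stem mappings reported above (‘Német’ to ‘nem’, ‘Bank’ to ‘ban’) are taken from the text.

```python
# Stem mappings reported in the per-topic analysis; all other words are
# simply lower-cased by this toy stemmer (an assumption for illustration).
TOY_STEMS = {"Német": ["nem"], "Bank": ["ban"]}
STOPWORDS = {"nem", "ban"}

def toy_stem(term):
    return TOY_STEMS.get(term, [term.lower()])

def filter_query_terms(terms, stem, stopwords, nostop=False):
    """Drop a query term when all of its stems are stopwords, unless nostop is set."""
    kept = []
    for term in terms:
        stems = stem(term)
        if not nostop and stems and all(s in stopwords for s in stems):
            continue  # default behaviour: the term collides with a stopword
        kept.append(term)
    return kept

topic_292 = ["Német", "városok", "újjáépítése"]
default_terms = filter_query_terms(topic_292, toy_stem, STOPWORDS)
nostop_terms = filter_query_terms(topic_292, toy_stem, STOPWORDS, nostop=True)
```

In this sketch the default behaviour loses ‘Német’ from the HU-292 query, while the `nostop` variant keeps all three terms.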

In another 2 cases, it appears the stoplist was in error, which illustrates the usefulness of the CLEF judged test collections: they enable an analyst who does not understand a language to find issues in a resource for the language and make inferences about its quality.

3.3 Impact of Indexing All Words

• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): We saw earlier that this topic benefitted from the /nostop option (average precision up 40 points), but when indexing all words, average precision fell back (33 points). The reason was that the common word ‘nem’ (not) was now indexed, so ‘Német’ (German), which stems to ‘nem’ with the algorithmic stemmer, had a much lower inverse document frequency than before, and this useful word received less weight. (Even if it had received more weight, there would have been potential confusion with all the indexed occurrences of ‘nem’.)
• BG-271 (Бракове между хомосексуални (Gay Marriages)): The stopword между (between) was not in the 2 relevant documents. When it was indexed, its inclusion caused some nonrelevants to be preferred, and average precision dropped 55 points.
• BG-295 (Пране на пари (Money Laundering)): This topic scored higher when indexing all words. Surprisingly, the word пари (money) was a stopword, presumably another error (the Bulgarian stoplist apparently needs a review). It seems fine that на (on) was a stopword.

In practice, indexing all words may not be so troublesome because it is typically easy for users to omit noise words from the query, and stemming issues can be worked around by disabling the finding of word variants (SearchServer makes it optional at search-time).

3.4 Comparison to 4-grams

Compound words appear to be fairly common in Hungarian, but the algorithmic stemmer did not perform decompounding, a technique we have found to be useful for languages such as Finnish [15]. However, [4] has found that using 4-grams as index terms works well in ad hoc ranking experiments for many European languages, including compound-word languages. Table 6 compares our 4-gram runs to the stemming runs which indexed all words (because we did not use stopwords with our 4-gram index). As anticipated, there was a statistically significant increase in mean average precision for Hungarian, though there was a decrease for Portuguese which was also statistically significant. We look at the largest per-topic differences for Hungarian:

• HU-255 (Internetfüggők (Internet Junkies)): Average precision was 46 points higher with 4-grams for this topic (a compound word). The stemmer found the 3 relevant documents which contained ‘internetfüggő’ or the original query word ‘internetfüggők’. 4-grams matched other variants such as ‘Internetfüggőség’ (Internet dependence), ‘internetfüggőséggel’ and ‘internetfüggőségben’ and found all 6 relevant documents. 4-grams also matched other potentially helpful words such as ‘internet’, ‘internetezők’, ‘internetezés’, ‘komputerfüggőséget’ and ‘függővé’. But 4-grams also produced unwanted matches, such as ‘intervallum’ (interval) and ‘Szinte’ (as good as); these both came from the 4-gram ‘inte’. If the stemmer had just additionally matched ‘Internetfüggőség’, all 6 relevants would have been found, but we’re still investigating if the -seg suffix is one that a Hungarian stemmer should generally remove or not.
• HU-292 (Német városok újjáépítése (Rebuilding German Cities)): On this topic, 4-grams still just found 1 of the 2 relevant documents, but it moved it from rank 3 to 1 (compared to the stemming run). While 4-grams additionally matched ‘újjáépítik’, the bigger advantage was probably that the 4-gram method did not match ‘nem’ which we know from earlier was a troublesome match for the stemming run.
• HU-283 (James Bond-filmek (James Bond Films)): On this topic, the 4-gram run scored 30 points lower in average precision than the stemming run. The 4-gram run favored documents with the ‘filmek’ pattern (which corresponded to three 4-grams (‘film’, ‘ilme’ and ‘lmek’) and so it received roughly 3 times the weight compared to the stemming run). However, the relevant documents tended not to use ‘filmek’; instead they tended to use other variants matched by the stemmer such as ‘film’, ‘filmet’, ‘filmnél’, ‘filmben’ and ‘filmhez’.
• HU-286 (Futballsérülések (Football Injuries)): This topic had no matches in the stemming run, but a relevant document was ranked first in the 4-gram run. 4-gram matches in the relevant documents included ‘futballista’, ‘futballkapus’ (goalkeeper), ‘futballválogatott’, ‘vállsérülést’, ‘vállsérüléssel’, ‘vállsérülés’, ‘sérülés’ (injury), ‘sérült’ and ‘sérültet’. This might be a case for which decompounding would be helpful.
• HU-261 (Jövendőmondás (Fortune-telling)): The stemming run only matched the one document which contained ‘jövendőmondást’ and ‘jövendőmondás’ and it was judged nonrelevant, so it scored 0 on this topic. The 4-gram run returned 1 of the 3 relevant documents at rank 2 (the others weren’t ranked in the top 100). Matches in the relevant document included ‘jövendölők’ and ‘jövendőmondók’. The latter of these perhaps could have been matched with additional stemming rules, but the former would require a stemmer to do decompounding (or, if the user had decompounded the query, the latter would require index-time decompounding to match).
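The mix of helpful and unwanted 4-gram matches seen for HU-255 can be reproduced with a short Python sketch that intersects the 4-gram sets of a query word and a document word. The function names are hypothetical, lower-casing stands in for the case normalization, and accents are kept here for readability even though they were not indexed in the experiments.

```python
def char_ngram_set(word, n=4):
    """Set of overlapping character n-grams of the lower-cased word."""
    word = word.lower()
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def shared_ngrams(query_word, doc_word, n=4):
    """n-grams that a query word and a document word have in common."""
    return char_ngram_set(query_word, n) & char_ngram_set(doc_word, n)

query = "Internetfüggők"
helpful = shared_ngrams(query, "Internetfüggőség")  # many shared 4-grams
noise_a = shared_ngrams(query, "intervallum")       # includes 'inte'
noise_b = shared_ngrams(query, "Szinte")            # includes 'inte'
```

A word variant such as ‘Internetfüggőség’ shares most of its 4-grams with the query word, whereas the unwanted matches share very few, which is why ranking by the number (and weight) of matching 4-grams can still favor the helpful variants.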

SearchServer can find character sequences inside European words without n-gramming if the user specifies wildcards, so for precise searches it’s unclear if n-gram indexes would add value. N-gram approaches typically produce larger indexes and their queries can be slower for common word-searching cases. We’re not aware of them being used in practice for European language retrieval, except perhaps by web search engines for URL indexing.

3.5 Comparison to Alternate Stemmers

Table 8 isolates the impact of using the SearchServer /single option. This option only makes a difference for the SearchServer lexical stemmers which can produce more than one stem for a term. Like last year [15], our method for including all stems without overweighting some of the terms apparently was effective. Even in the high-variance first relevant score measure, the bigger differences favored including all stems.

Table 11 isolates the impact of the blind feedback technique (based on using the first 2 returned rows to expand the query). While mean average precision increased for all 4 languages (and the increase was statistically significant for 3 of them), the first relevant score decreased for all 4 languages (and the decrease was statistically significant for the remaining one).

The blind feedback technique presumably works best if relevant documents appear in the first 2 rows, in which case first relevant score cannot be improved. If the first 2 rows do not contain relevant documents, then using those rows to expand the query may hurt the query and push down the first relevant even further.

This result may explain in part why blind feedback techniques are not known to be used in practice even though they have been popular with experimenters for several years in ad hoc evaluations (which typically focus on mean average precision).

References

[1] AltaVista’s Babel Fish Translation Service. http://babelfish.altavista.com/tr

[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/

[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. Sixteenth International Unicode Conference, 2000.

[4] Paul McNamee and James Mayfield. JHU/APL Experiments in Tokenization and Non-Word Translation. Working Notes for the CLEF 2003 Workshop, 2003.

[5] MTA SZTAKI: English-Hungarian, Hungarian-English Online Dictionary. http://dict.sztaki.hu/english-hungarian

[6] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page. http://research.nii.ac.jp/~ntcadm/index-en.html

[7] M. F. Porter. Snowball: A language for stemming algorithms. October 2001. http://snowball.tartarus.org/texts/introduction.html

[8] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu and M. Gatford. Okapi at TREC-3. Proceedings of TREC-3, 1995.

[9] Jacques Savoy. CLEF and Multilingual information retrieval resource. http://www.unine.ch/info/clef/

[10] Börkur Sigurbjörnsson, Jaap Kamps and Maarten de Rijke. Overview of WebCLEF 2005. To appear in Working Notes for the CLEF 2005 Workshop, 2005.

[11] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/

[12] Stephen Tomlinson. European Web Retrieval Experiments with Hummingbird SearchServer™ at CLEF 2005. To appear in Working Notes for the CLEF 2005 Workshop, 2005.

[13] Stephen Tomlinson. Experiments in 8 European Languages with SearchServer™ at CLEF 2002. Proceedings of CLEF 2002, 2003.

[14] Stephen Tomlinson. Lexical and Algorithmic Stemming Compared for 9 European Languages with Hummingbird SearchServer™ at CLEF 2003. Working Notes for the CLEF 2003 Workshop, 2003.

[15] Stephen Tomlinson. Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer™ at CLEF 2004. Working Notes for the CLEF 2004 Workshop, 2004.