<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>European Web Retrieval Experiments with Hummingbird SearchServer™ at WebCLEF 2005</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Tomlinson</string-name>
          <email>stephen.tomlinson@hummingbird.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hummingbird</institution>
          ,
          <addr-line>Ottawa, Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2005</year>
      </pub-date>
      <abstract>
        <p>Hummingbird participated in the mixed monolingual retrieval task of the WebCLEF Track of the Cross-Language Evaluation Forum (CLEF) 2005. In this task, the system was given 547 known-item queries from 11 languages (134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30 Russian, 16 Greek, 5 Icelandic and 1 French). The goal was to find the desired page in the 82GB EuroGOV collection (3.4 million pages crawled from government sites of 27 European domains). We experimented with different techniques for web retrieval and analyzed the differences between them. We defined a new measure, First Relevant Score (FRS), to facilitate per-topic analysis, and we focused on analyzing Greek, Danish and Icelandic topics. We found that stopword processing was more important than anticipated, perhaps because words common in one language may tend to be overweighted by inverse document frequency in a mixed language collection. Extra weight on the document title helped significantly, and extra weight on less deep urls significantly helped home page queries. Stemming was of neutral impact on average, but could make a substantial difference for individual queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.</p>
      <p>¹SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.</p>
      <p>
        SearchServer works in Unicode internally [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and supports most of the world’s major
character sets and languages. The major conferences in text retrieval experimentation (CLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
NTCIR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and TREC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.
      </p>
      <p>This (draft) paper describes experimental work with SearchServer for the task of finding known
home or named pages in 11 European languages (Spanish, English, Dutch, Portuguese, German,
Hungarian, Danish, Russian, Greek, Icelandic and French) using the WebCLEF 2005 test
collection.
</p>
    </sec>
    <sec id="sec-1a">
      <title>Methodology</title>
      <sec id="sec-1-1">
        <title>Data</title>
        <p>For the submitted runs in June 2005, SearchServer experimental development build 7.0.0.707 was
used.</p>
        <p>
          The collection to be searched was the EuroGOV collection. It consisted of 3,589,502 pages crawled
from government sites of 27 European domains. Uncompressed, it was 88,062,007,676 bytes
(82.0 GB). The average document size was 24,533 bytes. More details on this collection are in
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Note that we only indexed 3,417,463 of the pages because the organizers provided a “blacklist”
of 172,039 pages to omit (primarily binary documents).
        </p>
        <p>
          For the mixed monolingual task, there were 547 known-item queries from 11 different languages
(134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30
Russian, 16 Greek, 5 Icelandic and 1 French). Of these, 345 were named page queries and 242
were home page queries. More details on the mixed monolingual task are in the track overview
paper [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Indexing</title>
        <p>Our indexing approach was based on the approach we used for TREC Web tasks the previous
three years (described in detail in [12]). Briefly, in addition to full-text indexing, the custom text
reader cTREC populated particular columns such as TITLE (if any), URL, URL_TYPE and
URL_DEPTH. The URL_TYPE was set to ROOT, SUBROOT, PATH or FILE, based on the
convention which worked well in TREC 2001 for the Twente/TNO group [15] on the entry page
finding task (also known as the home page finding task). The URL_DEPTH was set to a term
indicating the depth of the page in the site. Table 1 contains URL types and depths for example
URLs. The exact rules we used are given in [12].</p>
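      <p>As an illustrative sketch only (the exact cTREC rules in [12] differ in details, and the index-file handling below is an assumption), a TNO-style URL classifier can be expressed as:</p>

```python
from urllib.parse import urlparse

def classify_url(url):
    # Rough sketch of a TNO-style classification: ROOT is the site's root
    # page, SUBROOT a top-level directory, PATH a deeper directory, FILE a
    # named file; depth counts slash-separated steps from the site root
    # (the root page itself is depth 1).
    path = urlparse(url).path
    for index in ("index.html", "index.htm", "default.html"):
        if path.endswith("/" + index):
            path = path[: -len(index)]   # treat index files as their directory
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "ROOT", 1
    if not path.endswith("/"):
        return "FILE", len(segments) + 1
    if len(segments) == 1:
        return "SUBROOT", 2
    return "PATH", len(segments) + 1
```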
        <p>WebCLEF required a few indexing enhancements compared to TREC. In particular, it wouldn’t
suffice to assume all the pages were in the ASCII character set. We added a /cs option to our
cTREC text reader which used the first recognized ‘charset’ specification in the page (e.g. from
the meta http-equiv tag) to indicate from which character set to convert the page to Unicode
(Win_1252 was assumed if no charset was specified).</p>
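      <p>A minimal sketch of this kind of charset sniffing (hypothetical code, not the actual /cs implementation) might be:</p>

```python
import re

# Find the first 'charset' declaration in the raw page bytes (e.g. from a
# meta http-equiv tag) and decode with it, defaulting to Windows-1252.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\x27]?\s*([A-Za-z0-9_\-]+)',
                        re.IGNORECASE)

def to_unicode(raw):
    match = CHARSET_RE.search(raw)
    name = match.group(1).decode("ascii") if match else "windows-1252"
    try:
        return raw.decode(name, errors="replace")
    except LookupError:          # a charset name the platform does not know
        return raw.decode("windows-1252", errors="replace")
```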
        <p>For the baseline task, in which the system was not to make use of any of the topic metadata
such as the specified language of the query, we still indexed with English stopwords (even though
the majority of the documents were in other languages). We treated the apostrophe as a term
separator (which we normally do for languages other than English, but in this collection, it was
also a separator for English). No accents were indexed. English stemming was used on the table,
but SearchServer also indexed all the surface forms (after Unicode normalizations such as case
normalization), and the baseline runs just searched the surface forms, not the stems.</p>
        <p>
          For 2 of our submitted runs, we labelled the runs as making use of the topic and page language
metadata (which were always the same in the mixed monolingual task) along with the page’s
domain. For these runs, we created a set of language-specific indexes (one for each of the 11
query languages) which used a stemmer and stopfile for that language (for English and Icelandic,
we actually used the original baseline index, which had English stems and stopwords). For some
of the languages, because we were close to the submission deadline, we also skipped indexing
some of the domains to save time (e.g. for Greek, just the ‘gr’ and ‘eu.int’ subsets of EuroGOV
were included because it was known all the results were in the ‘gr’ domain) which would have
a (probably minor) effect on the inverse document frequencies (minor especially since we always
included the ‘eu.int’ subset in each index). For 9 of the languages (Danish, Dutch, English, French,
German, Greek, Portuguese, Russian and Spanish), the lexical stemmer in SearchServer (based
on internal stemming component 3.7.0.15) was used. For Hungarian, the Neuchatel stemmer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
was used (see our companion ad hoc retrieval paper [11] for details). For Icelandic, we used the
English index as previously mentioned. For Greek and Russian, we additionally enabled indexing
of a few accents because the stemmer was accent-sensitive. When processing queries for these
runs, the query was directed to the index for the specified language.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Searching</title>
        <p>We executed 7 runs in June 2005, though only 5 were allowed to be submitted. All 7 are described
here. The first 4 runs were ‘baseline’ runs which did not use the topic metadata. The other 3
runs made use of the topic metadata (in particular, the domain, and for the last 2 runs, also the
language).</p>
        <p>humWC05none: This run was a plain content search of the baseline table. No inflections
were used. This run was the analog of the “none” runs described in our ad hoc retrieval
paper [11]. It used the ‘2:3’ relevance method and document length normalization (SET
RELEVANCE_DLEN_IMP 500). The IS_ABOUT predicate was used instead of the CONTAINS
predicate (and hence the VECTOR_GENERATOR was set to blank to disable inflections instead
of the TERM_GENERATOR), but the relevance calculation was the same. (This run was not
submitted.)</p>
        <p>humWC05p run: This submitted run was the same as humWC05none except that it put
additional weight on matches in the title, url, first heading and some meta tags, including extra weight
on matching the query as a phrase in these fields. Below is an example SearchSQL query. The
searches on the ALL_PROPS column (which contained a copy of the title, url, etc. as described
in [12]) are the difference from the humWC05none run. Note that the FT_TEXT column indexed
the content and also all of the non-content fields except for the URL. More details of the syntax
are explained in [13]. This run used the same approach as the TREC 2004 humW04pl run except
that linguistic inflections were disabled.</p>
        <p>SELECT RELEVANCE(’2:3’) AS REL, DOCNO
FROM EGOV
WHERE
(ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
ORDER BY REL DESC;
humWC05dp run: This submitted run was the same as humWC05p except that it put additional
weight on urls of depth 4 or less (but not on the url type, though url types were still listed with
weight 0 as a way to prevent urls of depth greater than 4 from being excluded). Less deep urls
also received higher weight from inverse document frequency because (presumably) they are less
common. This run used the same approach as the TREC 2004 humW04dpl run except that
linguistic inflections were disabled. Below is an example WHERE clause:</p>
        <p>WHERE
((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
) AND (
(URL_TYPE CONTAINS ’ROOT’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’PATH’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
(URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )
humWC05rdp run: This submitted run was the same as humWC05dp except that it put
additional weight on the url type. This run used the same approach as the TREC 2004 humW04rdpl
run except that linguistic inflections were disabled. Below is an example WHERE clause:
WHERE
((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
) AND (
(URL_TYPE CONTAINS ’ROOT’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’PATH’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
(URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )
humWC05dpD0 run: This run was the same as humWC05dp except that the domain
information of the topic metadata was used to restrict the search to the specified domain. Below is
an example of the domain filter added to the WHERE clause for a case in which the page was
known to be in the ‘it’ domain (which implied the DOCNO would contain ‘Eit’). This run was
not submitted.</p>
        <p>AND (DOCNO CONTAINS ’Eit’ WEIGHT 0)
humWC05dpD run: This submitted run was the same as humWC05dpD0 except that the
language information of the topic metadata was used to direct the search to the table for the
specified language (i.e. the WHERE clause was the same as for humWC05dpD0, but the FROM
clause specified a different table). Inflections were still not used.</p>
        <p>humWC05dplD run: This submitted run was the same as humWC05dpD except that the
content and title searches included linguistic expansion from language-specific stemming (this was
done with SET VECTOR_GENERATOR ‘word!ftelp/inflect’; note that /decompound (applicable
to Dutch and German) is implied for /inflect with SET VECTOR_GENERATOR, unlike with
SET TERM_GENERATOR).
</p>
      </sec>
      <sec id="sec-1-4">
        <title>Evaluation Measures</title>
        <p>If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal
Rank” (RR). For a topic, it is 1/r where r is the rank of the first row for which a desired page is
found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of
the reciprocal ranks over all the topics.</p>
        <p>An experimental measure introduced in this paper (along with the companion ad hoc retrieval
paper [11]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the
rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is
1.08^(1−r) where r is the rank of the first row for which a desired page is found, or zero if a desired
page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of
1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank
3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS
is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53
and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of
potential result list reading the system saved the user to get to the first relevant, assuming that
users are less and less likely to continue reading as they get deeper into the result list.</p>
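      <p>The two measures can be stated compactly (a sketch; ranks are 1-based and None marks a topic with no desired page retrieved):</p>

```python
def reciprocal_rank(r):
    # RR = 1/r; zero when no desired page was found
    return 0.0 if r is None else 1.0 / r

def first_relevant_score(r):
    # FRS = 1.08 ** (1 - r): rank 1 scores 1.0, and each additional rank
    # costs about 8% of the remaining score, a much gentler decay than RR
    return 0.0 if r is None else 1.08 ** (1 - r)
```

Spot-checking the figures quoted above: FRS at rank 2 is 0.93, at rank 10 it is 0.50, and FRS exceeds RR for ranks 2 through 52 but not from rank 53 on.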
        <p>“Success@n” is the percentage of topics for which at least one relevant document was returned
in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval
differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s
impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and
Success@10.
</p>
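      <p>A sketch of Success@n over a list of first-relevant ranks (1-based, None for a topic where no desired page was retrieved):</p>

```python
def success_at_n(first_relevant_ranks, n):
    # Percentage of topics with at least one desired page in the first n rows.
    found = sum(1 for r in first_relevant_ranks if r is not None and n >= r)
    return 100.0 * found / len(first_relevant_ranks)
```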
      </sec>
      <sec id="sec-1-5">
        <title>Per-Topic Tables</title>
        <p>The 7 runs allow us to isolate 6 ‘web techniques’ which are denoted as follows:
• ‘p’ (extra weight for phrases in the Title and other properties plus extra weight for vector
search on properties): The humWC05p score minus the humWC05none score.
• ‘d’ (modest extra weight for less deep urls): The humWC05dp score minus the humWC05p
score.
• ‘r’ (strong extra weight for urls of root, subroot or path types): The humWC05rdp score
minus the humWC05dp score.
• ‘o’ (domain filtering): The humWC05dpD0 score minus the humWC05dp score.
• ‘s’ (stopwords specific to the language and possibly accent-indexing and inverse document
frequency changes): The humWC05dpD score minus the humWC05dpD0 score.
• ‘l’ (linguistic expansion from stemming): The humWC05dplD score minus the humWC05dpD
score.</p>
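      <p>The subtraction scheme can be sketched as follows; the run names are from this paper, but the scores below are invented purely to illustrate the arithmetic:</p>

```python
# Each technique is isolated as the score difference between a pair of runs.
# The FRS values here are made up for illustration only.
runs = {"humWC05none": 0.40, "humWC05p": 0.50, "humWC05dp": 0.55,
        "humWC05rdp": 0.53, "humWC05dpD0": 0.56, "humWC05dpD": 0.60,
        "humWC05dplD": 0.60}

pairs = {"p": ("humWC05p", "humWC05none"),
         "d": ("humWC05dp", "humWC05p"),
         "r": ("humWC05rdp", "humWC05dp"),
         "o": ("humWC05dpD0", "humWC05dp"),
         "s": ("humWC05dpD", "humWC05dpD0"),
         "l": ("humWC05dplD", "humWC05dpD")}

deltas = {t: round(runs[a] - runs[b], 2) for t, (a, b) in pairs.items()}
```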
        <p>For the per-topic tables comparing 2 diagnostic runs (such as Table 3), the columns are as
follows:
• “Expt” specifies the experiment. It starts with one of the above 6 web techniques, followed
by ‘NP’ for named page queries or ‘HP’ for home page queries, optionally followed by the
language code.</p>
        <p>• “ΔFRS” is the difference of the (mean) first relevant score of the two runs being compared.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>• The ‘p’ technique (extra weight for phrases in the Title and other properties plus extra weight
for vector search on properties) was of statistically significant benefit for both named pages
and home pages, which is consistent with our TREC results [14] except that the benefit was
larger at TREC.
• The ‘d’ technique (modest extra weight for less deep urls) was of statistically significant
benefit for home pages and neutral on average for named pages, which is consistent with our
TREC results except that the benefit for home pages was larger at TREC.
• The ‘r’ technique (strong extra weight for urls of root, subroot or path types) was less
detrimental than we expected for named pages and less helpful than we expected for home
pages compared to our TREC results.
• The ‘o’ technique (domain filtering), as expected, never caused the score to go down on any
topic (as the ‘vs.’ column shows) because it just included rows from the known domain. But
the benefit was not large on average, so apparently the unfiltered queries usually were not
confused much by the extra domains.
• The ‘s’ technique (stopwords specific to the language and possibly accent-indexing and
inverse document frequency changes) was a surprise in that it led to a statistically significant
benefit for both named pages and home pages. We look at this more below.
• The ‘l’ technique (linguistic expansion from stemming) was of neutral impact on average,
but it could make a substantial difference for individual queries as we will see below.</p>
      <p>
        In the sections that follow, we focus on Greek, Danish and Icelandic because this is the first
time we have had judged test collections for these languages. In particular, we focus on the impact
of the ‘s’ and ‘l’ techniques, i.e. the impacts of stopwords (and accents) and stemming. For English,
we compare the scores on our own contributed topics to the other English topics. The last section
lists the per-topic tables for the remaining languages in descending order by number of topics, for
future reference.
      </p>
      <p>
        Table 4 lists the mean scores for the 11 Greek named page queries and 5 Greek home page
queries. The top-scoring runs used stemming (run humWC05dplD) or disabled accent-indexing
(run humWC05dpD0). The run with accent-indexing and not stemming (humWC05dpD) did
not score as highly on average. Table 5 shows that the ‘l’ technique (stemming, i.e. the dplD
score minus the dpD score) was positive on average, while the ‘s’ factor (the dpD score minus
the dpD0 score, primarily isolating the impact of stopwords specific to the language, including
specifying accent-indexing in the Greek case) was negative, and it lists the topics most affected
by each technique in each direction, which we examine below. (In the topic analysis below, the
translations are based partly on the official topic translations and partly on the online
Greek-to-English translation service at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].)
      </p>
      <p>WC0112: Table 5 shows that the biggest impact of Greek stemming was on topic 112 (Πλήρης
λίστα των υπουργών και υφυπουργών όλων των υπουργείων της Ελληνικής κυβέρνησης (List of
ministers and deputy ministers for all the ministries of the Greek government)). The desired page was
not retrieved in the top-50 without inflecting because the key query terms were plurals (υπουργών
(ministers), υφυπουργών (undersecretaries), υπουργείων (ministries)) while the desired page just
contained singular forms (Υπουργός (Minister), Υφυπουργός (Undersecretary), Υπουργείο
(Ministry)).</p>
      <p>WC0395: Table 5 shows that the next biggest impact of Greek stemming was on topic 395 (Ο
Έλληνας πρωθυπουργός και το μήνυμά του (The Greek Prime Minister and his message)). With
stemming, the first relevant was found at rank 13 instead of 39, a 34 point increase in FRS (in
the reciprocal rank measure, this would just be a 5 point increase). Without stemming, the only
matching word was του (his), which probably should have been a stopword. With stemming, the
query word πρωθυπουργός (Prime Minister) matched the document’s variant (Πρωθυπουργού).
Because we enabled indexing of Greek accents for our lexical Greek stemmer, the query word
μήνυμά (message) did not match the document form Μήνυμα (which did not include an accent on
the last character; the first letter is just a lowercase-uppercase difference which all runs handled by
normalizing Unicode to uppercase). Note that the humWC05dpD0 run did match Μήνυμα because
accent-indexing was not enabled for this run; presumably this is why the s-NP-EL line of Table
5 shows that switching to the Greek-specific stopfile (which enabled accent indexing) decreased
FRS 34 points for this topic. For most languages, our lexical stemmers are accent-insensitive; we
should investigate doing the same for Greek.</p>
      <p>WC0432: Table 5 shows that the biggest impact of switching to the Greek-specific stopfile was
a detrimental impact on topic 432 (Είσοδος Ελληνικής ιστοσελίδας για τη Συνέλευση για το μέλλον
της Ευρώπης (Greek home page of the convention for the future of Europe)). The desired page was
found at rank 12 without accent-indexing but was not retrieved in the top-50 with accent-indexing.
The humWC05dpD0 run matched the document title terms which were in uppercase and did not
have accents, particularly ΣΥΝΕΛΕΥΣΗ (ASSEMBLY), ΜΕΛΛΟΝ (FUTURE) and ΕΥΡΩΠΗΣ
(EUROPE). (The corresponding query words had accents: Συνέλευση (assembly), μέλλον (future)
and Ευρώπης (Europe)). This issue would presumably impair the ‘p’ web technique (extra weight
on properties such as the title) because title words are often in uppercase and apparently in Greek
uppercase words often omit the accents. (Incidentally, the o-HP-EL line of Table 5 shows that
domain filtering (restricting to the .gr domain) was useful for this query; without it, even without
accent-indexing, the retrieved pages were mostly from the .eu.int domain.)</p>
      <p>WC0445: Table 5 shows that the biggest positive impact of switching to the Greek-specific
stopfile was on topic 445 (Πληροφορίες επικοινωνίας όλων των υπουργείων της Ελληνικής κυβέρνησης
(Contact information of all the ministries of the Greek government)). The reason seems to be that
the non-content words in the query (such as των (of) and της (her)) generated spurious matches
in the humWC05dpD0 run (which did not use Greek-specific stopwords), pushing down the desired
page from rank 28 to beyond the top-50. Normally, common words have little effect on the ranking
because they have a low inverse document frequency (idf), but in this mixed language collection,
common words in the Greek documents are still fairly uncommon overall, and hence get relatively
more weight. This topic illustrates why stopword processing may be of more importance in mixed
language collections than in single language collections.</p>
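      <p>A toy idf calculation (with made-up document counts) illustrates the effect:</p>

```python
import math

# A Greek function word appearing in most Greek pages is common relative to
# a Greek-only collection, but rare relative to the whole mixed-language
# collection, so a standard idf formula assigns it a much higher weight.
def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)

greek_only_idf = idf(100_000, 60_000)    # low weight: the word is common
mixed_idf = idf(3_400_000, 60_000)       # much higher in the mixed collection
```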
      <p>
        Even though there were just 16 Greek topics, with careful experimental setup and detailed
per-topic analysis, we learned a lot about Greek web search in a mixed language collection. Stemming
can be quite helpful, accent mismatches are common (especially in the important Title field of web
documents), and stopwords common in one language may be over-weighted in a mixed language
collection by traditional idf formulations.
      </p>
      <p>
        WC0233: Table 7 shows that the biggest impact of switching to the Danish-specific stopfile was a
71 point increase in FRS on topic 233 (presserum europæiske kontor for bekæmpelse af svig (press
room of the european anti fraud office)). Without having ‘af’ as a stopword, the first relevant rank
fell from 2 to 21. This appears to be a similar finding to Greek topic WC0445 in that a common
word in one language was uncommon enough in the mixed language collection to be assigned a
high enough inverse document frequency to cause trouble. (Our Danish stoplist was based on
Porter’s [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].) Incidentally, with stemming enabled, the rank improved from 2 to 1 for this topic,
in part because of an extra ‘bekaempelse’ match in the meta keywords and also from an extra
‘Europaeiske’ match in the body. It’s good to see that the SearchServer stemmer handled the æ vs.
ae variation of Danish (the query words used the one-character ligature (æ) while the document
words used two letters (‘a’ and ‘e’)).
      </p>
      <p>WC0392: Another interesting Danish stemming case was on topic 392 (Rigsombudsmanden i
Grønland (the high commissioner of greenland)). With stemming, the rank of the desired page
improved from 24 to 19. The extra matches from stemming were ‘Rigsombudsmand’ and
‘Groenland’ (the latter occurred in the filenames of img tags, which we indexed). Again, it’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the two-letter variant (‘oe’).</p>
      <p>
        WC0317: On topic 317 (økologisk landbrug i europa (organic farming in europe)), the rank
of the desired page actually fell from 4 to 8 with stemming, even though the additional matches
of ‘okologisk’ (in the meta keywords) and ‘landbrugets’ look proper. (As an aside, the compound
‘landbrugspolitik’ was not matched; we’re unsure in general how common compound words are in
Danish.) The relevance scores of the top documents were close together for this topic, so the fall in
rank appears to be a chance result. Note that the cTREC text reader used for these experiments
did not normalize the html entity reference ‘&amp;Oslash;’ to Ø (or most other entity references for
that matter, which may have impaired the overall results for some languages). It’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the one-letter variant (‘o’).
      </p>
      <p>
        For Icelandic, we used English stopwords and English stemming. We review some topics to see
what can be learned about Icelandic retrieval.
      </p>
      <p>WC0488: Table 9 shows that the only topic on which English stemming made a difference
was topic 488 (framboð ferskvatns í evrópu (Fresh water supplies in europa)). The desired page’s
rank fell from 1 to 2 with English stemming because it matched the word ‘Ferskvatn’ which was
not in the desired page (the English lexical stemmer was augmented with a stem guesser for
unrecognized words). A variant in the desired page, ‘ferskvatni’, was not matched by English
stemming. It appears that ‘í’ is a potential Icelandic stopword (‘i’ actually was not in our English
list though arguably should be). This topic also shows that Icelandic uses the small letter Eth (ð).
SearchServer case normalizes ð to the capital letter Eth (Ð).</p>
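      <p>This normalization can be checked directly in any Unicode-aware language; a one-line illustration:</p>

```python
# Unicode case normalization maps the Icelandic small letter eth (U+00F0)
# to the capital letter Eth (U+00D0), so both case variants index identically.
query_term = "framboð"
normalized = query_term.upper()
```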
      <p>WC0456: In topic 456 (upplýsingar um europol (europol factsheet)), English stemming missed
apparent variants to the query word ‘upplýsingar’ such as ‘Upplýsingasíða’ and ‘upplýsingamál’.
‘um’ appears to be a potential Icelandic stopword.</p>
      <p>WC0243: In (home page) topic 243 (umhverfisstofnun evrópu (european environment agency)),
we noticed that some web pages used entity references such as ‘&amp;eth;’ and ‘&amp;thorn;’ and ‘&amp;yacute;’
which our cTREC text reader did not normalize to the corresponding character, possibly impairing
results for some queries.</p>
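      <p>A sketch of the missing normalization step (using Python’s standard html module for illustration, not the cTREC reader itself):</p>

```python
import html

AMP = chr(38)  # the ampersand character, spelled out to keep this XML-safe

# Resolve HTML character entity references (e.g. the Oslash entity to the
# letter Ø) before indexing, a step the cTREC reader here did not perform
# for most entities.
def normalize_entities(text):
    return html.unescape(text)

sample = AMP + "Oslash;kologisk " + AMP + "eth; " + AMP + "thorn; " + AMP + "yacute;"
decoded = normalize_entities(sample)
```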
      <p>
        We were disappointed that the Icelandic thorn (lowercase þ or uppercase Þ) was not used in
any of the topic words. But overall, even with just 5 topics in the test set, we have learned at
least that an Icelandic stemmer would potentially be helpful for Icelandic retrieval.
      </p>
      <p>
        WebCLEF participants were requested to contribute at least 30 known-item topics. Each topic
consisted of a query, the correct answer page in EuroGOV, and a list of duplicate and translated
pages in EuroGOV. We contributed 30 English topics. Tables 10 and 11 separate the results for
our topics from the other English topics. Based on the scores, it appears that our named page
topics may have been easier than the others, but our home page topics may have been harder.
      </p>
      <p>To create a topic, we typically started by randomly selecting an English language page from
the EuroGOV collection. (The organizers had provided a languages.tar.gz file which listed the
languages detected in each document; we reduced this file to the 252,574 pages labelled just as
‘english’, then randomly selected pages from this list.) We alternated between creating named
page queries and home page queries.</p>
      <p>If we wanted a named page query, we tried to understand the random page well enough to
create an unambiguous query for it. Sometimes we rejected a page for being too obscure, and
tried browsing to a related page for which a clearer query could be made. (Browsing was done on
the live web; then we would find the new page in EuroGOV by extracting a phrase and searching
EuroGOV with SearchServer.) If browsing was not fruitful, we started over with a new random
page. Sometimes we started over because the area we were browsing looked too similar to an area
for which we had already made a query.</p>
      <p>For home page queries, usually the random start page was not a home page, so we would
typically try to browse to the closest home page for that page (again, typically on the live web,
by following links or truncating the url).</p>
      <p>To find duplicates, typically we extracted a phrase from the document and used SearchServer
to find other pages with that phrase, then checked those pages to confirm they were duplicates. If
a page had more duplicates than we were willing to record, we started over with a new page.</p>
      <p>To find translations, typically we browsed the live web for links to translated pages, then
used SearchServer to find them in EuroGOV. Finding translations took a lot of detective work.
Sometimes the url was the same except for a language tag, making it easy to find the translations
with SearchServer. Sometimes sites had direct links to the translations, which was also easy. But
sometimes sites just had links to the top-level page for each language, so we would see how to
browse down for English, and then try to do analogous browsing for the translation language,
grasping for clues such as possible word translations or similar pictures, to get to the proper
translated page. It’s quite possible we missed some translations.</p>
      <p>For the query itself, we tried to make it as realistic as possible (e.g. short and general)
but also unambiguous. This could depend on what other pages were available; e.g. for a biography
of Giuseppe Medici, it was enough to specify just ‘Giuseppe Medici’ as the query because no other
(English) pages focused on that person. Usually we tried candidate queries with the
organizer-provided engines or a web search engine to check for other valid interpretations of the
query that we had not expected, so that we could adjust the query accordingly.</p>
      <p>It seemed that, quite often, our query ended up fairly similar to the document title.
Table 11 shows that in the ‘p’ experiment (which isolates giving more weight to the title and
other meta properties), our queries did tend to be helped by the extra title weight, but the
other groups’ English queries actually benefited from it even more often. Unfortunately, we ran
out of time to walk through the topics for more languages, but for future reference we list the
per-topic tables for the remaining languages (in descending order by number of topics).</p>
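      <p>The idea behind the ‘p’ experiment can be illustrated with a toy field-weighted score.
The weight value and the scoring formula below are illustrative assumptions, not SearchServer’s
actual ranking function.</p>

```python
import math

def field_weighted_score(query_terms, title_terms, body_terms,
                         df, n_docs, title_weight=3.0):
    """Toy relevance score: each matched query term contributes its
    idf, and a title match is additionally multiplied by title_weight,
    so pages whose titles resemble the query rank higher.
    Illustrative only."""
    score = 0.0
    for t in query_terms:
        idf = math.log((n_docs + 1) / (df.get(t, 0) + 1))
        if t in body_terms:
            score += idf
        if t in title_terms:
            score += title_weight * idf
    return score
```

      <p>Under such a scheme, a page whose title contains the query terms outranks an otherwise
identical page that matches only in the body, which is why queries resembling the title tend to
benefit from the extra title weight.</p>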
    </sec>
  </body>
  <back>
  </back>
</article>