European Web Retrieval Experiments with
      Hummingbird SearchServerTM at CLEF 2005
                                        Stephen Tomlinson
                                          Hummingbird
                                     Ottawa, Ontario, Canada
                               stephen.tomlinson@hummingbird.com
                                  http://www.hummingbird.com/

                                         August 22, 2005


                                               Abstract
      Hummingbird participated in the mixed monolingual retrieval task of the WebCLEF
      Track of the Cross-Language Evaluation Forum (CLEF) 2005. In this task, the system
      was given 547 known-item queries from 11 languages (134 Spanish, 121 English, 59
      Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30 Russian, 16 Greek, 5
      Icelandic and 1 French). The goal was to find the desired page in the 82GB EuroGOV
      collection (3.4 million pages crawled from government sites of 27 European domains).
      We experimented with different techniques for web retrieval and analyzed the dif-
      ferences between them. We defined a new measure, First Relevant Score (FRS), to
      facilitate per-topic analysis, and we focused on analyzing Greek, Danish and Icelandic
      topics. We found that stopword processing was more important than anticipated, per-
      haps because words common in one language may tend to be overweighted by inverse
      document frequency in a mixed language collection. Extra weight on the document
      title helped significantly, and extra weight on less deep urls significantly helped home
      page queries. Stemming was of neutral impact on average, but could make a substantial
      difference for individual queries.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval

General Terms
Measurement, Performance, Experimentation

Keywords
Greek Retrieval, Danish Retrieval, Icelandic Retrieval, First Relevant Score, Per-Topic Analysis


1    Introduction
Hummingbird SearchServer1 is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.
   1 SearchServerTM , SearchSQLTM and Intuitive SearchingTM are trademarks of Hummingbird Ltd. All other

copyrights, trademarks and tradenames are the property of their respective owners.
    SearchServer works in Unicode internally [3] and supports most of the world’s major char-
acter sets and languages. The major conferences in text retrieval experimentation (CLEF [2],
NTCIR [4] and TREC [10]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.
    This (draft) paper describes experimental work with SearchServer for the task of finding known
home or named pages in 11 European languages (Spanish, English, Dutch, Portuguese, German,
Hungarian, Danish, Russian, Greek, Icelandic and French) using the WebCLEF 2005 test collec-
tion.


2     Methodology
For the submitted runs in June 2005, SearchServer experimental development build 7.0.0.707 was
used.

2.1    Data
The collection to be searched was the EuroGOV collection. It consisted of 3,589,502 pages crawled
from government sites of 27 European domains. Uncompressed, it was 88,062,007,676 bytes
(82.0 GB). The average document size was 24,533 bytes. More details on this collection are in
[8]. Note that we only indexed 3,417,463 of the pages because the organizers provided a “blacklist”
of 172,039 pages to omit (primarily binary documents).
    For the mixed monolingual task, there were 547 known-item queries from 11 different languages
(134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30
Russian, 16 Greek, 5 Icelandic and 1 French). Of these, 345 were named page queries and 242
were home page queries. More details on the mixed monolingual task are in the track overview
paper [9].

2.2    Indexing
Our indexing approach was based on the approach we used for TREC Web tasks the previous
three years (described in detail in [12]). Briefly, in addition to full-text indexing, the custom text
reader cTREC populated particular columns such as TITLE (if any), URL, URL_TYPE and
URL_DEPTH. The URL_TYPE was set to ROOT, SUBROOT, PATH or FILE, based on the
convention which worked well in TREC 2001 for the Twente/TNO group [15] on the entry page
finding task (also known as the home page finding task). The URL_DEPTH was set to a term
indicating the depth of the page in the site. Table 1 contains URL types and depths for example
URLs. The exact rules we used are given in [12].
    WebCLEF required a few indexing enhancements compared to TREC. In particular, it wouldn’t
suffice to assume all the pages were in the ASCII character set. We added a /cs option to our
cTREC text reader which used the first recognized ‘charset’ specification in the page (e.g. from
the meta http-equiv tag) to indicate from which character set to convert the page to Unicode
(Win_1252 was assumed if no charset was specified).
    For the baseline task, in which the system was not to make use of any of the topic metadata
such as the specified language of the query, we still indexed with English stopwords (even though
the majority of the documents were in other languages). We treated the apostrophe as a term
separator (which we normally do for languages other than English, but in this collection, it was
also a separator for English). No accents were indexed. English stemming was used on the table,
but SearchServer also indexed all the surface forms (after Unicode normalizations such as case
normalization), and the baseline runs just searched the surface forms, not the stems.
    For 2 of our submitted runs, we labelled the runs as making use of the topic and page language
metadata (which were always the same in the mixed monolingual task) along with the page’s
domain. For these runs, we created a set of language-specific indexes (one for each of the 11
query languages) which used a stemmer and stopfile for that language (for English and Icelandic,
                       Table 1: Examples of URL Type and Depth Values
            URL                          Type         Depth Depth Term
            http://nasa.gov/               ROOT              1      URLDEPTHA
            http://www.nasa.gov/           ROOT              1      URLDEPTHA
            http://jpl.nasa.gov/           ROOT              2      URLDEPTHAB
            http://fred.jpl.nasa.gov/      ROOT              3      URLDEPTHABC
            http://nasa.gov/jpl/           SUBROOT           2      URLDEPTHAB
            http://nasa.gov/jpl/fred/      PATH              3      URLDEPTHABC
            http://nasa.gov/index.html     ROOT              1      URLDEPTHA
            http://nasa.gov/fred.html      FILE              2      URLDEPTHAB


we actually used the original baseline index, which had English stems and stopwords). For some
of the languages, because we were close to the submission deadline, we also skipped indexing
some of the domains to save time (e.g. for Greek, just the ‘gr’ and ‘eu.int’ subsets of EuroGOV
were included because it was known all the results were in the ‘gr’ domain) which would have
a (probably minor) effect on the inverse document frequencies (minor especially since we always
included the ‘eu.int’ subset in each index). For 9 of the languages (Danish, Dutch, English, French,
German, Greek, Portuguese, Russian and Spanish), the lexical stemmer in SearchServer (based
on internal stemming component 3.7.0.15) was used. For Hungarian, the Neuchatel stemmer [7]
was used (see our companion ad hoc retrieval paper [11] for details). For Icelandic, we used the
English index as previously mentioned. For Greek and Russian, we additionally enabled indexing
of a few accents because the stemmer was accent-sensitive. When processing queries for these
runs, the query was directed to the index for the specified language.

2.3    Searching
We executed 7 runs in June 2005, though only 5 were allowed to be submitted. All 7 are described
here. The first 4 runs were ‘baseline’ runs which did not use the topic metadata. The other 3
runs made use of the topic metadata (in particular, the domain, and for the last 2 runs, also the
language).
    humWC05none: This run was a plain content search of the baseline table. No inflections
were used. This run was the analog of the “none” runs described in our ad hoc retrieval pa-
per [11]. It used the ‘2:3’ relevance method and document length normalization (SET RELE-
VANCE_DLEN_IMP 500). The IS_ABOUT predicate was used instead of the CONTAINS
predicate (and hence the VECTOR_GENERATOR was set to blank to disable inflections instead
of the TERM_GENERATOR), but the relevance calculation was the same. (This run was not
submitted.)
    humWC05p run: This submitted run was the same as humWC05none except that it put addi-
tional weight on matches in the title, url, first heading and some meta tags, including extra weight
on matching the query as a phrase in these fields. Below is an example SearchSQL query. The
searches on the ALL_PROPS column (which contained a copy of the title, url, etc. as described
in [12]) are the difference from the humWC05none run. Note that the FT_TEXT column indexed
the content and also all of the non-content fields except for the URL. More details of the syntax
are explained in [13]. This run used the same approach as the TREC 2004 humW04pl run except
that linguistic inflections were disabled.

  SELECT RELEVANCE(’2:3’) AS REL, DOCNO
  FROM EGOV
  WHERE
   (ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
   (ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
   (FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
  ORDER BY REL DESC;

    humWC05dp run: This submitted run was the same as humWC05p except that it put additional
weight on urls of depth 4 or less (but not on the url type, though url types were still listed with
weight 0 as a way to prevent urls of depth greater than 4 from being excluded). Less deep urls
also received higher weight from inverse document frequency because (presumably) they are less
common. This run used the same approach as the TREC 2004 humW04dpl run except that
linguistic inflections were disabled. Below is an example WHERE clause:

  WHERE
  ((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
   (ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
   (FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
  ) AND (
   (URL_TYPE CONTAINS ’ROOT’ WEIGHT 0) OR
   (URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 0) OR
   (URL_TYPE CONTAINS ’PATH’ WEIGHT 0) OR
   (URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
   (URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )

    humWC05rdp run: This submitted run was the same as humWC05dp except that it put addi-
tional weight on the url type. This run used the same approach as the TREC 2004 humW04rdpl
run except that linguistic inflections were disabled. Below is an example WHERE clause:

  WHERE
  ((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
   (ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
   (FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
  ) AND (
   (URL_TYPE CONTAINS ’ROOT’ WEIGHT 10) OR
   (URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 10) OR
   (URL_TYPE CONTAINS ’PATH’ WEIGHT 10) OR
   (URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
   (URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
   (URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )

   humWC05dpD0 run: This run was the same as humWC05dp except that the domain infor-
mation of the topic metadata was used to restrict the search to the specified domain. Below is
an example of the domain filter added to the WHERE clause for a case in which the page was
known to be in the ‘it’ domain (which implied the DOCNO would contain ‘Eit’). This run was
not submitted.

  AND (DOCNO CONTAINS ’Eit’ WEIGHT 0)

    humWC05dpD run: This submitted run was the same as humWC05dpD0 except that the
language information of the topic metadata was used to direct the search to the table for the
specified language (i.e. the WHERE clause was the same as for humWC05dpD0, but the FROM
clause specified a different table). Inflections were still not used.
    humWC05dplD run: This submitted run was the same as humWC05dpD except that the
content and title searches included linguistic expansion from language-specific stemming (this was
done with SET VECTOR_GENERATOR ‘word!ftelp/inflect’; note that /decompound (applicable
to Dutch and German) is implied for /inflect with SET VECTOR_GENERATOR, unlike with
SET TERM_GENERATOR).

2.4    Evaluation Measures
If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal
Rank” (RR). For a topic, it is 1r where r is the rank of the first row for which a desired page is
found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of
the reciprocal ranks over all the topics.
    An experimental measure introduced in this paper (along with the companion ad hoc retrieval
paper [11]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the
rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is
1.081−r where r is the rank of the first row for which a desired page is found, or zero if a desired
page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of
1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank
3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS
is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53
and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of
potential result list reading the system saved the user to get to the first relevant, assuming that
users are less and less likely to continue reading as they get deeper into the result list.
    “Success@n” is the percentage of topics for which at least one relevant document was returned
in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval
differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s
impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and
Success@10.

2.5    Per-Topic Tables
The 7 runs allow us to isolate 6 ‘web techniques’ which are denoted as follows:

   • ‘p’ (extra weight for phrases in the Title and other properties plus extra weight for vector
     search on properties): The humWC05p score minus the humWC05none score.
   • ‘d’ (modest extra weight for less deep urls): The humWC05dp score minus the humWC05p
     score.
   • ‘r’ (strong extra weight for urls of root, subroot or path types): The humWC05rdp score
     minus the humWC05dp score.
   • ‘o’ (domain filtering): The humWC05dpD0 score minus the humWC05dp score.
   • ‘s’ (stopwords specific to the language and possibly accent-indexing and inverse document
     frequency changes): The humWC05dpD score minus the humWC05dpD0 score.
   • ‘l’ (linguistic expansion from stemming): The humWC05dplD score minus the humWC05dpD
     score.

    For the per-topic tables comparing 2 diagnostic runs (such as Table 3), the columns are as
follows:
   • “Expt” specifies the experiment. It starts with one of the above 6 web techniques, followed
     by ‘NP’ for named page queries or ‘HP’ for home page queries, optionally followed by the
     language code.
   • “∆FRS” is the difference of the (mean) first relevant score of the two runs being compared.
                        Table 2: Mean Scores of Submitted WebCLEF Runs
 Run                     FRS       Success@1         Success@5      Success@10                  MRR
 humWC05dplD             0.635     212/547 (39%)       315/547 (58%)       356/547 (65%)        0.478
 humWC05dpD              0.627     204/547 (37%)       314/547 (57%)       353/547 (65%)        0.471
 (humWC05dpD0)           0.603     197/547 (36%)       303/547 (55%)       343/547 (63%)        0.449
 humWC05rdp              0.589     195/547 (36%)       293/547 (54%)       330/547 (60%)        0.441
 humWC05dp               0.583     190/547 (35%)       292/547 (53%)       327/547 (60%)        0.433
 humWC05p                0.559     182/547 (33%)       276/547 (50%)       318/547 (58%)        0.415
 (humWC05none)           0.513     152/547 (28%)       253/547 (46%)       284/547 (52%)        0.365
 NP: dplD                0.726     139/305 (46%)       204/305 (67%)       229/305 (75%)        0.560
 NP: dpD                 0.720     142/305 (47%)       207/305 (68%)       228/305 (75%)        0.571
 NP: dpD0                0.698     141/305 (46%)       202/305 (66%)       223/305 (73%)        0.552
 NP: rdp                 0.662     132/305 (43%)       187/305 (61%)       206/305 (68%)        0.517
 NP: dp                  0.669     134/305 (44%)       192/305 (63%)       210/305 (69%)        0.526
 NP: p                   0.669     133/305 (44%)       193/305 (63%)       212/305 (70%)        0.527
 NP: none                0.648     119/305 (39%)       187/305 (61%)       203/305 (67%)        0.492
 HP: dplD                0.521     73/242 (30%)        111/242 (46%)       127/242 (52%)        0.375
 HP: dpD                 0.509      62/242 (26%)       107/242 (44%)       125/242 (52%)        0.345
 HP: dpD0                0.484      56/242 (23%)       101/242 (42%)       120/242 (50%)        0.319
 HP: rdp                 0.497      63/242 (26%)       106/242 (44%)       124/242 (51%)        0.345
 HP: dp                  0.474      56/242 (23%)       100/242 (41%)       117/242 (48%)        0.317
 HP: p                   0.420      49/242 (20%)        83/242 (34%)       106/242 (44%)        0.275
 HP: none                0.343      33/242 (14%)       66/242 (27%)         81/242 (33%)        0.205


    • “95% Conf” is an approximate 95% confidence interval for the difference (calculated from
      plus/minus twice the standard error of the mean difference). If zero is not in the interval,
      the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of
      neutral impact (on average), though if the average difference is small (e.g. <0.020) it may
      still be too minor to be considered “significant” in the magnitude sense.
    • “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
      compared to the second run. These numbers should always add to the number of topics in
      the particular experiment.
    • “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the topic
      number in brackets (the topic numbers range from 1 to 547). The first difference is the largest
      one of any topic (based on the absolute value). The third difference is the largest difference
      in the other direction (so the first and third differences give the range of differences observed
      in this experiment). The middle difference is the largest of the remaining differences (based
      on the absolute value).


3     Results of Web Experiments
Table 2 lists the mean scores of the 5 submitted runs (and the 2 other diagnostic runs in brackets).
It also lists the mean scores over just named page (NP) and home page (HP) queries.
    Table 3 isolates the differences in ‘first relevant score’ (FRS) between the runs of Table 2.

    • The ‘p’ technique (extra weight for phrases in the Title and other properties plus extra weight
      for vector search on properties) was of statistically significant benefit for both named pages
      and home pages, which is consistent with our TREC results [14] except that the benefit was
      larger at TREC.
                 Table 3: Impact of Web Techniques on First Relevant Score
    Expt      ∆FRS       95% Conf          vs.             3 Extreme Diffs (Topic)
    o-NP      0.029      ( 0.015, 0.042)    44-0-261        1.00 (285), 0.87 (292), 0.00 (289)
    s-NP       0.022     ( 0.007, 0.037)    43-24-238     −0.88 (527), 0.76 (477), 0.87 (116)
    p-NP       0.021     ( 0.007, 0.035)    65-28-212     −0.83 (292), −0.64 (477), 0.59 (415)
    l-NP      0.006     (−0.014, 0.025)     47-59-199      1.00 (112), 0.95 (402), −0.78 (157)
    d-NP       0.000    (−0.005, 0.005)     24-37-244      0.40 (377), 0.17 (524), −0.14 (423)
    r-NP      −0.008    (−0.018, 0.003)     18-49-238     −0.87 (469), −0.46 (528), 0.68 (418)
    p-HP       0.077     ( 0.050, 0.105)    82-19-141      1.00 (101), 0.98 (313), −0.92 (435)
    d-HP       0.054     ( 0.032, 0.075)    64-20-158       1.00 (453), 0.91 (52), −0.40 (290)
    s-HP       0.025     ( 0.005, 0.044)    53-21-168      0.91 (39), −0.76 (346), −0.79 (20)
    r-HP       0.023    (−0.009, 0.054)     53-48-141     −1.00 (148), −0.93 (246), 0.92 (32)
    l-HP      0.012     (−0.011, 0.036)     41-50-151      0.96 (123), 0.93 (124), −0.68 (324)
    o-HP      0.010      ( 0.003, 0.017)    22-0-220        0.43 (432), 0.40 (507), 0.00 (546)


   • The ‘d’ technique (modest extra weight for less deep urls) was of statistically significant
     benefit for home pages and neutral on average for named pages, which is consistent with our
     TREC results except that the benefit for home pages was larger at TREC.

   • The ‘r’ technique (strong extra weight for urls of root, subroot or path types) was less
     detrimental than we expected for named pages and less helpful than we expected for home
     pages compared to our TREC results.
   • The ‘o’ technique (domain filtering), as expected, never caused the score to go down on any
     topic (as the ‘vs.’ column shows) because it just included rows from the known domain. But
     the benefit was not large on average, so apparently the unfiltered queries usually were not
     confused much by the extra domains.

   • The ‘s’ technique (stopwords specific to the language and possibly accent-indexing and in-
     verse document frequency changes) was a surprise in that it led to a statistically significant
     benefit for both named pages and home pages. We look at this more below.

   • The ‘l’ technique (linguistic expansion from stemming) was of neutral impact on average,
     but it could make a substantial difference for individual queries as we will see below.

    In the sections that follow, we focus on Greek, Danish and Icelandic because this is the first
time we have had judged test collections for these languages. In partciular, we focus on the impact
of the ‘s’ and ‘l’ techniques, i.e. the impacts of stopwords (and accents) and stemming. For English,
we compare the scores on our own contributed topics to the other English topics. The last section
lists the per-topic tables for the remaining languages in descending order by number of topics, for
future reference.
                   Table 4: Mean Scores of WebCLEF Runs on Greek Queries
          Run              FRS     Success@1    Success@5    Success@10            MRR
          dplD-NP-EL      0.536     3/11 (27%)      5/11 (45%)      6/11 (55%)      0.363
          dpD0-NP-EL      0.442     3/11 (27%)      5/11 (45%)      5/11 (45%)      0.316
          dpD-NP-EL       0.398     1/11 ( 9%)      4/11 (36%)      5/11 (45%)      0.206
          dp-NP-EL        0.306     3/11 (27%)      3/11 (27%)      3/11 (27%)      0.279
          rdp-NP-EL       0.297     2/11 (18%)      3/11 (27%)      3/11 (27%)      0.233
          p-NP-EL         0.291     3/11 (27%)      3/11 (27%)      3/11 (27%)      0.277
          none-NP-EL      0.287     2/11 (18%)      3/11 (27%)      3/11 (27%)      0.232
          dpD0-HP-EL      0.657      2/5 (40%)      3/5 (60%)        3/5 (60%)      0.483
          dplD-HP-EL      0.571      2/5 (40%)       3/5 (60%)       3/5 (60%)      0.467
          rdp-HP-EL       0.571      2/5 (40%)      3/5 (60%)        3/5 (60%)      0.467
          dp-HP-EL        0.571      2/5 (40%)       3/5 (60%)       3/5 (60%)      0.467
          dpD-HP-EL       0.532      1/5 (20%)       3/5 (60%)       3/5 (60%)      0.340
          p-HP-EL         0.480      1/5 (20%)      2/5 (40%)        3/5 (60%)      0.289
          none-HP-EL      0.430      1/5 (20%)       2/5 (40%)      2/5 (40%)       0.278


           Table 5: Impact of Web Techniques on First Relevant Score, Greek Queries
   Expt          ∆FRS         95% Conf        vs.           3 Extreme Diffs (Topic)
   l-NP-EL       0.138      (−0.056, 0.333)     5-2-4    1.00 (112), 0.34 (395), −0.15 (403)
   o-NP-EL       0.136      (−0.015, 0.288)     3-0-8     0.74 (403), 0.40 (395), 0.00 (266)
   d-NP-EL        0.015     (−0.016, 0.046)    1-0-10     0.17 (524), 0.00 (151), 0.00 (266)
   p-NP-EL        0.004     (−0.012, 0.020)     1-1-9    0.07 (184), 0.00 (151), −0.03 (524)
   r-NP-EL       −0.009     (−0.024, 0.005)     0-2-9    −0.07 (184), −0.03 (524), 0.00 (266)
   s-NP-EL       −0.045     (−0.117, 0.028)     1-4-6    −0.34 (395), −0.14 (498), 0.13 (445)
   d-HP-EL        0.092     (−0.092, 0.276)     1-0-4      0.46 (366), 0.00 (271), 0.00 (25)
   o-HP-EL       0.086      (−0.086, 0.258)     1-0-4      0.43 (432), 0.00 (366), 0.00 (25)
   p-HP-EL        0.050     (−0.050, 0.150)     1-0-4      0.25 (366), 0.00 (271), 0.00 (25)
   l-HP-EL       0.039      (−0.012, 0.090)     2-0-3      0.12 (366), 0.07 (271), 0.00 (25)
   r-HP-EL       0.000           n/a           0-0-5       0.00 (271), 0.00 (366), 0.00 (25)
   s-HP-EL       −0.125     (−0.316, 0.066)     1-2-2    −0.43 (432), −0.26 (366), 0.07 (271)


3.1    Greek Retrieval
Table 4 lists the mean scores for the 11 Greek named page queries and 5 Greek home page
queries. The top-scoring runs used stemming (run humWC05dplD) or disabled accent-indexing
(run humWC05dplD0). The run with accent-indexing and not stemming (humWC05dpD) did
not score as highly on average. Table 5 shows that the ‘l’ technique (stemming, i.e. the dplD
score minus the dpD score) was positive on average, while the ‘s’ factor (the dpD score minus
the dpD0 score, primarily isolating the impact of stopwords specific to the language, including
specifying accent-indexing in the Greek case) was negative, and it lists the topics most affected
by each technique in each direction, which we examine below. (In the below topic-analysis, the
translations are based partly on the official topic translations and partly on the online Greek-to-
English translation service at [1]).
    WC0112: Table 5 shows that the biggest impact of Greek stemming was on topic 112 (Pl rhs
lÐa twn upourg¸n kai ufupourg¸n ìlwn upourgeÐwn ths Ellhnik s kubèrnhs (List of minis-
ters and deputy ministers for all the ministries of the Greek government)). The desired page was
not retrieved in the top-50 without inflecting because the key query terms were plurals (upourg¸n
(ministers), ufupourg¸n (undersecretaries), upourgeÐwn (ministries)) while the desired page just
contained singular forms (Upourgìs (Minister), Ufupourgìs (Undersecretary), UpourgeÐo (Min-
istry)).
     WC0395: Table 5 shows that the next biggest impact of Greek stemming was on topic 395 (O
'Ellhnas prwjupourgìs kai to m num tou (The Greek Prime Minister and his message)). With
stemming, the first relevant was found at rank 13 instead of 39, a 34 point increase in FRS (in
the reciprocal rank measure, this would just be a 5 point increase). Without stemming, the only
matching word was tou (his), which probably should have been a stopword. With stemming, the
query word prwjupourgìs (Prime Minister) matched the document’s variant (PrwjupourgoÔ).
Because we enabled indexing of Greek accents for our lexical Greek stemmer, the query word
m num (message) did not match the document form M numa (which did not include an accent on
the last character; the first letter is just an lowercase-uppercase difference which all runs handled by
normalizing Unicode to uppercase). Note that the humWC05dpD0 run did match M numa because
accent-indexing was not enabled for this run; presumably this is why the s-NP-EL line of Table
5 shows that switching to the Greek-specific stopfile (which enabled accent indexing) decreased
FRS 34 points for this topic. For most languages, our lexical stemmers are accent-insensitive; we
should investigate doing the same for Greek.
     WC0432: Table 5 shows that the biggest impact of switching to the Greek-specific stopfile was
a detrimental impact on topic 432 (EÐdos Ellhnik s iolÐdas gia th nèleu gia to mèllon
ths Eur¸phs (Greek home page of the convention for the future of Europe)). The desired page was
found at rank 12 without accent-indexing but was not retrieved in the top-50 with accent-indexing.
The humWC05dpD0 run matched the document title terms which were in uppercase and did not
have accents, particularly SUNELEUSH (ASSEMBLY), MELLON (FUTURE) and EURWPHS
(EUROPE). (The corresponding query words had accents: nèleu (assembly), mèllon (future)
and Eur¸phs (Europe)). This issue would presumably impair the ‘p’ web technique (extra weight
on properties such as the title) because title words are often in uppercase and apparently in Greek
uppercase words often omit the accents. (Incidentally, the o-HP-EL line of Table 5 shows that
domain filtering (restricting to the .gr domain) was useful for this query; without it, even without
accent-indexing, the retrieved pages were mostly from the .eu.int domain.)
     WC0445: Table 5 shows that the biggest positive impact of switching to the Greek-specific stop-
file was on topic 445 (PlhroforÐes epikoinwnÐas ìlwn twn upourgeÐwn ths Ellhnik s kubèrnhs
(Contact information of all the ministries of the Greek government)). The reason seems to be that
the non-content words in the query (such as twn (of) and ths (her)) generated spurrious matches
in the humWC05dpD0 run (which did not use Greek-specific stopwords), pushing down the desired
page from rank 28 to beyond the top-50. Normally, common words have little effect on the ranking
because they have a low inverse document frequency (idf), but in this mixed language collection,
common words in the Greek documents are still fairly uncommon overall, and hence get relatively
more weight. This topic illustrates why stopword processing may be of more importance in mixed
language collections than in single language collections.
     Even though there were just 16 Greek topics, with careful experimental setup and detailed per-
topic analysis, we learned a lot about Greek web search in a mixed language collection. Stemming
can be quite helpful, accent mismatches are common (especially in the important Title field of web
documents), and stopwords common in one language may be over-weighted in a mixed language
collection by traditional idf formulations.
                  Table 6: Mean Scores of WebCLEF Runs on Danish Queries
       Run              FRS      Success@1      Success@5     Success@10              MRR
       dplD-NP-DA         0.807    12/19 (63%)       14/19 (74%)     15/19 (79%)      0.693
       dpD-NP-DA          0.798    11/19 (58%)       15/19 (79%)     16/19 (84%)      0.661
       dpD0-NP-DA         0.759    11/19 (58%)       14/19 (74%)     15/19 (79%)      0.632
       p-NP-DA            0.758    10/19 (53%)       13/19 (68%)     15/19 (79%)      0.616
       dp-NP-DA           0.754    11/19 (58%)       13/19 (68%)     15/19 (79%)      0.629
       rdp-NP-DA          0.743    10/19 (53%)       13/19 (68%)     15/19 (79%)      0.593
       none-NP-DA         0.704     9/19 (47%)       12/19 (63%)     14/19 (74%)      0.550
       rdp-HP-DA          0.336     1/11 ( 9%)        2/11 (18%)     4/11 (36%)       0.158
       dpD-HP-DA          0.320     1/11 ( 9%)        2/11 (18%)      4/11 (36%)      0.147
       dpD0-HP-DA         0.310     0/11 ( 0%)        2/11 (18%)     3/11 (27%)       0.108
       dp-HP-DA           0.310     0/11 ( 0%)       2/11 (18%)       3/11 (27%)      0.108
       dplD-HP-DA         0.301     1/11 ( 9%)        1/11 ( 9%)      3/11 (27%)      0.135
       p-HP-DA            0.242     0/11 ( 0%)        1/11 ( 9%)      2/11 (18%)      0.067
       none-HP-DA         0.163     0/11 ( 0%)        1/11 ( 9%)      1/11 ( 9%)      0.052


          Table 7: Impact of Web Techniques on First Relevant Score, Danish Queries
   Expt          ∆FRS        95% Conf        vs.            3 Extreme Diffs (Topic)
   p-NP-DA        0.053      ( 0.003, 0.104)     7-1-11    0.39 (311), 0.25 (329), −0.07 (264)
   s-NP-DA        0.040     (−0.037, 0.116)      2-1-16    0.71 (233), 0.12 (329), −0.08 (58)
   l-NP-DA        0.008     (−0.050, 0.067)      4-1-14    −0.43 (329), 0.13 (311), 0.25 (219)
   o-NP-DA       0.005      (−0.002, 0.013)      2-0-17     0.05 (329), 0.04 (58), 0.00 (481)
   d-NP-DA       −0.004     (−0.035, 0.027)      2-4-13    0.14 (454), 0.14 (329), −0.14 (58)
   r-NP-DA       −0.011     (−0.027, 0.005)      0-3-16   −0.14 (211), −0.05 (329), 0.00 (232)
   p-HP-DA        0.079      ( 0.013, 0.144)      6-0-5       0.27 (80), 0.23 (48), 0.00 (429)
   d-HP-DA        0.068     (−0.076, 0.211)       2-1-8    0.78 (286), 0.02 (392), −0.05 (53)
   r-HP-DA        0.027     (−0.013, 0.066)       3-0-8     0.21 (385), 0.07 (286), 0.00 (429)
   s-HP-DA        0.011     (−0.034, 0.055)       4-3-4     0.15 (80), 0.07 (286), −0.13 (48)
   o-HP-DA       0.000             n/a           0-0-11      0.00 (317), 0.00 (53), 0.00 (429)
   l-HP-DA       −0.019     (−0.069, 0.030)       3-2-6   −0.21 (317), −0.13 (48), 0.08 (392)


3.2    Danish Retrieval
WC0233: Table 7 shows that the biggest impact of switching to the Danish-specific stopfile was a
71 point increase in FRS on topic 233 (presserum europæiske kantor for bekæmpelse af svig (press
room of the european anti fraud office)). Without having ‘af’ as a stopword, the first relevant rank
fell from 2 to 21. This appears to be a similar finding to Greek topic WC0445 in that a common
word in one language was uncommon enough in the mixed language collection to be assigned a
high enough inverse document frequency to cause trouble. (Our Danish stoplist was based on
Porter’s [5].) Incidentally, with stemming enabled, the rank increased from 2 to 1 for this topic,
in part because of an extra ‘bekaempelse’ match in the meta keywords and also from an extra
‘Europaeiske’ match in body. It’s good to see that the SearchServer stemmer handled the æ vs.
ae variation of Danish (the query words used the one-character ligature (æ) while the document
words used two letters (‘a’ and ‘e’)).
     WC0392: Another interesting Danish stemming case was on topic 392 (Rigsombudsmanden i
Grønland (the high commissioner of greenland)). With stemming, the rank of the desired page
increased from 24 to 19. The extra matches from stemming were ‘Rigsombudsmand’ and ‘Groen-
land’ (the latter occurred in the filenames of img tags, which we indexed). Again, it’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the two-letter variant (‘oe’).
    WC0317: On topic 317 (økologisk landbrug i europa (organic farming in europe)), the rank
of the desired page actually fell from 4 to 8 with stemming, even though the additional matches
of ‘okologisk’ (in the meta keywords) and ‘landbrugets’ look proper. (As an aside, the compound
‘landbrugspolitik’ was not matched; we’re unsure in general how common compound words are in
Danish.) The relevance scores of the top documents were close together for this topic, so the fall in
rank appears to be a chance result. Note that the cTREC text reader used for these experiments
did not normalize the html entity reference ‘&Oslash;’ to Ø (or most other entity references for
that matter, which may have impaired the overall results for some languages). It’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the one-letter variant (‘o’).
                 Table 8: Mean Scores of WebCLEF Runs on Icelandic Queries
          Run             FRS     Success@1    Success@5    Success@10     MRR
          dpD-NP-IS        0.745     2/4 (50%)      2/4 (50%)      3/4 (75%)       0.550
          dpD0-NP-IS       0.745     2/4 (50%)      2/4 (50%)      3/4 (75%)       0.550
          dp-NP-IS         0.731     2/4 (50%)      2/4 (50%)      3/4 (75%)       0.548
          dplD-NP-IS       0.727     1/4 (25%)      2/4 (50%)      3/4 (75%)       0.425
          p-NP-IS          0.727     2/4 (50%)      2/4 (50%)      3/4 (75%)       0.546
          rdp-NP-IS        0.670     2/4 (50%)      2/4 (50%)      2/4 (50%)       0.534
          none-NP-IS       0.629     2/4 (50%)      2/4 (50%)      2/4 (50%)       0.527
          rdp-HP-IS        0.500     0/1 ( 0%)      0/1 ( 0%)      1/1 (100%)      0.100
          dplD-HP-IS       0.271     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.056
          dpD-HP-IS        0.271     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.056
          dpD0-HP-IS       0.271     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.056
          dp-HP-IS         0.271     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.056
          p-HP-IS          0.271     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.056
          none-HP-IS       0.232     0/1 ( 0%)      0/1 ( 0%)       0/1 ( 0%)      0.050


         Table 9: Impact of Web Techniques on First Relevant Score, Icelandic Queries
      Expt        ∆FRS       95% Conf        vs.           3 Extreme Diffs (Topic)
      p-NP-IS     0.098      (−0.018, 0.215)     2-0-2       0.22 (456), 0.17 (46), 0.00 (6)
      o-NP-IS      0.014     (−0.015, 0.043)     1-0-3       0.06 (46), 0.00 (456), 0.00 (6)
      d-NP-IS     0.004      (−0.025, 0.034)     1-1-2     0.04 (456), 0.00 (488), −0.03 (46)
      s-NP-IS     0.000           n/a            0-0-4       0.00 (46), 0.00 (456), 0.00 (6)
      l-NP-IS     −0.019     (−0.056, 0.019)     0-1-3      −0.07 (488), 0.00 (6), 0.00 (46)
      r-NP-IS     −0.061     (−0.137, 0.015)     0-2-2    −0.15 (456), −0.09 (46), 0.00 (488)


3.3    Icelandic Retrieval
For Icelandic, we used English stopwords and English stemming. We review some topics to see
what can be learned about Icelandic retrieval.
     WC0488: Table 9 shows that the only topic on which English stemming made a difference
was topic 488 (framboð ferskvatns í evrópu (Fresh water supplies in europa)). The desired page’s
rank fell from 1 to 2 with English stemming because it matched the word ‘Ferskvatn’ which was
not in the desired page (the English lexical stemmer was augmented with a stem guesser for
unrecognized words). A variant in the desired page, ‘ferskvatni’, was not matched by English
stemming. It appears that ‘í’ is a potential Icelandic stopword (‘i’ actually was not in our English
list though arguably should be). This topic also shows that Icelandic uses the small letter Eth (ð).
SearchServer case normalizes ð to the capital letter Eth (Ð).
     WC0456: In topic 456 (upplýsingar um europol (europol factsheet)), English stemming missed
apparent variants to the query word ‘upplýsingar’ such as ‘Upplýsingasíða’ and ‘upplýsingamál’.
‘um’ appears to be a potential Icelandic stopword.
     WC0243: In (home page) topic 243 (umhverfisstofnun evrópu (european environment agency)),
we noticed that some web pages used entity references such as ‘&eth;’ and ‘&thorn;’ and ‘&yacute;’
which our cTREC text reader did not normalize to the corresponding character, possibly impairing
results for some queries.
     We were disappointed that the Icelandic thorn (lowercase þ or uppercase Þ) was not used in
any of the topic words. But overall, even with just 5 topics in the test set, we have learned at
least that an Icelandic stemmer would potentially be helpful for Icelandic retrieval.
                 Table 10: Mean Scores of WebCLEF Runs on English Queries
   Run                     FRS     Success@1     Success@5       Success@10           MRR
   dplD-NP-EN-other       0.761     25/56 (45%)     38/56 (68%)      44/56 (79%)      0.570
   dpD-NP-EN-other        0.737     26/56 (46%)     39/56 (70%)      42/56 (75%)      0.579
   dpD0-NP-EN-other       0.737     26/56 (46%)     39/56 (70%)      42/56 (75%)      0.579
   dp-NP-EN-other         0.690     24/56 (43%)     37/56 (66%)      38/56 (68%)      0.541
   p-NP-EN-other          0.690     23/56 (41%)     37/56 (66%)      38/56 (68%)      0.535
   rdp-NP-EN-other        0.684     24/56 (43%)     38/56 (68%)      38/56 (68%)      0.531
   none-NP-EN-other       0.678     23/56 (41%)     37/56 (66%)      38/56 (68%)      0.525
   dplD-HP-EN-other       0.652     14/35 (40%)     22/35 (63%)      24/35 (69%)      0.499
   dpD-HP-EN-other        0.572     12/35 (34%)     18/35 (51%)      21/35 (60%)      0.432
   dpD0-HP-EN-other       0.572     12/35 (34%)     18/35 (51%)      21/35 (60%)      0.432
   dp-HP-EN-other         0.544     12/35 (34%)     18/35 (51%)      19/35 (54%)      0.426
   p-HP-EN-other          0.531     12/35 (34%)     17/35 (49%)      18/35 (51%)      0.413
   rdp-HP-EN-other        0.472     11/35 (31%)     15/35 (43%)      16/35 (46%)      0.380
   none-HP-EN-other       0.399      9/35 (26%)     12/35 (34%)      14/35 (40%)      0.302
   dplD-NP-EN-hum         0.956     11/15 (73%)     14/15 (93%)      15/15 (100%)     0.832
   dpD-NP-EN-hum          0.956     11/15 (73%)     14/15 (93%)      15/15 (100%)     0.832
   dpD0-NP-EN-hum         0.956     11/15 (73%)     14/15 (93%)      15/15 (100%)     0.832
   dp-NP-EN-hum           0.836      8/15 (53%)     12/15 (80%)      13/15 (87%)      0.651
   rdp-NP-EN-hum          0.833     8/15 (53%)      12/15 (80%)       13/15 (87%)     0.651
   p-NP-EN-hum            0.832     9/15 (60%)      12/15 (80%)       13/15 (87%)     0.682
   none-NP-EN-hum         0.803      9/15 (60%)     12/15 (80%)      12/15 (80%)      0.686
   rdp-HP-EN-hum          0.538     6/15 (40%)       8/15 (53%)        8/15 (53%)     0.461
   dplD-HP-EN-hum         0.521      7/15 (47%)      7/15 (47%)       7/15 (47%)      0.480
   dpD-HP-EN-hum          0.516      6/15 (40%)      7/15 (47%)        7/15 (47%)     0.432
   dpD0-HP-EN-hum         0.516     6/15 (40%)       7/15 (47%)        7/15 (47%)     0.432
   dp-HP-EN-hum           0.490      6/15 (40%)      7/15 (47%)       7/15 (47%)      0.427
   p-HP-EN-hum            0.464     6/15 (40%)       7/15 (47%)        7/15 (47%)     0.418
   none-HP-EN-hum         0.410      4/15 (27%)     6/15 (40%)        6/15 (40%)      0.323


3.4   English Topic Contributions
WebCLEF participants were requested to contribute at least 30 known-item topics. Each topic
consisted of a query, the correct answer page in EuroGOV, and a list of duplicate and translated
pages in EuroGOV. We contributed 30 English topics. Tables 10 and 11 separate the results for
our topics from the other English topics. Based on the scores, it appears that our named page
topics may have been easier than the others, but our home page topics may have been harder.
    To create a topic, we typically started by randomly selecting an English language page from
the EuroGOV collection. (The organizers had provided a languages.tar.gz file which listed the
languages detected in each document; we reduced this file to the 252,574 pages labelled just as
‘english’, then randomly selected pages from this list.) We alternated between creating named
page queries and home page queries.
    If we wanted a named page query, we tried to understand the random page well enough to
create an unambiguous query for it. Sometimes we rejected a page for being too obscure, and
tried browsing to a related page for which a clearer query could be made. (Browsing was done on
the live web; then we would find the new page in EuroGOV by extracting a phrase and searching
EuroGOV with SearchServer.) If browsing was not fruitful, we started over with a new random
page. Sometimes we started over because the area we were browsing looked too similar to an area
for which we had already made a query.
       Table 11: Impact of Web Techniques on First Relevant Score, English Queries
Expt             ∆FRS       95% Conf        vs.             3 Extreme Diffs (Topic)
o-NP-EN-oth      0.047    ( 0.008, 0.087)   12-0-44     0.87 (292), 0.54 (384), 0.00 (532)
l-NP-EN-oth     0.023    (−0.016, 0.062)    10-8-38     0.74 (2), 0.45 (479), −0.30 (165)
p-NP-EN-oth      0.011   (−0.030, 0.052)    15-5-36    −0.83 (292), −0.32 (38), 0.50 (76)
s-NP-EN-oth      0.000          n/a          0-0-56      0.00 (333), 0.00 (34), 0.00 (532)
d-NP-EN-oth      0.000   (−0.010, 0.011)     5-7-44     0.15 (314), 0.11 (38), −0.14 (423)
r-NP-EN-oth     −0.006   (−0.036, 0.024)    4-12-40    0.68 (418), −0.19 (91), −0.27 (88)
p-HP-EN-oth      0.133    ( 0.042, 0.224)   15-1-19    1.00 (101), 0.98 (313), −0.07 (436)
l-HP-EN-oth     0.080     ( 0.010, 0.150)    9-2-24      0.80 (1), 0.75 (275), −0.13 (246)
o-HP-EN-oth      0.028    ( 0.000, 0.056)    8-0-27      0.38 (287), 0.23 (85), 0.00 (539)
d-HP-EN-oth      0.012   (−0.003, 0.028)     6-5-24    0.18 (190), 0.13 (100), −0.07 (436)
s-HP-EN-oth      0.000          n/a          0-0-35      0.00 (275), 0.00 (85), 0.00 (539)
r-HP-EN-oth     −0.072   (−0.148, 0.004)    3-10-22   −0.93 (246), −0.86 (190), 0.17 (335)
o-NP-EN-hum     0.120    (−0.028, 0.268)    3-0-12      1.00 (285), 0.54 (129), 0.00 (538)
p-NP-EN-hum      0.029   (−0.030, 0.088)    2-1-12     0.43 (129), 0.08 (325), −0.07 (295)
d-NP-EN-hum      0.003   (−0.014, 0.021)    2-1-12     0.09 (325), 0.03 (129), −0.07 (139)
l-NP-EN-hum     0.000    (−0.015, 0.015)    1-1-13     0.07 (139), 0.00 (129), −0.07 (513)
s-NP-EN-hum      0.000          n/a         0-0-15       0.00 (295), 0.00 (94), 0.00 (538)
r-NP-EN-hum     −0.003   (−0.024, 0.018)    1-2-12    0.10 (325), −0.05 (513), −0.10 (129)
p-HP-EN-hum      0.054    ( 0.000, 0.109)   5-0-10      0.37 (408), 0.21 (167), 0.00 (507)
r-HP-EN-hum      0.048   (−0.058, 0.154)     3-3-9     0.69 (476), 0.21 (408), −0.23 (345)
o-HP-EN-hum     0.026    (−0.027, 0.080)    1-0-14      0.40 (507), 0.00 (167), 0.00 (207)
d-HP-EN-hum      0.025   (−0.003, 0.053)    4-1-10     0.17 (476), 0.12 (345), −0.01 (399)
l-HP-EN-hum     0.005    (−0.030, 0.040)    2-3-10     0.21 (408), 0.04 (476), −0.13 (507)
s-HP-EN-hum      0.000          n/a         0-0-15      0.00 (207), 0.00 (141), 0.00 (507)
    For home page queries, usually the random start page was not a home page, so we would
typically try to browse to the closest home page for that page (again, typically on the live web,
by following links or truncating the url).
    To find duplicates, typically we extracted a phrase from the document and used SearchServer
to find other pages with that phrase, then checked those pages to confirm they were duplicates. If
a page had more duplicates than we were willing to record, we started over with a new page.
    To find translations, typically we browsed the live web for links to translated pages, then
used SearchServer to find them in EuroGOV. Finding translations took a lot of detective work.
Sometimes the url was the same except for a language tag, making it easy to find the translations
with SearchServer. Sometimes sites had direct links to the translations, which was also easy. But
sometimes sites just had links to the top-level page for each language, so we would see how to
browse down for English, and then try to do analogous browsing for the translation language,
grasping for clues such as possible word translations or similar pictures, to get to the proper
translated page. It’s quite possible we missed some translations.
    For the query itself, we tried to make it as realistic as possible (e.g. short and general) but
also unambiguous. This could depend on what other pages were available; e.g. for a biography
of Giuseppe Medici, it was enough just to specify ‘Giuseppe Medici’ as the query because there
were no other (English) pages focused on that person. Usually we tried candidate queries with
the organizer-provided engines or a web search engine to see if there might be other valid inter-
pretations of the query we hadn’t expected, so that we could adjust the query accordingly.
    It seemed that a lot of times, our query ended up being fairly similar to the document title.
Table 11 shows that in the ‘p’ experiment (which isolates giving more weight to the title and other
meta properties), our queries did tend to be helped by weighting the title more. But the other
groups’ English queries actually benefited even more often from this weighting.
                 Table 12: Mean Scores of WebCLEF Runs on Spanish Queries
       Run              FRS      Success@1     Success@5     Success@10             MRR
       dpD-NP-ES         0.758    32/67 (48%)      47/67 (70%)     53/67 (79%)      0.595
       dplD-NP-ES        0.720    27/67 (40%)      42/67 (63%)     52/67 (78%)      0.529
       dpD0-NP-ES        0.670    26/67 (39%)      42/67 (63%)     47/67 (70%)      0.497
       p-NP-ES           0.650    25/67 (37%)      41/67 (61%)     46/67 (69%)      0.478
       dp-NP-ES          0.648    26/67 (39%)      40/67 (60%)     44/67 (66%)      0.486
       rdp-NP-ES         0.639    25/67 (37%)      39/67 (58%)     43/67 (64%)      0.475
       none-NP-ES        0.624    21/67 (31%)      39/67 (58%)     42/67 (63%)      0.433
       dplD-HP-ES        0.446    15/67 (22%)      26/67 (39%)     31/67 (46%)      0.297
       rdp-HP-ES         0.437    15/67 (22%)      26/67 (39%)     29/67 (43%)      0.307
       dpD-HP-ES         0.388    11/67 (16%)      21/67 (31%)     27/67 (40%)      0.240
       dpD0-HP-ES        0.369    11/67 (16%)      20/67 (30%)     25/67 (37%)      0.235
       dp-HP-ES          0.364    11/67 (16%)      20/67 (30%)     24/67 (36%)      0.234
       p-HP-ES           0.325     9/67 (13%)      19/67 (28%)     23/67 (34%)      0.201
       none-HP-ES        0.279     5/67 ( 7%)      13/67 (19%)     18/67 (27%)      0.142


         Table 13: Impact of Web Techniques on First Relevant Score, Spanish Queries
  Expt          ∆FRS        95% Conf         vs.            3 Extreme Diffs (Topic)
  s-NP-ES       0.087      ( 0.042, 0.133)    20-4-43       0.87 (116), 0.58 (344), −0.10 (45)
  p-NP-ES       0.026      ( 0.000, 0.052)   16-10-41      0.59 (118), 0.30 (200), −0.26 (449)
  o-NP-ES        0.022     ( 0.006, 0.038)    13-0-54       0.34 (489), 0.25 (502), 0.00 (544)
  d-NP-ES       −0.002    (−0.011, 0.007)    7-10-50      0.12 (309), −0.10 (84), −0.11 (502)
  r-NP-ES       −0.010    (−0.021, 0.002)    2-14-51       −0.20 (98), −0.18 (309), 0.19 (45)
  l-NP-ES       −0.037    (−0.075, 0.000)    7-21-39      −0.78 (157), −0.39 (330), 0.34 (483)
  r-HP-ES        0.073     ( 0.021, 0.126)   16-7-44         0.92 (32), 0.92 (542), −0.12 (13)
  l-HP-ES        0.058     ( 0.000, 0.115)   13-14-40      0.96 (123), 0.93 (124), −0.56 (397)
  p-HP-ES       0.045     (−0.003, 0.094)    15-8-44       0.88 (393), 0.78 (468), −0.54 (473)
  d-HP-ES       0.039      ( 0.012, 0.066)   16-3-48       0.50 (414), 0.43 (220), −0.07 (299)
  s-HP-ES       0.019      ( 0.002, 0.037)    15-4-48      0.31 (522), 0.27 (543), −0.18 (467)
  o-HP-ES        0.005    (−0.001, 0.010)     6-0-61        0.13 (130), 0.08 (154), 0.00 (543)


3.5    Other Languages
Unfortunately, we have run out of time to walk through topics for more languages. But for future
reference, we list the per-topic tables for the remaining languages (in descending order by number
of topics).
             Table 14: Mean Scores of WebCLEF Runs on Dutch Queries
   Run              FRS     Success@1      Success@5    Success@10            MRR
   dpD-NP-NL         0.958   26/34 (76%)       33/34 (97%)    33/34 (97%)     0.860
   p-NP-NL           0.952   27/34 (79%)       33/34 (97%)    33/34 (97%)     0.864
   dpD0-NP-NL        0.951   27/34 (79%)       33/34 (97%)    33/34 (97%)     0.865
   dp-NP-NL          0.946   26/34 (76%)       33/34 (97%)    33/34 (97%)     0.845
   none-NP-NL        0.936   26/34 (76%)       31/34 (91%)    33/34 (97%)     0.833
   dplD-NP-NL        0.918   24/34 (71%)       30/34 (88%)    31/34 (91%)     0.791
   rdp-NP-NL         0.903   25/34 (74%)       30/34 (88%)    32/34 (94%)     0.804
   dpD-HP-NL         0.723    9/25 (36%)       16/25 (64%)    19/25 (76%)     0.496
   dplD-HP-NL        0.688   10/25 (40%)       15/25 (60%)    18/25 (72%)     0.488
   dpD0-HP-NL        0.649    8/25 (32%)       15/25 (60%)    16/25 (64%)     0.445
   dp-HP-NL          0.649    8/25 (32%)       15/25 (60%)    16/25 (64%)     0.445
   p-HP-NL           0.617    8/25 (32%)       13/25 (52%)    18/25 (72%)     0.419
   rdp-HP-NL         0.607    7/25 (28%)       14/25 (56%)    16/25 (64%)     0.390
   none-HP-NL        0.571    5/25 (20%)       13/25 (52%)    15/25 (60%)     0.348


     Table 15: Impact of Web Techniques on First Relevant Score, Dutch Queries
Expt        ∆FRS         95% Conf       vs.            3 Extreme Diffs (Topic)
p-NP-NL      0.016     (−0.005, 0.037)     3-0-31     0.30 (296), 0.18 (516), 0.00 (547)
s-NP-NL      0.006     (−0.006, 0.018)     3-1-30    0.15 (547), 0.07 (296), −0.07 (308)
o-NP-NL     0.005      (−0.001, 0.012)     3-0-31     0.07 (269), 0.07 (438), 0.00 (386)
d-NP-NL     −0.006     (−0.016, 0.005)     1-3-30    −0.12 (516), −0.07 (3), 0.07 (269)
l-NP-NL     −0.040     (−0.090, 0.011)     2-5-27   −0.60 (509), −0.50 (338), 0.13 (547)
r-NP-NL     −0.043     (−0.103, 0.017)     2-3-29   −0.87 (469), −0.46 (528), 0.07 (269)
s-HP-NL      0.075     (−0.033, 0.183)     6-4-15     0.91 (39), 0.68 (486), −0.30 (506)
p-HP-NL      0.046     (−0.017, 0.108)     7-3-15      0.50 (90), 0.41 (75), −0.19 (140)
d-HP-NL      0.032     (−0.028, 0.092)     7-5-13    −0.40 (290), 0.32 (358), 0.36 (535)
o-HP-NL     0.000           n/a            0-0-25      0.00 (221), 0.00 (26), 0.00 (546)
l-HP-NL     −0.035     (−0.097, 0.026)     2-5-18   −0.68 (324), −0.21 (140), 0.17 (517)
r-HP-NL     −0.041     (−0.155, 0.072)     2-8-15       0.84 (39), 0.52 (21), −0.59 (67)
           Table 16: Mean Scores of WebCLEF Runs on Portuguese Queries
   Run              FRS      Success@1     Success@5     Success@10           MRR
   dplD-NP-PT        0.579     5/30 (17%)     16/30 (53%)    18/30 (60%)      0.325
   dpD0-NP-PT        0.534     4/30 (13%)     13/30 (43%)    20/30 (67%)      0.276
   dpD-NP-PT         0.551    6/30 (20%)      13/30 (43%)    18/30 (60%)      0.328
   rdp-NP-PT         0.532     4/30 (13%)     11/30 (37%)    16/30 (53%)      0.275
   dp-NP-PT          0.516     4/30 (13%)     12/30 (40%)    18/30 (60%)      0.264
   p-NP-PT           0.511     4/30 (13%)     12/30 (40%)    18/30 (60%)      0.270
   none-NP-PT        0.469     2/30 ( 7%)     11/30 (37%)    17/30 (57%)      0.219
   rdp-HP-PT         0.665    14/29 (48%)     18/29 (62%)    19/29 (66%)      0.546
   dpD-HP-PT         0.628    13/29 (45%)     16/29 (55%)    18/29 (62%)      0.507
   dplD-HP-PT        0.621    14/29 (48%)     16/29 (55%)    17/29 (59%)      0.522
   dpD0-HP-PT        0.545    11/29 (38%)     14/29 (48%)    17/29 (59%)      0.438
   dp-HP-PT          0.544    11/29 (38%)     14/29 (48%)    17/29 (59%)      0.438
   p-HP-PT           0.435     8/29 (28%)     10/29 (34%)    13/29 (45%)      0.324
   none-HP-PT        0.263     3/29 (10%)      6/29 (21%)     7/29 (24%)      0.148


    Table 17: Impact of Web Techniques on First Relevant Score, Portuguese Queries
Expt         ∆FRS        95% Conf         vs.            3 Extreme Diffs (Topic)
p-NP-PT     0.041      (−0.018, 0.101)    8-4-18      0.59 (415), 0.58 (303), −0.19 (4)
l-NP-PT      0.028     (−0.037, 0.094)    5-9-16      0.86 (248), 0.30 (226), −0.17 (4)
o-NP-PT      0.018     (−0.009, 0.046)    3-0-27      0.40 (377), 0.08 (216), 0.00 (529)
s-NP-PT     0.017      (−0.007, 0.041)    9-5-16      0.19 (415), 0.14 (215), −0.10 (4)
r-NP-PT      0.016     (−0.017, 0.049)    5-6-19       0.34 (69), 0.23 (377), −0.10 (4)
d-NP-PT     0.005      (−0.024, 0.034)    2-4-24    0.40 (377), −0.07 (215), −0.08 (303)
p-HP-PT     0.172       ( 0.081, 0.262)   15-0-14     0.71 (390), 0.71 (163), 0.00 (260)
r-HP-PT      0.121      ( 0.034, 0.207)   13-2-14     0.86 (362), 0.72 (96), −0.07 (52)
d-HP-PT     0.110       ( 0.028, 0.192)   11-0-18      0.91 (52), 0.75 (381), 0.00 (260)
s-HP-PT     0.083       ( 0.031, 0.135)   10-0-19      0.43 (362), 0.37 (96), 0.00 (164)
o-HP-PT      0.000     (−0.001, 0.001)    1-0-28       0.01 (326), 0.00 (33), 0.00 (545)
l-HP-PT     −0.006     (−0.018, 0.005)    1-4-24    −0.10 (526), −0.08 (382), 0.07 (164)
            Table 18: Mean Scores of WebCLEF Runs on German Queries
   Run             FRS      Success@1      Success@5    Success@10            MRR
   dplD-NP-DE        0.706   16/34 (47%)       24/34 (71%)    25/34 (74%)     0.556
   dpD-NP-DE         0.628   13/34 (38%)       21/34 (62%)    22/34 (65%)     0.484
   rdp-NP-DE         0.606   15/34 (44%)       19/34 (56%)    22/34 (65%)     0.495
   dpD0-NP-DE        0.602   14/34 (41%)       19/34 (56%)    21/34 (62%)     0.480
   dp-NP-DE          0.595   14/34 (41%)       19/34 (56%)    21/34 (62%)     0.478
   p-NP-DE           0.591   14/34 (41%)       19/34 (56%)    21/34 (62%)     0.479
   none-NP-DE        0.589   12/34 (35%)       19/34 (56%)    20/34 (59%)     0.444
   dpD-HP-DE         0.526    4/23 (17%)       11/23 (48%)    12/23 (52%)     0.306
   dpD0-HP-DE        0.518    3/23 (13%)       10/23 (43%)    13/23 (57%)     0.259
   dp-HP-DE          0.512    3/23 (13%)       10/23 (43%)    13/23 (57%)     0.257
   dplD-HP-DE        0.472    4/23 (17%)       10/23 (43%)    11/23 (48%)     0.295
   rdp-HP-DE         0.466    5/23 (22%)       9/23 (39%)     12/23 (52%)     0.296
   p-HP-DE           0.451    2/23 ( 9%)        8/23 (35%)    11/23 (48%)     0.219
   none-HP-DE        0.385    2/23 ( 9%)        7/23 (30%)    10/23 (43%)     0.189


     Table 19: Impact of Web Techniques on First Relevant Score, German Queries
Expt         ∆FRS        95% Conf        vs.            3 Extreme Diffs (Topic)
l-NP-DE     0.078      (−0.005, 0.161)     8-5-21    0.95 (402), 0.79 (212), −0.13 (288)
s-NP-DE     0.026      (−0.021, 0.074)     6-5-23    0.76 (477), 0.25 (351), −0.07 (339)
r-NP-DE     0.011      (−0.019, 0.041)     3-4-27    0.46 (477), 0.07 (347), −0.14 (197)
o-NP-DE     0.007      (−0.006, 0.020)     3-0-31     0.21 (197), 0.02 (316), 0.00 (536)
d-NP-DE     0.004      (−0.010, 0.017)     3-5-26     0.16 (197), 0.09 (536), −0.07 (95)
p-NP-DE     0.002      (−0.047, 0.051)     7-5-22    −0.64 (477), 0.25 (95), 0.31 (351)
p-HP-DE     0.066      (−0.018, 0.149)     7-4-12     0.86 (300), 0.28 (241), −0.11 (10)
d-HP-DE     0.062      (−0.034, 0.157)     5-2-16     1.00 (453), 0.35 (47), −0.13 (20)
s-HP-DE     0.007      (−0.084, 0.099)     7-3-13     −0.79 (20), 0.32 (412), 0.42 (47)
o-HP-DE     0.006       ( 0.000, 0.013)    4-0-19     0.05 (433), 0.04 (412), 0.00 (453)
r-HP-DE     −0.047     (−0.126, 0.032)     2-9-12    −0.72 (20), −0.29 (236), 0.42 (47)
l-HP-DE     −0.054     (−0.114, 0.007)     4-10-9   −0.47 (133), −0.40 (214), 0.19 (396)
           Table 20: Mean Scores of WebCLEF Runs on Hungarian Queries
   Run              FRS     Success@1      Success@5    Success@10            MRR
   p-NP-HU           0.766   12/19 (63%)       14/19 (74%)    15/19 (79%)     0.665
   dpD0-NP-HU        0.763   12/19 (63%)       14/19 (74%)    15/19 (79%)     0.664
   rdp-NP-HU         0.763   12/19 (63%)       14/19 (74%)    15/19 (79%)     0.664
   dp-NP-HU          0.763   12/19 (63%)       14/19 (74%)    15/19 (79%)     0.664
   none-NP-HU        0.763    9/19 (47%)       15/19 (79%)    15/19 (79%)     0.595
   dpD-NP-HU         0.706    9/19 (47%)       12/19 (63%)    14/19 (74%)     0.559
   dplD-NP-HU        0.656   9/19 (47%)        12/19 (63%)    13/19 (68%)     0.533
   dpD0-HP-HU        0.579    3/16 (19%)        9/16 (56%)    10/16 (63%)     0.326
   dpD-HP-HU         0.575    4/16 (25%)        8/16 (50%)    9/16 (56%)      0.362
   dp-HP-HU          0.569    3/16 (19%)       8/16 (50%)     10/16 (63%)     0.322
   dplD-HP-HU        0.553   4/16 (25%)         7/16 (44%)     8/16 (50%)     0.352
   rdp-HP-HU         0.543    2/16 (13%)        5/16 (31%)    10/16 (63%)     0.265
   p-HP-HU           0.433    3/16 (19%)        4/16 (25%)    7/16 (44%)      0.262
   none-HP-HU        0.415    4/16 (25%)        4/16 (25%)     5/16 (31%)     0.288


   Table 21: Impact of Web Techniques on First Relevant Score, Hungarian Queries
Expt         ∆FRS       95% Conf         vs.           3 Extreme Diffs (Topic)
p-NP-HU      0.002     (−0.031, 0.036)     3-1-15   −0.25 (448), 0.07 (283), 0.14 (527)
o-NP-HU     0.000           n/a            0-0-19    0.00 (283), 0.00 (102), 0.00 (527)
r-NP-HU      0.000          n/a            0-0-19    0.00 (283), 0.00 (102), 0.00 (527)
d-NP-HU     −0.003     (−0.008, 0.003)     0-1-18   −0.05 (448), 0.00 (527), 0.00 (283)
l-NP-HU     −0.050     (−0.119, 0.019)     2-5-12   −0.59 (225), −0.26 (110), 0.07 (463)
s-NP-HU     −0.057     (−0.150, 0.036)     0-4-15   −0.88 (527), −0.07 (24), 0.00 (283)
d-HP-HU      0.136     (−0.006, 0.279)      7-3-6    0.78 (435), 0.74 (43), −0.14 (245)
p-HP-HU      0.017     (−0.135, 0.169)      5-2-9   −0.92 (435), 0.45 (346), 0.54 (148)
o-HP-HU     0.009      (−0.010, 0.029)     1-0-15      0.15 (51), 0.00 (49), 0.00 (510)
s-HP-HU     −0.003     (−0.116, 0.110)      6-4-6    −0.76 (346), 0.19 (494), 0.26 (43)
l-HP-HU     −0.022     (−0.101, 0.057)      3-4-9    −0.50 (51), −0.14 (298), 0.27 (9)
r-HP-HU     −0.026     (−0.236, 0.183)      7-7-2   −1.00 (148), −0.66 (346), 0.64 (494)
            Table 22: Mean Scores of WebCLEF Runs on Russian Queries
    Run              FRS     Success@1     Success@5   Success@10           MRR
    dplD-NP-RU        0.401   5/15 (33%)       6/15 (40%)    6/15 (40%)     0.354
    dpD-NP-RU         0.392   4/15 (27%)       6/15 (40%)    6/15 (40%)     0.335
    dpD0-NP-RU        0.381   4/15 (27%)       6/15 (40%)    6/15 (40%)     0.317
    p-NP-RU           0.372   3/15 (20%)       6/15 (40%)    6/15 (40%)     0.272
    dp-NP-RU          0.368   3/15 (20%)       6/15 (40%)    6/15 (40%)     0.267
    rdp-NP-RU         0.362   4/15 (27%)       5/15 (33%)    6/15 (40%)     0.293
    none-NP-RU        0.357   3/15 (20%)       5/15 (33%)    6/15 (40%)     0.260
    rdp-HP-RU         0.359   0/15 ( 0%)       6/15 (40%)    6/15 (40%)     0.134
    dpD-HP-RU         0.355   1/15 ( 7%)       5/15 (33%)    5/15 (33%)     0.163
    dpD0-HP-RU        0.300   0/15 ( 0%)       3/15 (20%)    5/15 (33%)     0.084
    dp-HP-RU          0.300   0/15 ( 0%)       3/15 (20%)    5/15 (33%)     0.084
    dplD-HP-RU        0.282   2/15 (13%)       4/15 (27%)    5/15 (33%)     0.174
    p-HP-RU           0.249   0/15 ( 0%)       2/15 (13%)    4/15 (27%)     0.069
    none-HP-RU        0.174   0/15 ( 0%)       2/15 (13%)    3/15 (20%)     0.044


     Table 23: Impact of Web Techniques on First Relevant Score, Russian Queries
Expt          ∆FRS        95% Conf        vs.           3 Extreme Diffs (Topic)
p-NP-RU      0.015     (−0.016, 0.046)     1-0-14     0.23 (457), 0.00 (63), 0.00 (540)
o-NP-RU      0.014     (−0.014, 0.042)     1-0-14     0.21 (359), 0.00 (63), 0.00 (540)
s-NP-RU      0.011     (−0.007, 0.029)     2-0-13     0.13 (457), 0.03 (63), 0.00 (540)
l-NP-RU      0.008     (−0.020, 0.036)     3-1-11     −0.13 (457), 0.07 (83), 0.12 (63)
d-NP-RU      −0.004    (−0.013, 0.005)     0-1-14   −0.06 (457), 0.00 (540), 0.00 (263)
r-NP-RU      −0.006    (−0.031, 0.019)     1-1-13     −0.16 (457), 0.00 (63), 0.07 (83)
p-HP-RU      0.075     (−0.003, 0.153)      5-1-9    0.39 (71), 0.39 (136), −0.05 (466)
r-HP-RU      0.059     (−0.073, 0.191)     3-2-10   0.79 (181), 0.32 (520), −0.26 (136)
s-HP-RU      0.055     (−0.017, 0.127)     4-1-10    0.46 (520), 0.23 (22), −0.11 (17)
d-HP-RU      0.051     (−0.021, 0.123)     5-0-10    0.54 (520), 0.11 (466), 0.00 (240)
o-HP-RU      0.000          n/a            0-0-15     0.00 (240), 0.00 (22), 0.00 (520)
l-HP-RU      −0.073    (−0.153, 0.007)      2-6-7   −0.46 (136), −0.27 (240), 0.14 (22)


             Table 24: Mean Scores of WebCLEF Runs on French Queries
    Run              FRS      Success@1    Success@5    Success@10          MRR
    dplD-NP-FR        1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    dpD-NP-FR         1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    dpD0-NP-FR        1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    rdp-NP-FR         1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    dp-NP-FR          1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    p-NP-FR           1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
    none-NP-FR        1.000   1/1 (100%)       1/1 (100%)    1/1 (100%)     1.000
References
[1] AltaVista’s Babel Fish Translation Service. http://babelfish.altavista.com/tr
[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/
[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. Sixteenth International
   Unicode Conference, 2000.
[4] NTCIR       (NII-NACSIS      Test   Collection     for    IR      Systems)       Home      Page.
   http://research.nii.ac.jp/∼ntcadm/index-en.html
[5] M. F. Porter. Snowball:        A language for stemming           algorithms.    October    2001.
   http://snowball.tartarus.org/texts/introduction.html
[6] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu and M. Gatford. Okapi at
   TREC-3. Proceedings of TREC-3, 1995.
[7] Jacques Savoy. CLEF and            Multilingual    information    retrieval     resource   page.
   http://www.unine.ch/info/clef/

[8] Börkur Sigurbjörnsson, Jaap Kamps and Maarten de Rijke. EuroGOV: Engineering a Multi-
   lingual Web Corpus. To appear in Working Notes for the CLEF 2005 Workshop.

[9] Börkur Sigurbjörnsson, Jaap Kamps and Maarten de Rijke. Overview of WebCLEF 2005. To
   appear in Working Notes for the CLEF 2005 Workshop.

[10] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/
[11] Stephen Tomlinson. European Ad Hoc Retrieval Experiments with Hummingbird
   SearchServerTM at CLEF 2005. To appear in Working Notes for the CLEF 2005 Workshop.

[12] Stephen Tomlinson. Experiments in Named Page Finding and Arabic Retrieval with Hum-
   mingbird SearchServerTM at TREC 2002. Proceedings of TREC 2002.

[13] Stephen Tomlinson. Robust, Web and Genomic Retrieval with Hummingbird SearchServerTM
   at TREC 2003. Proceedings of TREC 2003.

[14] Stephen Tomlinson. Robust, Web and Terabyte Retrieval with Hummingbird SearchServerTM
   at TREC 2004. Proceedings of TREC 2004.

[15] Thijs Westerveld, Wessel Kraaij and Djoerd Hiemstra. Retrieving Web Pages using Content,
   Links, URLs and Anchors. Proceedings of TREC 2001.