<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>European Web Retrieval Experiments with Hummingbird SearchServer™ at WebCLEF 2005</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Tomlinson</string-name>
          <email>stephen.tomlinson@hummingbird.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hummingbird</institution>
          ,
          <addr-line>Ottawa, Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2005</year>
      </pub-date>
      <abstract>
        <p>Hummingbird participated in the mixed monolingual retrieval task of the WebCLEF Track of the Cross-Language Evaluation Forum (CLEF) 2005. In this task, the system was given 547 known-item queries from 11 languages (134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30 Russian, 16 Greek, 5 Icelandic and 1 French). The goal was to find the desired page in the 82GB EuroGOV collection (3.4 million pages crawled from government sites of 27 European domains). We experimented with different techniques for web retrieval and analyzed the differences between them. We defined a new measure, First Relevant Score (FRS), to facilitate per-topic analysis, and we focused on analyzing Greek, Danish and Icelandic topics. We found that stopword processing was more important than anticipated, perhaps because words common in one language may tend to be overweighted by inverse document frequency in a mixed language collection. Extra weight on the document title helped significantly, and extra weight on less deep urls significantly helped home page queries. Stemming was of neutral impact on average, but could make a substantial difference for individual queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.</p>
      <p>¹SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.</p>
      <p>
        SearchServer works in Unicode internally [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and supports most of the world’s major
character sets and languages. The major conferences in text retrieval experimentation (CLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
NTCIR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and TREC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.
      </p>
      <p>This (draft) paper describes experimental work with SearchServer for the task of finding known
home or named pages in 11 European languages (Spanish, English, Dutch, Portuguese, German,
Hungarian, Danish, Russian, Greek, Icelandic and French) using the WebCLEF 2005 test
collection.
</p>
    </sec>
    <sec id="sec-1a">
      <title>Methodology</title>
      <sec id="sec-1-1">
        <title>Data</title>
        <p>For the submitted runs in June 2005, SearchServer experimental development build 7.0.0.707 was
used.</p>
        <p>
          The collection to be searched was the EuroGOV collection. It consisted of 3,589,502 pages crawled
from government sites of 27 European domains. Uncompressed, it was 88,062,007,676 bytes
(82.0 GB). The average document size was 24,533 bytes. More details on this collection are in
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Note that we only indexed 3,417,463 of the pages because the organizers provided a “blacklist”
of 172,039 pages to omit (primarily binary documents).
        </p>
        <p>
          For the mixed monolingual task, there were 547 known-item queries from 11 different languages
(134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30
Russian, 16 Greek, 5 Icelandic and 1 French). Of these, 345 were named page queries and 242
were home page queries. More details on the mixed monolingual task are in the track overview
paper [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Indexing</title>
        <p>Our indexing approach was based on the approach we used for TREC Web tasks the previous
three years (described in detail in [12]). Briefly, in addition to full-text indexing, the custom text
reader cTREC populated particular columns such as TITLE (if any), URL, URL_TYPE and
URL_DEPTH. The URL_TYPE was set to ROOT, SUBROOT, PATH or FILE, based on the
convention which worked well in TREC 2001 for the Twente/TNO group [15] on the entry page
finding task (also known as the home page finding task). The URL_DEPTH was set to a term
indicating the depth of the page in the site. Table 1 contains URL types and depths for example
URLs. The exact rules we used are given in [12].</p>
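      <p>As an illustrative sketch only (the exact cTREC rules in [12] differ in details, and the index-file handling below is an assumption), a TNO-style URL classifier can be expressed as:</p>

```python
from urllib.parse import urlparse

def classify_url(url):
    # Rough sketch of a TNO-style classification: ROOT is the site's root
    # page, SUBROOT a top-level directory, PATH a deeper directory, FILE a
    # named file; depth counts slash-separated steps from the site root
    # (the root page itself is depth 1).
    path = urlparse(url).path
    for index in ("index.html", "index.htm", "default.html"):
        if path.endswith("/" + index):
            path = path[: -len(index)]   # treat index files as their directory
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "ROOT", 1
    if not path.endswith("/"):
        return "FILE", len(segments) + 1
    if len(segments) == 1:
        return "SUBROOT", 2
    return "PATH", len(segments) + 1
```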
        <p>WebCLEF required a few indexing enhancements compared to TREC. In particular, it wouldn’t
suffice to assume all the pages were in the ASCII character set. We added a /cs option to our
cTREC text reader which used the first recognized ‘charset’ specification in the page (e.g. from
the meta http-equiv tag) to indicate from which character set to convert the page to Unicode
(Win_1252 was assumed if no charset was specified).</p>
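      <p>A minimal sketch of this kind of charset sniffing (hypothetical code, not the actual /cs implementation) might be:</p>

```python
import re

# Find the first 'charset' declaration in the raw page bytes (e.g. from a
# meta http-equiv tag) and decode with it, defaulting to Windows-1252.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\x27]?\s*([A-Za-z0-9_\-]+)',
                        re.IGNORECASE)

def to_unicode(raw):
    match = CHARSET_RE.search(raw)
    name = match.group(1).decode("ascii") if match else "windows-1252"
    try:
        return raw.decode(name, errors="replace")
    except LookupError:          # a charset name the platform does not know
        return raw.decode("windows-1252", errors="replace")
```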
        <p>For the baseline task, in which the system was not to make use of any of the topic metadata
such as the specified language of the query, we still indexed with English stopwords (even though
the majority of the documents were in other languages). We treated the apostrophe as a term
separator (which we normally do for languages other than English, but in this collection, it was
also a separator for English). No accents were indexed. English stemming was used on the table,
but SearchServer also indexed all the surface forms (after Unicode normalizations such as case
normalization), and the baseline runs just searched the surface forms, not the stems.</p>
        <p>
          For 2 of our submitted runs, we labelled the runs as making use of the topic and page language
metadata (which were always the same in the mixed monolingual task) along with the page’s
domain. For these runs, we created a set of language-specific indexes (one for each of the 11
query languages) which used a stemmer and stopfile for that language (for English and Icelandic,
we actually used the original baseline index, which had English stems and stopwords). For some
of the languages, because we were close to the submission deadline, we also skipped indexing
some of the domains to save time (e.g. for Greek, just the ‘gr’ and ‘eu.int’ subsets of EuroGOV
were included because it was known all the results were in the ‘gr’ domain) which would have
a (probably minor) effect on the inverse document frequencies (minor especially since we always
included the ‘eu.int’ subset in each index). For 9 of the languages (Danish, Dutch, English, French,
German, Greek, Portuguese, Russian and Spanish), the lexical stemmer in SearchServer (based
on internal stemming component 3.7.0.15) was used. For Hungarian, the Neuchatel stemmer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
was used (see our companion ad hoc retrieval paper [11] for details). For Icelandic, we used the
English index as previously mentioned. For Greek and Russian, we additionally enabled indexing
of a few accents because the stemmer was accent-sensitive. When processing queries for these
runs, the query was directed to the index for the specified language.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Searching</title>
        <p>We executed 7 runs in June 2005, though only 5 were allowed to be submitted. All 7 are described
here. The first 4 runs were ‘baseline’ runs which did not use the topic metadata. The other 3
runs made use of the topic metadata (in particular, the domain, and for the last 2 runs, also the
language).</p>
        <p>humWC05none: This run was a plain content search of the baseline table. No inflections
were used. This run was the analog of the “none” runs described in our ad hoc retrieval
paper [11]. It used the ‘2:3’ relevance method and document length normalization (SET
RELEVANCE_DLEN_IMP 500). The IS_ABOUT predicate was used instead of the CONTAINS
predicate (and hence the VECTOR_GENERATOR was set to blank to disable inflections instead
of the TERM_GENERATOR), but the relevance calculation was the same. (This run was not
submitted.)</p>
        <p>humWC05p run: This submitted run was the same as humWC05none except that it put
additional weight on matches in the title, url, first heading and some meta tags, including extra weight
on matching the query as a phrase in these fields. Below is an example SearchSQL query. The
searches on the ALL_PROPS column (which contained a copy of the title, url, etc. as described
in [12]) are the difference from the humWC05none run. Note that the FT_TEXT column indexed
the content and also all of the non-content fields except for the URL. More details of the syntax
are explained in [13]. This run used the same approach as the TREC 2004 humW04pl run except
that linguistic inflections were disabled.</p>
        <p>SELECT RELEVANCE(’2:3’) AS REL, DOCNO
FROM EGOV
WHERE
(ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
ORDER BY REL DESC;
humWC05dp run: This submitted run was the same as humWC05p except that it put additional
weight on urls of depth 4 or less (but not on the url type, though url types were still listed with
weight 0 as a way to prevent urls of depth greater than 4 from being excluded). Less deep urls
also received higher weight from inverse document frequency because (presumably) they are less
common. This run used the same approach as the TREC 2004 humW04dpl run except that
linguistic inflections were disabled. Below is an example WHERE clause:</p>
        <p>WHERE
((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
) AND (
(URL_TYPE CONTAINS ’ROOT’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’PATH’ WEIGHT 0) OR
(URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
(URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )
humWC05rdp run: This submitted run was the same as humWC05dp except that it put
additional weight on the url type. This run used the same approach as the TREC 2004 humW04rdpl
run except that linguistic inflections were disabled. Below is an example WHERE clause:
WHERE
((ALL_PROPS CONTAINS ’Giuseppe Medici’ WEIGHT 1) OR
(ALL_PROPS IS_ABOUT ’Giuseppe Medici’ WEIGHT 1) OR
(FT_TEXT IS_ABOUT ’Giuseppe Medici’ WEIGHT 10)
) AND (
(URL_TYPE CONTAINS ’ROOT’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’SUBROOT’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’PATH’ WEIGHT 10) OR
(URL_TYPE CONTAINS ’FILE’ WEIGHT 0) OR
(URL_DEPTH CONTAINS ’URLDEPTHA’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHAB’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABC’ WEIGHT 5) OR
(URL_DEPTH CONTAINS ’URLDEPTHABCD’ WEIGHT 5) )
humWC05dpD0 run: This run was the same as humWC05dp except that the domain
information of the topic metadata was used to restrict the search to the specified domain. Below is
an example of the domain filter added to the WHERE clause for a case in which the page was
known to be in the ‘it’ domain (which implied the DOCNO would contain ‘Eit’). This run was
not submitted.</p>
        <p>AND (DOCNO CONTAINS ’Eit’ WEIGHT 0)
humWC05dpD run: This submitted run was the same as humWC05dpD0 except that the
language information of the topic metadata was used to direct the search to the table for the
specified language (i.e. the WHERE clause was the same as for humWC05dpD0, but the FROM
clause specified a different table). Inflections were still not used.</p>
        <p>humWC05dplD run: This submitted run was the same as humWC05dpD except that the
content and title searches included linguistic expansion from language-specific stemming (this was
done with SET VECTOR_GENERATOR ‘word!ftelp/inflect’; note that /decompound (applicable
to Dutch and German) is implied for /inflect with SET VECTOR_GENERATOR, unlike with
SET TERM_GENERATOR).
</p>
      </sec>
      <sec id="sec-1-4">
        <title>Evaluation Measures</title>
        <p>If one wishes to focus on just the first relevant document, the traditional measure is “Reciprocal
Rank” (RR). For a topic, it is 1/r where r is the rank of the first row for which a desired page is
found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of
the reciprocal ranks over all the topics.</p>
        <p>An experimental measure introduced in this paper (along with the companion ad hoc retrieval
paper [11]) is “First Relevant Score” (denoted “FRS”). Like reciprocal rank, it is based on just the
rank of the first relevant retrieved for a topic, but it is better suited to per-topic analysis. FRS is
1.08^(1−r) where r is the rank of the first row for which a desired page is found, or zero if a desired
page was not found. Like reciprocal rank, finding the first relevant at rank 1 produces a score of
1.0. At rank 2, FRS is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank
3, FRS is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, FRS
is 0.50, whereas RR is 0.10. FRS is greater than RR for ranks 2 to 52 and lower for ranks 53
and beyond. A possible interpretation of FRS is that it may be an indicator of the percentage of
potential result list reading the system saved the user to get to the first relevant, assuming that
users are less and less likely to continue reading as they get deeper into the result list.</p>
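      <p>The two measures can be stated compactly (a sketch; ranks are 1-based and None marks a topic with no desired page retrieved):</p>

```python
def reciprocal_rank(r):
    # RR = 1/r; zero when no desired page was found
    return 0.0 if r is None else 1.0 / r

def first_relevant_score(r):
    # FRS = 1.08 ** (1 - r): rank 1 scores 1.0, and each additional rank
    # costs about 8% of the remaining score, a much gentler decay than RR
    return 0.0 if r is None else 1.08 ** (1 - r)
```

Spot-checking the figures quoted above: FRS at rank 2 is 0.93, at rank 10 it is 0.50, and FRS exceeds RR for ranks 2 through 52 but not from rank 53 on.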
        <p>“Success@n” is the percentage of topics for which at least one relevant document was returned
in the first n rows. Like the other first relevant measures, this measure hides a lot of retrieval
differences (particularly in recall), but it is more intuitive and may be an indicator of a user’s
impression of a method’s robustness across topics. This paper lists Success@1, Success@5 and
Success@10.
</p>
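      <p>A sketch of Success@n over a list of first-relevant ranks (1-based, None for a topic where no desired page was retrieved):</p>

```python
def success_at_n(first_relevant_ranks, n):
    # Percentage of topics with at least one desired page in the first n rows.
    found = sum(1 for r in first_relevant_ranks if r is not None and n >= r)
    return 100.0 * found / len(first_relevant_ranks)
```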
      </sec>
      <sec id="sec-1-5">
        <title>Per-Topic Tables</title>
        <p>The 7 runs allow us to isolate 6 ‘web techniques’ which are denoted as follows:
• ‘p’ (extra weight for phrases in the Title and other properties plus extra weight for vector
search on properties): The humWC05p score minus the humWC05none score.
• ‘d’ (modest extra weight for less deep urls): The humWC05dp score minus the humWC05p
score.
• ‘r’ (strong extra weight for urls of root, subroot or path types): The humWC05rdp score
minus the humWC05dp score.
• ‘o’ (domain filtering): The humWC05dpD0 score minus the humWC05dp score.
• ‘s’ (stopwords specific to the language and possibly accent-indexing and inverse document
frequency changes): The humWC05dpD score minus the humWC05dpD0 score.
• ‘l’ (linguistic expansion from stemming): The humWC05dplD score minus the humWC05dpD
score.</p>
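      <p>The subtraction scheme can be sketched as follows; the run names are from this paper, but the scores below are invented purely to illustrate the arithmetic:</p>

```python
# Each technique is isolated as the score difference between a pair of runs.
# The FRS values here are made up for illustration only.
runs = {"humWC05none": 0.40, "humWC05p": 0.50, "humWC05dp": 0.55,
        "humWC05rdp": 0.53, "humWC05dpD0": 0.56, "humWC05dpD": 0.60,
        "humWC05dplD": 0.60}

pairs = {"p": ("humWC05p", "humWC05none"),
         "d": ("humWC05dp", "humWC05p"),
         "r": ("humWC05rdp", "humWC05dp"),
         "o": ("humWC05dpD0", "humWC05dp"),
         "s": ("humWC05dpD", "humWC05dpD0"),
         "l": ("humWC05dplD", "humWC05dpD")}

deltas = {t: round(runs[a] - runs[b], 2) for t, (a, b) in pairs.items()}
```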
        <p>For the per-topic tables comparing 2 diagnostic runs (such as Table 3), the columns are as
follows:
• “Expt” specifies the experiment. It starts with one of the above 6 web techniques, followed
by ‘NP’ for named page queries or ‘HP’ for home page queries, optionally followed by the
language code.</p>
        <p>• “ΔFRS” is the difference of the (mean) first relevant score of the two runs being compared.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>• The ‘p’ technique (extra weight for phrases in the Title and other properties plus extra weight
for vector search on properties) was of statistically significant benefit for both named pages
and home pages, which is consistent with our TREC results [14] except that the benefit was
larger at TREC.
• The ‘d’ technique (modest extra weight for less deep urls) was of statistically significant
benefit for home pages and neutral on average for named pages, which is consistent with our
TREC results except that the benefit for home pages was larger at TREC.
• The ‘r’ technique (strong extra weight for urls of root, subroot or path types) was less
detrimental than we expected for named pages and less helpful than we expected for home
pages compared to our TREC results.
• The ‘o’ technique (domain filtering), as expected, never caused the score to go down on any
topic (as the ‘vs.’ column shows) because it just included rows from the known domain. But
the benefit was not large on average, so apparently the unfiltered queries usually were not
confused much by the extra domains.
• The ‘s’ technique (stopwords specific to the language and possibly accent-indexing and
inverse document frequency changes) was a surprise in that it led to a statistically significant
benefit for both named pages and home pages. We look at this more below.
• The ‘l’ technique (linguistic expansion from stemming) was of neutral impact on average,
but it could make a substantial difference for individual queries as we will see below.</p>
      <p>
        In the sections that follow, we focus on Greek, Danish and Icelandic because this is the first
time we have had judged test collections for these languages. In particular, we focus on the impact
of the ‘s’ and ‘l’ techniques, i.e. the impacts of stopwords (and accents) and stemming. For English,
we compare the scores on our own contributed topics to the other English topics. The last section
lists the per-topic tables for the remaining languages in descending order by number of topics, for
future reference.
      </p>
      <p>
        Table 4 lists the mean scores for the 11 Greek named page queries and 5 Greek home page
queries. The top-scoring runs used stemming (run humWC05dplD) or disabled accent-indexing
(run humWC05dpD0). The run with accent-indexing and not stemming (humWC05dpD) did
not score as highly on average. Table 5 shows that the ‘l’ technique (stemming, i.e. the dplD
score minus the dpD score) was positive on average, while the ‘s’ factor (the dpD score minus
the dpD0 score, primarily isolating the impact of stopwords specific to the language, including
specifying accent-indexing in the Greek case) was negative, and it lists the topics most affected
by each technique in each direction, which we examine below. (In the topic analysis below, the
translations are based partly on the official topic translations and partly on the online
Greek-to-English translation service at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].)
      </p>
      <p>WC0112: Table 5 shows that the biggest impact of Greek stemming was on topic 112 (Πλήρης
λίστα των υπουργών και υφυπουργών όλων των υπουργείων της Ελληνικής κυβέρνησης (List of
ministers and deputy ministers for all the ministries of the Greek government)). The desired page was
not retrieved in the top-50 without inflecting because the key query terms were plurals (υπουργών
(ministers), υφυπουργών (undersecretaries), υπουργείων (ministries)) while the desired page just
contained singular forms (Υπουργός (Minister), Υφυπουργός (Undersecretary), Υπουργείο
(Ministry)).</p>
      <p>WC0395: Table 5 shows that the next biggest impact of Greek stemming was on topic 395 (Ο
Έλληνας πρωθυπουργός και το μήνυμά του (The Greek Prime Minister and his message)). With
stemming, the first relevant was found at rank 13 instead of 39, a 34 point increase in FRS (in
the reciprocal rank measure, this would just be a 5 point increase). Without stemming, the only
matching word was του (his), which probably should have been a stopword. With stemming, the
query word πρωθυπουργός (Prime Minister) matched the document’s variant (Πρωθυπουργού).
Because we enabled indexing of Greek accents for our lexical Greek stemmer, the query word
μήνυμά (message) did not match the document form Μήνυμα (which did not include an accent on
the last character; the first letter is just a lowercase-uppercase difference which all runs handled by
normalizing Unicode to uppercase). Note that the humWC05dpD0 run did match Μήνυμα because
accent-indexing was not enabled for this run; presumably this is why the s-NP-EL line of Table
5 shows that switching to the Greek-specific stopfile (which enabled accent indexing) decreased
FRS 34 points for this topic. For most languages, our lexical stemmers are accent-insensitive; we
should investigate doing the same for Greek.</p>
      <p>WC0432: Table 5 shows that the biggest impact of switching to the Greek-specific stopfile was
a detrimental impact on topic 432 (Είσοδος Ελληνικής ιστοσελίδας για τη Συνέλευση για το μέλλον
της Ευρώπης (Greek home page of the convention for the future of Europe)). The desired page was
found at rank 12 without accent-indexing but was not retrieved in the top-50 with accent-indexing.
The humWC05dpD0 run matched the document title terms which were in uppercase and did not
have accents, particularly ΣΥΝΕΛΕΥΣΗ (ASSEMBLY), ΜΕΛΛΟΝ (FUTURE) and ΕΥΡΩΠΗΣ
(EUROPE). (The corresponding query words had accents: Συνέλευση (assembly), μέλλον (future)
and Ευρώπης (Europe)). This issue would presumably impair the ‘p’ web technique (extra weight
on properties such as the title) because title words are often in uppercase and apparently in Greek
uppercase words often omit the accents. (Incidentally, the o-HP-EL line of Table 5 shows that
domain filtering (restricting to the .gr domain) was useful for this query; without it, even without
accent-indexing, the retrieved pages were mostly from the .eu.int domain.)</p>
      <p>WC0445: Table 5 shows that the biggest positive impact of switching to the Greek-specific
stopfile was on topic 445 (Πληροφορίες επικοινωνίας όλων των υπουργείων της Ελληνικής κυβέρνησης
(Contact information of all the ministries of the Greek government)). The reason seems to be that
the non-content words in the query (such as των (of) and της (her)) generated spurious matches
in the humWC05dpD0 run (which did not use Greek-specific stopwords), pushing down the desired
page from rank 28 to beyond the top-50. Normally, common words have little effect on the ranking
because they have a low inverse document frequency (idf), but in this mixed language collection,
common words in the Greek documents are still fairly uncommon overall, and hence get relatively
more weight. This topic illustrates why stopword processing may be of more importance in mixed
language collections than in single language collections.</p>
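      <p>A toy idf calculation (with made-up document counts) illustrates the effect:</p>

```python
import math

# A Greek function word appearing in most Greek pages is common relative to
# a Greek-only collection, but rare relative to the whole mixed-language
# collection, so a standard idf formula assigns it a much higher weight.
def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)

greek_only_idf = idf(100_000, 60_000)    # low weight: the word is common
mixed_idf = idf(3_400_000, 60_000)       # much higher in the mixed collection
```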
      <p>
        Even though there were just 16 Greek topics, with careful experimental setup and detailed
per-topic analysis, we learned a lot about Greek web search in a mixed language collection. Stemming
can be quite helpful, accent mismatches are common (especially in the important Title field of web
documents), and stopwords common in one language may be over-weighted in a mixed language
collection by traditional idf formulations.
      </p>
      <p>
        WC0233: Table 7 shows that the biggest impact of switching to the Danish-specific stopfile was a
71 point increase in FRS on topic 233 (presserum europæiske kontor for bekæmpelse af svig (press
room of the european anti fraud office)). Without having ‘af’ as a stopword, the first relevant rank
fell from 2 to 21. This appears to be a similar finding to Greek topic WC0445 in that a common
word in one language was uncommon enough in the mixed language collection to be assigned a
high enough inverse document frequency to cause trouble. (Our Danish stoplist was based on
Porter’s [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].) Incidentally, with stemming enabled, the rank improved from 2 to 1 for this topic,
in part because of an extra ‘bekaempelse’ match in the meta keywords and also from an extra
‘Europaeiske’ match in the body. It’s good to see that the SearchServer stemmer handled the æ vs.
ae variation of Danish (the query words used the one-character ligature (æ) while the document
words used two letters (‘a’ and ‘e’)).
      </p>
      <p>WC0392: Another interesting Danish stemming case was on topic 392 (Rigsombudsmanden i
Grønland (the high commissioner of greenland)). With stemming, the rank of the desired page
improved from 24 to 19. The extra matches from stemming were ‘Rigsombudsmand’ and
‘Groenland’ (the latter occurred in the filenames of img tags, which we indexed). Again, it’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the two-letter variant (‘oe’).</p>
      <p>
        WC0317: On topic 317 (økologisk landbrug i europa (organic farming in europe)), the rank
of the desired page actually fell from 4 to 8 with stemming, even though the additional matches
of ‘okologisk’ (in the meta keywords) and ‘landbrugets’ look proper. (As an aside, the compound
‘landbrugspolitik’ was not matched; we’re unsure in general how common compound words are in
Danish.) The relevance scores of the top documents were close together for this topic, so the fall in
rank appears to be a chance result. Note that the cTREC text reader used for these experiments
did not normalize the html entity reference ‘&amp;Oslash;’ to Ø (or most other entity references for
that matter, which may have impaired the overall results for some languages). It’s good to see
that the SearchServer stemmer matched the query form using the Danish o with stroke (ø) with
the one-letter variant (‘o’).
      </p>
      <p>
        For Icelandic, we used English stopwords and English stemming. We review some topics to see
what can be learned about Icelandic retrieval.
      </p>
      <p>WC0488: Table 9 shows that the only topic on which English stemming made a difference
was topic 488 (framboð ferskvatns í evrópu (Fresh water supplies in europa)). The desired page’s
rank fell from 1 to 2 with English stemming because it matched the word ‘Ferskvatn’ which was
not in the desired page (the English lexical stemmer was augmented with a stem guesser for
unrecognized words). A variant in the desired page, ‘ferskvatni’, was not matched by English
stemming. It appears that ‘í’ is a potential Icelandic stopword (‘i’ actually was not in our English
list though arguably should be). This topic also shows that Icelandic uses the small letter Eth (ð).
SearchServer case normalizes ð to the capital letter Eth (Ð).</p>
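      <p>This normalization can be checked directly in any Unicode-aware language; a one-line illustration:</p>

```python
# Unicode case normalization maps the Icelandic small letter eth (U+00F0)
# to the capital letter Eth (U+00D0), so both case variants index identically.
query_term = "framboð"
normalized = query_term.upper()
```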
      <p>WC0456: In topic 456 (upplýsingar um europol (europol factsheet)), English stemming missed
apparent variants to the query word ‘upplýsingar’ such as ‘Upplýsingasíða’ and ‘upplýsingamál’.
‘um’ appears to be a potential Icelandic stopword.</p>
      <p>WC0243: In (home page) topic 243 (umhverfisstofnun evrópu (european environment agency)),
we noticed that some web pages used entity references such as ‘&amp;eth;’ and ‘&amp;thorn;’ and ‘&amp;yacute;’
which our cTREC text reader did not normalize to the corresponding character, possibly impairing
results for some queries.</p>
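      <p>A sketch of the missing normalization step (using Python’s standard html module for illustration, not the cTREC reader itself):</p>

```python
import html

AMP = chr(38)  # the ampersand character, spelled out to keep this XML-safe

# Resolve HTML character entity references (e.g. the Oslash entity to the
# letter Ø) before indexing, a step the cTREC reader here did not perform
# for most entities.
def normalize_entities(text):
    return html.unescape(text)

sample = AMP + "Oslash;kologisk " + AMP + "eth; " + AMP + "thorn; " + AMP + "yacute;"
decoded = normalize_entities(sample)
```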
      <p>
        We were disappointed that the Icelandic thorn (lowercase þ or uppercase Þ) was not used in
any of the topic words. But overall, even with just 5 topics in the test set, we have learned at
least that an Icelandic stemmer would potentially be helpful for Icelandic retrieval.
      </p>
      <p>
        WebCLEF participants were requested to contribute at least 30 known-item topics. Each topic
consisted of a query, the correct answer page in EuroGOV, and a list of duplicate and translated
pages in EuroGOV. We contributed 30 English topics. Tables 10 and 11 separate the results for
our topics from the other English topics. Based on the scores, it appears that our named page
topics may have been easier than the others, but our home page topics may have been harder.
      </p>
      <p>To create a topic, we typically started by randomly selecting an English language page from
the EuroGOV collection. (The organizers had provided a languages.tar.gz file which listed the
languages detected in each document; we reduced this file to the 252,574 pages labelled just as
‘english’, then randomly selected pages from this list.) We alternated between creating named
page queries and home page queries.</p>
      <p>If we wanted a named page query, we tried to understand the random page well enough to
create an unambiguous query for it. Sometimes we rejected a page for being too obscure, and
tried browsing to a related page for which a clearer query could be made. (Browsing was done on
the live web; then we would find the new page in EuroGOV by extracting a phrase and searching
EuroGOV with SearchServer.) If browsing was not fruitful, we started over with a new random
page. Sometimes we started over because the area we were browsing looked too similar to an area
for which we had already made a query.</p>
      <p>For home page queries, usually the random start page was not a home page, so we would
typically try to browse to the closest home page for that page (again, typically on the live web,
by following links or truncating the url).</p>
      <p>To find duplicates, typically we extracted a phrase from the document and used SearchServer
to find other pages with that phrase, then checked those pages to confirm they were duplicates. If
a page had more duplicates than we were willing to record, we started over with a new page.</p>
      <p>To find translations, typically we browsed the live web for links to translated pages, then
used SearchServer to find them in EuroGOV. Finding translations took a lot of detective work.
Sometimes the url was the same except for a language tag, making it easy to find the translations
with SearchServer. Sometimes sites had direct links to the translations, which was also easy. But
sometimes sites just had links to the top-level page for each language, so we would see how to
browse down for English, and then try to do analogous browsing for the translation language,
grasping for clues such as possible word translations or similar pictures, to get to the proper
translated page. It’s quite possible we missed some translations.</p>
      <p>For the query itself, we tried to make it as realistic as possible (e.g. short and general)
but also unambiguous. This could depend on what other pages were available; e.g. for a biography
of Giuseppe Medici, it was enough to specify just ‘Giuseppe Medici’ as the query because no other
(English) pages focused on that person. Usually we tried candidate queries with the
organizer-provided engines or a web search engine to check for other valid interpretations of the
query that we had not expected, so that we could adjust the query accordingly.</p>
      <p>It seemed that, quite often, our query ended up fairly similar to the document title.
Table 11 shows that in the ‘p’ experiment (which isolates giving more weight to the title and
other meta properties), our queries did tend to be helped by the extra title weight, but the
other groups’ English queries actually benefited from it even more often. Unfortunately, we ran
out of time to walk through the topics for more languages, but for future reference we list the
per-topic tables for the remaining languages (in descending order by number of topics).</p>
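      <p>The idea behind the ‘p’ experiment can be illustrated with a toy field-weighted score.
The weight value and the scoring formula below are illustrative assumptions, not SearchServer’s
actual ranking function.</p>

```python
import math

def field_weighted_score(query_terms, title_terms, body_terms,
                         df, n_docs, title_weight=3.0):
    """Toy relevance score: each matched query term contributes its
    idf, and a title match is additionally multiplied by title_weight,
    so pages whose titles resemble the query rank higher.
    Illustrative only."""
    score = 0.0
    for t in query_terms:
        idf = math.log((n_docs + 1) / (df.get(t, 0) + 1))
        if t in body_terms:
            score += idf
        if t in title_terms:
            score += title_weight * idf
    return score
```

      <p>Under such a scheme, a page whose title contains the query terms outranks an otherwise
identical page that matches only in the body, which is why queries resembling the title tend to
benefit from the extra title weight.</p>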
    </sec>
  </body>
  <back>
  </back>
</article>