<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer™ at CLEF 2004</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Tomlinson</string-name>
          <email>stephen.tomlinson@hummingbird.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hummingbird</institution>
          ,
          <addr-line>Ottawa, Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
      <abstract>
        <p>Hummingbird participated in the Finnish, Portuguese, Russian and French monolingual information retrieval tasks of the Cross-Language Evaluation Forum (CLEF) 2004: for the natural language queries, find all the relevant documents (with high precision) in the CLEF 2004 document sets. SearchServer's experimental lexical stemmers significantly increased mean average precision for each of the 4 languages. For Finnish, mean average precision was significantly higher with SearchServer's experimental decompounding option enabled. For each language, the submitted SearchServer run returned a relevant document in the first row for more than half of the short (Title-only) queries. At least one relevant document was returned in the first ten rows for 75-90% of the short queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Hummingbird Ottawa, Ontario, Canada stephen.tomlinson@hummingbird.com http://www.hummingbird.com/</title>
      <sec id="sec-1-1">
        <title>Introduction</title>
      </sec>
      <sec id="sec-1-2">
        <title>Methodology</title>
        <sec id="sec-1-2-1">
          <title>Data</title>
          <p>The CLEF 2004 document sets consisted of tagged (SGML-formatted) news articles (mostly from
1995) in 4 different languages: Finnish, Portuguese, Russian and French. Table 1 gives the sizes.</p>
          <p>The CLEF organizers created 50 natural language “topics” (numbered 201-250) and translated
them into many languages. Each topic contained a “Title” (subject of the topic), “Description”
(a one-sentence specification of the information need) and “Narrative” (more detailed guidelines
for what a relevant document should or should not contain). The participants were asked to
use the Title and Description fields for at least one automatic submission per task this year to
facilitate comparison of results. Some topics were discarded for some languages because no relevant
documents existed for them. Table 1 gives the final number of topics for each language and their
average number of relevant documents. For more information on the CLEF test collections, see
the CLEF web site [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
          <p><sup>1</sup>SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.</p>
          <p>Language</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>French Portuguese Finnish Russian</title>
      <sec id="sec-2-1">
        <title>Indexing</title>
        <p>Our indexing approach was mostly the same as last year [10]. Accents were not indexed except
for the combining breve in Russian. The apostrophe was treated as a word separator for the 4
investigated languages. Our custom text reader, cTREC, was updated to maintain support for
the CLEF guidelines of only indexing specifically tagged fields (the new Portuguese collection
necessitated a minor update).</p>
        <p>Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these
experiments, our stop word lists for Portuguese and Russian were based on the Porter lists [<xref ref-type="bibr" rid="ref4">5</xref>],
and this year we based our Finnish list on Savoy’s [7]. We used our own list for French.</p>
        <p>By default, the SearchServer index supports both exact matching (after some Unicode-based
normalizations, such as decompositions and conversion to upper-case) and morphological matching
(e.g. inflections, derivations and compounds, depending on the linguistic component used).</p>
        <p>For many languages (including the 4 European languages investigated in CLEF 2004),
SearchServer includes the option of finding inflections based on lexical stemming (i.e. stemming based
on a dictionary or lexicon for the language). For example, in English, “baby”, “babied”, “babies”,
“baby’s” and “babying” all have “baby” as a stem. Specifying an inflected search for any of these
terms will match all of the others. The lexical stemming of the post-5.x experimental development
version of SearchServer used for the experiments in this paper was based on internal stemming
component 3.6.3.4 for the submitted runs and 3.7.0.15 for the diagnostic runs. We treat each
linguistic component as a black box in this paper.</p>
        <p>SearchServer typically does “inflectional” stemming which generally retains the part of speech
(e.g. a plural of a noun is typically stemmed to the singular form). It typically does not do
“derivational” stemming which would often change the part of speech or the meaning more substantially
(e.g. “performer” is not stemmed to “perform”).</p>
        <p>SearchServer’s lexical stemming includes compound-splitting (decompounding) for compound
words in Finnish (and also some other languages not investigated this year, such as German, Dutch
and Swedish). For example, in German, “babykost” (baby food) has “baby” and “kost” as stems.</p>
        <p>SearchServer’s lexical stemming also supports some spelling variations. In English, British and
American spellings have the same stems, e.g. “labour” stems to “labor”, “hospitalisation” stems to
“hospitalization” and “plough” stems to “plow”.</p>
        <p>Lexical stemmers can produce more than one stem, even for non-compound words. For
example, in English, “axes” has both “axe” and “axis” as stems (different meanings), and in French,
“important” has both “important” (adjective) and “importer” (verb) as stems (different parts of
speech). SearchServer records all the stem mappings at index-time to support maximum recall
and does so in a way to allow searching to weight some inflections higher than others.</p>
        <p>Unlike previous years, this year we experimented with SearchServer’s CONTAINS predicate
(instead of the IS_ABOUT predicate) though it should not make a difference to the ranking. Our
test application specified SearchSQL to perform a boolean-OR of the query words. For example,
for Russian topic 250 whose Title was “Бешенство у людей” (Rabies in Humans), a corresponding
SearchSQL query would be:</p>
        <preformat>SELECT RELEVANCE('2:3') AS REL, DOCNO
FROM CLEF04RU
WHERE FT_TEXT CONTAINS 'Бешенство'|'у'|'людей'
ORDER BY REL DESC;</preformat>
        <p>(Note that “у” is a stopword for Russian so its inclusion in the query won’t actually add any
matches.)</p>
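        <p>As a minimal sketch of how a test application might assemble such a query from a topic Title, the following Python function joins the Title words with the OR operator used in the query above. The function name build_searchsql and its parameters are hypothetical (they are not part of SearchServer), and the sketch assumes the words contain no characters that would need escaping.</p>
        <preformat>
```python
def build_searchsql(title, table, relevance_method="2:3"):
    """Build a boolean-OR SearchSQL query string from a topic Title.

    Hypothetical helper for illustration only; it is not a SearchServer API.
    """
    # The apostrophe is treated as a word separator, so split on it too.
    words = title.replace("'", " ").replace("\u2019", " ").split()
    # Quote each word and join the words with the OR operator.
    terms = "|".join("'{}'".format(word) for word in words)
    return (
        "SELECT RELEVANCE('{}') AS REL, DOCNO\n"
        "FROM {}\n"
        "WHERE FT_TEXT CONTAINS {}\n"
        "ORDER BY REL DESC;"
    ).format(relevance_method, table, terms)


if __name__ == "__main__":
    print(build_searchsql("Бешенство у людей", "CLEF04RU"))
```
        </preformat>
        <p>Called with the Title of Russian topic 250 and the table name CLEF04RU, the sketch produces the query text given above.</p>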
        <p>Most aspects of SearchServer’s relevance value calculation are the same as described last year
[10]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a
manner similar to Okapi [<xref ref-type="bibr" rid="ref5">6</xref>] and dampens the inverse document frequency using an approximation
of the logarithm. These calculations are based on the stems of the terms when doing morphological
searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’ was previously specified).</p>
        <p>An experimental new default is that SearchServer only includes morphological matches from
a compound word if all of its stems (from a particular stemming interpretation) are in the same or
consecutive words. For example, in German, a morphological search for the compound “babykost”
(baby food) will no longer match “baby” or “kost” by themselves, but it will match “babykost” and
“baby kost” (and if SET PHRASE_DISTANCE 1 is specified, it will also match the hyphenated
“baby-kost”). Words (and compounds) still match inside compounds (and larger compounds), e.g.
a search for “kost” still matches “babykost”. To restore the old behaviour of matching if just one
stem is in common, one can specify the /decompound option (e.g. SET TERM_GENERATOR
‘word!ftelp/inflect/decompound’). See Section 3.3.1 for several more decompounding examples.</p>
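        <p>The matching rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not SearchServer’s implementation: the STEMS table is a hypothetical toy lexicon standing in for the lexical stemmer, each word has a single stemming interpretation, and SET PHRASE_DISTANCE (hyphenated forms) is ignored. The names stems_of and match_positions are likewise hypothetical.</p>
        <preformat>
```python
# Hypothetical toy lexicon: word -> ordered stems (one interpretation each).
STEMS = {
    "babykost": ("baby", "kost"),
    "baby": ("baby",),
    "kost": ("kost",),
}


def stems_of(word):
    """Return the stems of a word; unknown words stem to themselves."""
    key = word.lower()
    return STEMS.get(key, (key,))


def match_positions(query_word, doc_words, decompound=False):
    """Return the indexes of document words where query_word matches."""
    query_stems = list(stems_of(query_word))
    # Flatten the document into (stem, word index) pairs, keeping word order.
    flat = [(s, i) for i, w in enumerate(doc_words) for s in stems_of(w)]
    hits = set()
    if decompound:
        # Old behaviour (/decompound): sharing any one stem is enough.
        for stem, index in flat:
            if stem in query_stems:
                hits.add(index)
    else:
        # New default: all query stems must occur, in order, within the
        # same word or in consecutive words.
        n = len(query_stems)
        for k in range(len(flat) - n + 1):
            if [s for s, _ in flat[k:k + n]] == query_stems:
                hits.add(flat[k][1])
    return sorted(hits)


if __name__ == "__main__":
    doc = ["baby", "kost", "und", "babykost", "kost"]
    print(match_positions("babykost", doc))
    print(match_positions("babykost", doc, decompound=True))
    print(match_positions("kost", doc))
```
        </preformat>
        <p>In this toy document, the default rule matches “babykost” at the two-word sequence “baby kost” and at the compound itself, while the /decompound variant also matches the isolated “kost”. A search for “kost” matches inside “babykost” in both modes.</p>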
        <p>This year’s experimental SearchServer version contains an enhancement for handling multiple
stemming interpretations. For each document, only the interpretation that produces the highest
score for the document is used in the relevance calculation (but all interpretations are still used for
matching and search term highlighting). Sometimes this enhancement causes the original query
form of the word to get more weight than some of its inflections (and it never gets less weight).
This approach overcomes the previous issue of terms with multiple stemming interpretations being
over-weighted; it used to be better for CLEF experiments to work around this by using the /single or
/noalt options, but Section 3.5 verifies that this is no longer the case.</p>
        <p>SearchServer’s RELEVANCE_METHOD setting can be used to optionally square the
importance of the inverse document frequency (by choosing a RELEVANCE_METHOD of ‘2:4’ instead
of ‘2:3’). The importance of document length to the ranking is controlled by SearchServer’s
RELEVANCE_DLEN_IMP setting (scale of 0 to 1000). For all experiments in this paper,
RELEVANCE_METHOD was set to ‘2:3’ and RELEVANCE_DLEN_IMP was set to 750.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Diagnostic Runs</title>
        <p>For the diagnostic runs listed in Table 2, the run names consist of a language code (“FI” for Finnish,
“FR” for French, “PT” for Portuguese and “RU” for Russian) followed by one of the following labels:</p>
        <list list-type="bullet">
          <list-item><p>“lex”: The run used SearchServer’s lexical stemming with decompounding enabled, i.e. SET
TERM_GENERATOR ‘word!ftelp/inflect/decompound’. (Of the investigated languages,
decompounding only makes a difference for Finnish.)</p></list-item>
          <list-item><p>“compound” (Finnish only): Same as “lex” except that /decompound was not specified.</p></list-item>
          <list-item><p>“single”: Same as “lex” except that /single was additionally specified (so that just one
stemming interpretation was used).</p></list-item>
        </list>
        <p>The primary evaluation measure in this paper is “mean average precision” based on the first 1000
retrieved documents for each topic (denoted “AvgP” in Tables 2 and 9). “Average precision” for a
topic is the average of the precision after each relevant document is retrieved (using zero as the
precision for relevant documents which are not retrieved). The score ranges from 0.0 (no relevants
found) to 1.0 (all relevants found at the top of the list). For a set of topics, all topics are weighted
equally by the mean. Average precision takes into account both precision and recall, and it is
very good for detecting retrieval differences because even small differences in the ranks of relevant
documents affect the score.</p>
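        <p>The definition of average precision above can be made concrete with a short Python sketch. The function names and the data layout (a ranked list of document identifiers per topic and a set of relevant identifiers per topic) are assumptions made for illustration; the sketch also assumes a ranked list contains no duplicate documents.</p>
        <preformat>
```python
def average_precision(ranked_docs, relevant, cutoff=1000):
    """Average precision for one topic over the first `cutoff` rows."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("topic has no relevant documents")
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            total += hits / rank
    # Relevant documents that were not retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(rankings, qrels, cutoff=1000):
    """Mean of the per-topic average precision; topics are weighted equally."""
    scores = [
        average_precision(rankings.get(topic, []), qrels[topic], cutoff)
        for topic in qrels
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Relevant documents at ranks 1 and 3; a third one is never retrieved.
    print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))
```
        </preformat>
        <p>In the example call, the precision values are 1/1 and 2/3 for the two retrieved relevant documents and zero for the missing one, so the score is their sum divided by 3.</p>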
        <p>A more experimental measure is “robustness at 10 documents” (denoted “Robust@10”) which
is the percentage of topics for which at least one relevant document was returned in the first 10
rows (this was one of the measures investigated in the TREC Robust Retrieval track last year
[11]). This measure hides a lot of retrieval differences (particularly in recall), but it may be an
indicator of a user’s impression of a method’s robustness across topics. We also list the Robust@1
and Robust@5 variants.</p>
        <list list-type="bullet">
          <list-item><p>“AvgDiff” is the difference of the mean scores of the two runs being compared (the table
heading says which evaluation measure is being compared).</p></list-item>
          <list-item><p>“95% Conf” is an approximate 95% confidence interval for the difference calculated using
Efron’s bootstrap percentile method<sup>2</sup> [<xref ref-type="bibr" rid="ref2">2</xref>] (using 100,000 iterations). If zero is not in the
interval, the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely
to be of neutral impact, though if the average difference is small (e.g. &lt;0.020) it may still
be too minor to be considered “significant” in the magnitude sense.</p></list-item>
          <list-item><p>“vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
compared to the second run. These numbers should always add to the number of topics (45
for Finnish, 49 for French, 46 for Portuguese, 34 for Russian).</p></list-item>
          <list-item><p>“3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the
topic number in brackets (the topic numbers range from 201 to 250). The first difference
is the largest one of any topic (based on the absolute value). The third difference is the
largest difference in the other direction (so the first and third differences give the range of
differences observed in this experiment). The middle difference is the largest of the remaining
differences (based on the absolute value).</p></list-item>
        </list>
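        <p>The comparison statistics described above can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiments: the function names, the data layout and the fixed random seed are assumptions, and the percentile interval is computed in the simplest way (sorting the resampled mean differences and reading off the 2.5% and 97.5% positions).</p>
        <preformat>
```python
import random


def robust_at_k(rankings, qrels, k=10):
    """Percentage of topics with a relevant document in the first k rows."""
    good = sum(
        1 for topic in qrels
        if set(rankings.get(topic, [])[:k]).intersection(qrels[topic])
    )
    return 100.0 * good / len(qrels)


def compare_runs(scores_a, scores_b):
    """Return AvgDiff and the (higher, lower, tied) topic counts."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = sum(1 for d in diffs if d > 0)
    losses = sum(1 for d in diffs if 0 > d)
    ties = len(diffs) - wins - losses
    return sum(diffs) / len(diffs), (wins, losses, ties)


def bootstrap_percentile_ci(diffs, iterations=100000, alpha=0.05, seed=0):
    """Approximate confidence interval for the mean of per-topic differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(diffs)
    means = []
    for _ in range(iterations):
        # Resample the topics with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * iterations)]
    high = means[int((1 - alpha / 2) * iterations) - 1]
    return low, high


if __name__ == "__main__":
    run_a = [0.5, 0.2, 0.3]
    run_b = [0.4, 0.2, 0.5]
    print(compare_runs(run_a, run_b))
    diffs = [a - b for a, b in zip(run_a, run_b)]
    print(bootstrap_percentile_ci(diffs, iterations=2000))
```
        </preformat>
        <p>If the returned interval excludes zero, the difference would be reported as statistically significant at the 5% level in the sense used above.</p>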
        <sec id="sec-2-2-1">
          <title>Results of Morphological Experiments</title>
          <p>This section looks at the differences between the runs of Table 2 in more detail.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Impact of Lexical Stemming</title>
        <p><sup>2</sup>See [9] for some comparisons of confidence intervals from the bootstrap percentile, Wilcoxon signed rank and
standard error methods for both average precision and Precision@10.</p>
        <p>Topic PT-229: Table 5 shows that the largest difference between the stemming approaches for
Portuguese was on topic 229 (Construção de Barragens (Dam Building)) in which average
precision was 53 points higher with SearchServer’s stemmer. The main reason was that, unlike the
algorithmic stemmer, the SearchServer stemmer matched “Barragem”, an inflection used in many
relevant documents. SearchServer additionally matched “construções” which may also have been
helpful.</p>
        <p>Topic PT-217: The next largest difference for Portuguese was on topic 217 (Sida em África
(AIDS in Africa)) for which Table 5 shows that average precision was 32 points higher with
SearchServer’s stemmer. The main reason was that, unlike SearchServer, the algorithmic
stemmer matched “sido”, a common word unrelated to AIDS, which decreased precision substantially.
SearchServer additionally matched “africanos” which may also have been helpful.</p>
        <p>Topic PT-204: The largest negative difference was on topic 204 (Vítimas de Avalanches
(Victims of Avalanches)) for which using the algorithmic stemmer scored 27 points higher. Both
stemmers matched “Avalanche” but the algorithmic stemmer additionally matched “avalancha”
which was the only variant used in 3 of the relevant documents. We should investigate this case
further.</p>
        <p><bold>Russian Stemming</bold></p>
        <p>Topic RU-227: Table 5 shows that the largest difference between the stemming approaches for
Russian was on topic 227 (Алтайская амазонка (Altai Ice Maiden)) for which average precision
was 40 points higher with SearchServer’s stemmer. SearchServer internally produced 2 stems
for “Алтайская” (Altai), itself and “Алтайскай”. The words which had “Алтайская” as a stem
(such as “Алтайской”, “Алтайские”, “Алтайскую” and “алтайских”) were less common in the
documents than the words which shared the “Алтайскай” stem (the same words plus more such
as “Алтайского”, “Алтайском” and “Алтайскому”), so SearchServer’s experimental new scoring
scheme for alternative stems gives the former group a higher weight from inverse document
frequency than the latter group. In this case, it turned out just 1 relevant document was matched
by either stemmer and it just used the original word “Алтайская”. The algorithmic stemmer
produced just one stem for these words, so its weighting did not have a preference for the query form
and some documents with the second group of terms ended up ranking higher. The algorithmic
stemmer additionally matched “Алтайске” which was not helpful in this case. This topic illustrates
a benefit from SearchServer’s experimental new handling of multiple stemming interpretations.</p>
        <p>Topic RU-202: The next largest difference was on topic 202 (Арест Ника Леесон (Nick Leeson’s
Arrest)) for which the score was 20 points higher with SearchServer’s stemmer. The 3 relevant
documents used different spellings for “Leeson” (“Лисон”, “Лизона”, “Лизон” and “Лисона”) which
did not match the query form of “Леесон” with either stemmer. And inflections of “Арест” (Arrest)
did not appear in the relevant documents. So the matches just came from variants of “Nick”.
Both stemmers matched the forms used in the relevant documents (“Ника” and “Ник”). But the
algorithmic stemmer additionally matched other terms such as “Никому” and “никого” which
lowered precision substantially in this case.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Impact of Decompounding (Finnish)</title>
        <p>The first row of Table 6 (“FI lex-cmpd”) isolates the impact of SearchServer’s experimental new
“/decompound” option for Finnish (decompounding is not new to SearchServer for Finnish, but
an option to control its impact separately from inflectional stemming at search-time is). This
option allows words to match if they share any stem of query compound words. Without the
/decompound option, the (experimental new) default is to require all the stems of a compound
word to be in the same or consecutive words to be considered a match. Table 6 shows that mean
average precision was 9 points higher with /decompound set, and this difference was statistically
significant.</p>
        <p>The second row of Table 6 (“FI cmpd-none”) shows that even without the /decompound option,
use of SearchServer’s stemming for Finnish scored 14 points higher than not using stemming. (Note
that the first two rows of Table 6 add up to the 23 point gain from lexical stemming shown in
Table 3.)</p>
        <p>The third row of Table 6 (“FI cmpd-alg”) compares SearchServer’s stemming without the
/decompound option to algorithmic stemming (which does not even decompound at index-time)
and shows that using SearchServer’s stemmer scored 4.5 points higher, though this difference did
not quite pass the statistical significance test. (SearchServer’s stemming with the /decompound
option is compared to algorithmic stemming in Table 5 in which the difference is the sum of the
differences of rows 1 and 3 of Table 6.)</p>
        <p>We look at some Finnish topics in more detail to understand these results better.</p>
        <p>Topic FI-210: Table 6 shows that the largest impact of Finnish decompounding was on topic
210 (Nobel rauhanpalkintoehdokkaat (Nobel Peace Prize Candidates)) for which using
SearchServer’s stemmer with the /decompound option scored 98 points higher than not using
/decompound (and also 98 points higher than using the algorithmic stemmer according to Table 5). This
topic had just 1 relevant document, and the only match for the non-decompounding approaches
was the word “Nobel” which occurred in lots of documents, so the relevant document did not
stand out among them. With SearchServer’s decompounding, many more words in the relevant
document matched such as “rauhan”, “rauhanpalkituksi”, “rauhanpalkinnon”, “rauhanvälittäjänä”,
“ehdokasta” and “ehdokkaina” because these words shared at least one (but not all) of the stems of
the query compound “rauhanpalkintoehdokkaat”, and the relevant document was ranked first.</p>
        <p>Topic FI-226: Table 6 shows that the next largest impact of Finnish decompounding was on
topic 226 (Sukupuolenvaihdosleikkaukset (Sex-change Operations)) for which using SearchServer’s
stemmer with the /decompound option scored 72 points higher than not using /decompound
(and also 86 points higher than using the algorithmic stemmer according to Table 5). The
algorithmic stemmer found just the one of the 13 relevant documents which contained the query
word “Sukupuolenvaihdosleikkaukset”. SearchServer without /decompound matched that
document plus 3 other relevants, two of which contained “sukupuolen vaihdosleikkaukseen” (an example of
a consecutive-word match) and one of which contained “Sukupuolenvaihdosleikkausta”. SearchServer
with /decompound matched all 13 relevant documents; the key additional matches appeared to
be “Sukupuolen-vaihdos”, “sukupuolenvaihtoleikkaukset”, “sukupuolenvaihdot”,
“Sukupuolenvaihdoshan”, “sukupuolenkorjausleikkausten” and “sukupuolenvahvistusleikkaus”, though other
matching words may also have been helpful such as “leikkaussali”, “sukupuoli” and “vaihdos”.</p>
        <p>Topic FI-219: Table 6 shows that the largest negative impact of Finnish decompounding
was on topic 219 (EU:n komissaariehdokkaat (EU Commissioner Candidates)) for which using
SearchServer’s stemmer with the /decompound option scored 18 points lower than not using
/decompound (and also 15 points lower than using the algorithmic stemmer according to Table 5).
Without the /decompound option, SearchServer found a lot of precise matches in relevant
documents such as “komissaariehdokasta”, “komissaariehdokkaalle”, “komissaariehdokkaista”,
“komissaariehdokkaalta”, “komissaariehdokkaiden” and “komissaariehdokkaan”. Furthermore, in some
relevant documents it found matches in larger compounds (which the algorithmic stemmer could
not) such as “naiskomissaariehdokasta” and “tanskalaiseltakomissaariehdokkaalta”. With
/decompound set, SearchServer would also find all these matches, but precision was substantially hurt in
this case by additionally matching terms in non-relevant documents such as “jäsenehdokkaiden”,
“jäsenehdokkaita”, “ykkösehdokkaista”, “tutkimuskomissaari” and “henkilöstökomissaari”. This
topic shows why a user may prefer to have /decompound not set; in cases where the user does not
need the component words to occur together, the user can either manually separate the terms or
set the /decompound option. But for automatic ad hoc searches for topics, it is better on average
to use the /decompound option.</p>
        <p>Table 7 shows the impact of applying the algorithmic stemmer to the result of SearchServer’s
stemmer (this is possible because SearchServer’s stemmer returns real words; the other order would
not work because the algorithmic stemmer often truncates to a non-word). This approach would
still produce all the matches of SearchServer’s stemming and may sometimes produce additional
matches from algorithmic stemming. However, there was a decrease in mean average precision for
Russian which was borderline significant. The other differences were not statistically significant.
While algorithmic stemming may occasionally add a helpful match, it can also add poor matches
that hurt precision. In a future experiment, perhaps it would be better to treat algorithmic stems
as alternative stemming interpretations (instead of replacing the lexical stem) so that lexical
inflections are likely to get higher weight when the algorithmic stem is too common.</p>
        <p>In the identifiers of the runs submitted for assessment in May 2004 (e.g. “humFI04tde”), the first
3 letters “hum” indicate a Hummingbird submission, the next 2 letters are the language code, and
the number “04” indicates CLEF 2004. “t”, “d” and “n” indicate that the Title, Description and
Narrative fields of the topic were used (respectively). “e” indicates that query expansion from blind
feedback on the first 2 rows was used (see last year’s paper [10] for more details). The submitted
runs all used inflections from SearchServer’s lexical stemming (including decompounding where
applicable). The scores of the submitted runs are listed in Table 9.</p>
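        <p>The run identifier convention described above can be decoded mechanically. The following Python sketch is illustrative only: the function name parse_run_id and the layout of the returned dictionary are assumptions, and the pattern accepts only the letter order given in the description (t, d, n, then e).</p>
        <preformat>
```python
import re

# "hum" + 2-letter language code + 2-digit year + optional t/d/n/e flags.
RUN_ID = re.compile(r"^hum([A-Z]{2})(\d{2})(t?)(d?)(n?)(e?)$")


def parse_run_id(run_id):
    """Decode a submitted run identifier such as "humFI04tde"."""
    match = RUN_ID.match(run_id)
    if match is None:
        raise ValueError("unrecognized run identifier: " + run_id)
    language, year, t, d, n, e = match.groups()
    fields = []
    if t:
        fields.append("Title")
    if d:
        fields.append("Description")
    if n:
        fields.append("Narrative")
    return {
        "language": language,
        "year": 2000 + int(year),
        "fields": fields,
        # "e" marks query expansion from blind feedback.
        "expansion": bool(e),
    }


if __name__ == "__main__":
    print(parse_run_id("humFI04tde"))
```
        </preformat>
        <p>For “humFI04tde” the sketch reports a Finnish CLEF 2004 run that used the Title and Description fields with query expansion.</p>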
        <p>The submitted Title-only runs (e.g. “humFI04t” of Table 9) correspond to the “lex” diagnostic
runs (e.g. “FI-lex” of Table 2) except that the submitted runs used an older experimental version
of SearchServer (including an older version of the lexical stemming component) so the scores are
not exactly the same.</p>
        <p>[Table 9 run identifiers: humFI04t, humFI04td, humFI04tde]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Systems)</title>
    </sec>
    <sec id="sec-4">
      <title>Home</title>
      <p>[4] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page.
http://research.nii.ac.jp/~ntcadm/index-en.html</p>
    </sec>
    <sec id="sec-5">
      <title>Robust</title>
      <p>Conference</p>
    </sec>
    <sec id="sec-6">
      <title>Retrieval</title>
      <p>(TREC</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>Cross-Language Evaluation Forum web site</source>
          . http://www.clef-campaign.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bradley</given-names>
            <surname>Efron</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert J.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>An Introduction to the Bootstrap</article-title>
          .
          <year>1993</year>
          . Chapman &amp; Hall/CRC.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Hodgson</surname>
          </string-name>
          .
          <article-title>Converting the Fulcrum Search Engine to Unicode</article-title>
          . In Sixteenth International Unicode Conference, Amsterdam, The Netherlands,
          <year>March 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>Snowball: A language for stemming algorithms</article-title>
          .
          <year>October 2001</year>
          . http://snowball.tartarus.org/texts/introduction.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          (City University).
          <article-title>Okapi at TREC-3</article-title>
          . In D. K. Harman, editor,
          <source>Overview of the Third Text REtrieval Conference (TREC-3)</source>
          . NIST Special Publication 500-226. http://trec.nist.gov/pubs/trec3/t3_proceedings.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>