<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Report on CLEF-2003 Monolingual Tracks: Fusion of Probabilistic Models for Effective Monolingual Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut interfacultaire d'informatique, Université de Neuchâtel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <abstract>
        <p>For our third participation in the CLEF evaluation campaign, our first objective was to propose more effective and general stopword lists for the Swedish, Finnish and Russian languages along with an improved, more efficient and simpler stemming procedure for these three languages. Our second goal was to suggest a combined search approach based on a data fusion strategy that would work with various European languages. Included in this combined approach is a decompounding strategy for the German, Dutch, Swedish and Finnish languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>1. Overview of the Test-Collections</title>
      <p>Tables 1a and 1b also compare the number of relevant documents per request, with the mean always being
greater than the median (e.g., for the English collection, the average number of relevant documents per query is
18.63 with the corresponding median being 7). These findings indicate that, in each collection, numerous
queries have only a rather small number of relevant items. For each collection, 60 queries were
created. However, relevant documents cannot be found for each request in each language. The following
queries have no relevant items: for the English collection, Queries #149, #161, #166, #186, #191, and #195;
for the French corpus, Queries #146, #160, #161, #166, #169, #172, #191, #194; for the German collection, Queries
#144, #146, #170, #191; for the Spanish collection, Queries #169, #188, #195; for the Italian collection,
Queries #144, #146, #158, #160, #169, #170, #172, #175, #191; for the Dutch collection, Queries #160,
#166, #191, #194; for the Swedish collection, Queries #146, #160, #167, #191, #194, #197, #198; and for the
Finnish corpus, Queries #141, #144, #145, #146, #160, #167, #169, #175, #182, #186, #188, #189, #191,
#194, #195. Appearing for the first time in a CLEF evaluation campaign is the Russian corpus, for which we
have only 28 requests.</p>
      <p>During the indexing process of our automatic runs, we retained only the following logical sections from the
original documents: &lt;TITLE&gt;, &lt;HEADLINE&gt;, &lt;TEXT&gt;, &lt;LEAD&gt;, &lt;LEAD1&gt;, &lt;TX&gt;, &lt;LD&gt;, &lt;TI&gt; and &lt;ST&gt;.
From the topic descriptions we automatically removed certain phrases such as "Relevant document report …",
"Find documents …", "Trouver des documents qui parlent …", "Sono valide le discussioni e le decisioni …",
"Relevante Dokumente berichten …" or "Los documentos relevantes proporcionan información …".</p>
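      <table-wrap>
        <caption><p>Test-collection statistics for the English and French corpora</p></caption>
        <table>
          <thead>
            <tr><th></th><th>English</th><th>French</th></tr>
          </thead>
          <tbody>
            <tr><td>Size (in MB)</td><td>579</td><td>331</td></tr>
            <tr><td># of documents</td><td>169,477</td><td>129,806</td></tr>
            <tr><td># of distinct terms</td><td>426,757</td><td>355,691</td></tr>
            <tr><td colspan="3">Number of distinct indexing terms / document</td></tr>
            <tr><td>Mean</td><td>156.9</td><td>118.5</td></tr>
            <tr><td>Standard deviation</td><td>118.77</td><td>95.72</td></tr>
            <tr><td>Median</td><td>129</td><td>89</td></tr>
            <tr><td>Maximum</td><td>1,881</td><td>1,621</td></tr>
            <tr><td>Minimum</td><td>2</td><td>3</td></tr>
            <tr><td>Number of queries</td><td>54</td><td></td></tr>
            <tr><td>Number rel. items</td><td>1,006</td><td></td></tr>
            <tr><td>Mean rel. / request</td><td>18.63</td><td></td></tr>
            <tr><td>Standard deviation</td><td>28.61</td><td></td></tr>
            <tr><td>Median</td><td>7</td><td></td></tr>
            <tr><td>Maximum</td><td>139 (#Q:157)</td><td></td></tr>
            <tr><td>Minimum</td><td>1 (#Q:141)</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>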
    </sec>
    <sec id="sec-3">
      <title>2. Stopword Lists and Stemming Procedures</title>
      <p>
        In order to define general stopword lists, we first accounted for the top 200 most frequent words found in the
various languages, together with articles, pronouns, prepositions, conjunctions or very frequently occurring verb
forms (e.g., to be, is, has, etc.). As compared to last year's stopword lists [
        <xref ref-type="bibr" rid="ref11">Savoy 2002</xref>
        ], we only modified
those for the Swedish and Finnish languages, and we created a new one for the Russian language (these lists are
available at www.unine.ch/info/clef/). For English we used the list provided by the SMART system (571
words), while for the other European languages, our stopword list contained 430 words for Italian, 463 for
French, 603 for German, 351 for Spanish, 1,315 for Dutch, 747 for Finnish, 386 for Swedish and 420 for
Russian.
      </p>
      <p>Once it removes high-frequency words, an indexing procedure generally applies a stemming algorithm in an
attempt to conflate word variants into the same stem or root. In developing this procedure for various European
languages, we first wanted to remove only inflectional suffixes such as singular and plural word forms, and also
feminine and masculine forms, such that they conflate to the same root. Our suggested stemmers also try to
reduce various word declensions into the same stem, such as those used in the German, Finnish and Russian
languages.</p>
      <p>
        More sophisticated schemes have already been proposed for the removal of derivational suffixes (e.g., "-ize",
"-ably", "-ship" in the English language), such as the stemmer developed by
        <xref ref-type="bibr" rid="ref8">Lovins [1968</xref>
        ] (based on a list of over 260
suffixes), or that of
        <xref ref-type="bibr" rid="ref9">Porter [1980</xref>
        ] (which looks for about 60 suffixes). For the French language only, our
stemming approach tried to remove some derivational suffixes (e.g., "communicateur" -&gt; "communiquer",
"faiblesse" -&gt; "faible"). For the Dutch language we used the
        <xref ref-type="bibr" rid="ref7">Kraaij &amp; Pohlmann's stemmer [Kraaij 1996</xref>
        ]. Our
various stemming procedures can be found at www.unine.ch/info/clef/. Currently, it is not clear whether a
stemming procedure such as ours, which removes only inflectional suffixes from nouns and adjectives, yields better
retrieval effectiveness than a stemming approach that also accounts for verbs or that removes both
inflectional and derivational suffixes.
      </p>
      <p>Finally, diacritic characters are usually not present in English collections (with some exceptions, such as
"résumé"); and, as with the Italian, Dutch, Finnish, Swedish, German, Spanish and Russian languages, these
characters are replaced by their corresponding unaccented letters. For Russian, we convert and
normalize the Cyrillic Unicode characters into the Latin alphabet (Perl script available at www.unine.ch/clef/).</p>
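      <p>The replacement of accented letters described above can be sketched as follows. This is only a minimal Python illustration based on Unicode decomposition, not the Perl script mentioned in the text; the function name strip_diacritics is hypothetical, and the Cyrillic-to-Latin conversion is not covered.</p>
      <preformat>
```python
import unicodedata


def strip_diacritics(text):
    """Replace accented letters by their base letters (e.g. 'é' becomes 'e').

    Assumption: canonical decomposition (NFD) followed by removal of the
    combining marks is an acceptable stand-in for the replacement step.
    Letters without a decomposition (e.g. 'ß', 'ø') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("résumé", "työviikko", "präsident"):
        print(word, strip_diacritics(word))
```
      </preformat>
      <p>Applying the same function to documents and to queries keeps both representations consistent, which is the point of the normalization.</p>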
    </sec>
    <sec id="sec-4">
      <title>3. Decompounding Words</title>
      <p>
        Most European languages manifest other morphological characteristics with compound word constructions
being just one example (e.g., handgun, worldwide). In German for example, compound words are widely used
and they may cause more difficulties than do those in English. For example, an insurance company would be
"Versicherungsgesellschaft" ("Versicherung" + "S" + "Gesellschaft"). However the morphological marker ("S")
is not always present (e.g., "Atomtests" built as "Atom" + "Tests"), and sometimes the letter "S" belongs to the
decompounded word (e.g., "Wintersports" for "Winter" + "Sports"). In Finnish, we also encounter similar
constructions such as "rakkauskirje" ("rakkaus" + "kirje" for love &amp; letter) or "työviikko" ("työ" + "viikko"
for work &amp; week). Recently,
        <xref ref-type="bibr" rid="ref3">Braschler [2003</xref>
        ] showed that decompounding German words may significantly
improve retrieval performance.
      </p>
      <p>
        Our proposed decompounding approach shares some similarity with
        <xref ref-type="bibr" rid="ref5">Chen's algorithm [2002</xref>
        ]. Before using
it, we create a word list composed of all words appearing in the given collection (without stemming).
Associated with each word, we also store the number of its occurrences in the collection (some examples are
given in Table 2).
      </p>
      <p>Table 2: Examples of entries in the German word list: computer, computers, sicherheit, sicher, heit, bank, bund, bundes, bundesbank, präsident.</p>
      <p>In order to present an overview of our decompounding approach, we take as an example the German
word "Computersicherheit", composed of "Computer" + "Sicherheit" (security). This compound word does not
appear in our German word list as depicted in Table 2, so our algorithm starts the decompounding process by
attempting to split the word before its last k = 4 letters (giving the two strings "computersicher" and "heit").
During the entire procedure, we only consider words having a length greater than a given threshold (fixed at 3
for all languages in our experiments). If both components appear in the word list, then we have a candidate for
decompounding; otherwise the k limit is increased by one. Since, in our case, the string "computersicher" does
not appear in the German word list, this split is rejected. When k = 9, our algorithm will find the word
"computers" in the word list, but will fail to find the word "icherheit". With k = 10, our algorithm will find
both words "computer" and "sicherheit" in the German word list (see Table 2) and this solution becomes the
top-level decompounding suggestion. Recursively, the system now tries to decompound the two parts, namely
the words "computer" and "sicherheit". During this recursive process, the system is allowed to ignore some
short sequences of letters at the end of a word (such as "-s" or "-es" in German, or "-s" for the Swedish language)
because such morphological markers may indicate the genitive form (such as "'s" in the noun phrase "John's
book").</p>
      <p>After this generative part, the system returns a tree of the possible ways in which the compound
construction can be broken down, and with each component we find the number of its occurrences in the
corpus. In our example, the answer will be (computer 2452, sicherheit 6583 (sicher 4522, heit 4)). Thus, from
this result, we know that the word "Sicherheit" appears 6583 times in the corpus, and we may consider
decompounding this term into the words "sicher" and "heit". From this we can add to (or replace) the compound
word in the document (or in the request) all decompounding candidates ("computer" + "sicherheit", and
"computer" + "sicher" + "heit" in our case) or only the candidate having the minimum number of components
("computer" + "sicherheit" in our case).</p>
      <p>However, when faced with multiple candidates, our algorithm will try to select the single "best" one. To
achieve this, our system will consider the total number of occurrences for the component words and if this value
is greater than the number of occurrences for the compound construction, the decompounded candidate will be
selected. In our example, the system will not decompound the word "Sicherheit" because the number of
occurrences of the words "sicher" (4522) and "heit" (4) will not produce a total (4526) greater than the number of
occurrences of the word "sicherheit" (6583).</p>
      <p>If we consider the German word "Bundesbankpräsident" (president of the (German) federal bank), the
generative part of our algorithm would return (bundesbank 1453 (bund 7032, bank 9657), präsident 24041) and
the final decompounding approach would return (bund 7032, bank 9657, präsident 24041). In this case, the
number of occurrences of "bundesbank" (1453) is smaller than the sum of the occurrences of the words "bund"
and "bank". However, our approach does not always generate the appropriate components of a compounded
term. For example, based on the compound construction "wintersports", the system answers with (winter 1643,
port 1091) instead of (winter 1643, sport 1483). This problem is due to the fact that the first part of our
approach ignores backtracking and will stop when it encounters the first splitting of the compound into two
parts.</p>
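      <p>The splitting and selection steps described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the toy word list contains only the words whose occurrence counts are quoted in the text, the names LEXICON, lookup, first_split and decompound are hypothetical, and the generative and selection phases are merged into one recursive function.</p>
      <preformat>
```python
# Toy word list: only the words whose occurrence counts are quoted in the text.
LEXICON = {
    "computer": 2452, "sicherheit": 6583, "sicher": 4522, "heit": 4,
    "bundesbank": 1453, "bund": 7032, "bank": 9657, "präsident": 24041,
    "winter": 1643, "port": 1091, "sport": 1483,
}
LINKERS = ("es", "s")   # short endings that may be ignored (German)
MIN_LEN = 4             # components must be longer than 3 letters
START_K = 4             # first split isolates the last 4 letters


def lookup(part):
    """Return the word-list form of part, optionally without a linker, or None."""
    candidates = [part] + [part[:-len(s)] for s in LINKERS if part.endswith(s)]
    for cand in candidates:
        if len(cand) >= MIN_LEN and cand in LEXICON:
            return cand
    return None


def first_split(word):
    """Increase k until both parts are found; stop at the first success."""
    for k in range(START_K, len(word) - MIN_LEN + 1):
        head, tail = lookup(word[:-k]), lookup(word[-k:])
        if head and tail:
            return head, tail
    return None


def decompound(word):
    """Split recursively while the components occur more often than the whole."""
    word = word.lower()
    parts = first_split(word)
    if parts is None:
        return [word]
    # A compound absent from the word list counts as zero occurrences.
    if sum(LEXICON[p] for p in parts) > LEXICON.get(word, 0):
        return [piece for p in parts for piece in decompound(p)]
    return [word]


if __name__ == "__main__":
    for w in ("Computersicherheit", "Bundesbankpräsident", "wintersports"):
        print(w, decompound(w))
```
      </preformat>
      <p>Because first_split stops at the first successful split and never backtracks, this sketch also shows the "wintersports" problem discussed above: the split after "winters" (reduced to "winter") is accepted together with "ports" (reduced to "port") before the split "winter" + "sports" is ever tried.</p>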
    </sec>
    <sec id="sec-5">
      <title>4. Indexing and Searching Strategy</title>
      <p>In order to obtain a broader view of the relative merit of various retrieval models, we first adopted a binary
indexing scheme within which each document (or request) is represented by a set of keywords, without any
weight. To measure the similarity between documents and requests, we computed the inner product (retrieval
model denoted "doc=bnn, query=bnn" or "bnn-bnn"). In order to weight the presence of each indexing term in a
document surrogate (or in a query), we could account for the term occurrence frequency (retrieval model notation:
"doc=nnn, query=nnn" or "nnn-nnn") or we might also account for their frequency in the collection (or more
precisely the inverse document frequency, denoted by idfj). Moreover, a cosine normalization could prove
beneficial and each indexing weight could vary within the range of 0 to 1 (retrieval model notation: "ntc-ntc",
Table 3 depicts the exact weighting formulation).</p>
      <p>
        Other variants might also be created. For example, the tf component may be computed as 0.5 + 0.5 · [tf /
max tf in a document] (retrieval model denoted "doc=atn"). We might also consider that a term's presence in a
shorter document provides stronger evidence than it does in a longer document, leading to more complex IR
models; for example, the IR model denoted by "doc=Lnu" [
        <xref ref-type="bibr" rid="ref4">Buckley 1996</xref>
        ], "doc=dtu" [
        <xref ref-type="bibr" rid="ref12">Singhal 1999</xref>
        ].
      </p>
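      <p>As an illustration of the vector-space schemes just described, the following Python sketch computes "atn" and "ntc" document weights and the inner product used as similarity measure. It is a minimal sketch under an explicit assumption: the idf component is taken as ln(n / dfj), a formula that this section does not spell out. All function names are hypothetical.</p>
      <preformat>
```python
import math


def idf(n_docs, df):
    # Assumption: idf_j = ln(n / df_j); the section does not give the formula.
    return math.log(n_docs / df)


def atn_weights(tf, df, n_docs):
    """atn: idf_j * (0.5 + 0.5 * tf_ij / max tf in the document)."""
    max_tf = max(tf.values())
    return {t: idf(n_docs, df[t]) * (0.5 + 0.5 * f / max_tf)
            for t, f in tf.items()}


def ntc_weights(tf, df, n_docs):
    """ntc: tf_ij * idf_j with cosine normalization (weights between 0 and 1)."""
    raw = {t: f * idf(n_docs, df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def inner_product(doc, query):
    """Similarity between a document and a request (both dicts term -> weight)."""
    return sum(w * query[t] for t, w in doc.items() if t in query)


if __name__ == "__main__":
    tf = {"bank": 3, "bund": 1}          # term frequencies in one document
    df = {"bank": 100, "bund": 10}       # hypothetical document frequencies
    doc = ntc_weights(tf, df, n_docs=1000)
    query = ntc_weights({"bank": 1}, df, n_docs=1000)
    print(doc, inner_product(doc, query))
```
      </preformat>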
      <p>
        Besides the previous models based on the vector-space approach, we also considered probabilistic models.
In this vein, we used the Okapi probabilistic model [
        <xref ref-type="bibr" rid="ref10">Robertson 2000</xref>
        ] within which:
      </p>
      <p><disp-formula><tex-math>K = k_1 \cdot \left[(1 - b) + b \cdot (l_i / avdl)\right]</tex-math></disp-formula>
Here K accounts for document length through the ratio between the length of Di, measured by li (the sum of tfi j), and the collection mean, denoted by avdl.
In Table 3, the value of nti indicates the number of distinct indexing terms included in the representation of Di.</p>
      <p>
        As a second probabilistic approach, we implemented the Prosit (PRObabilistic Sift of Information Terms)
approach [
        <xref ref-type="bibr" rid="ref1 ref2">Amati 2002</xref>
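        a, 2002b] which is based on the following indexing formula:
        <disp-formula><tex-math>w_{ij} = Inf^1_{ij} \cdot Inf^2_{ij} = (1 - Prob^1_{ij}) \cdot Inf^2_{ij}</tex-math></disp-formula>
        with
        <disp-formula><tex-math>Prob^1_{ij} = tfn_{ij} / (tfn_{ij} + 1)</tex-math></disp-formula>
      </p>
      <p>with
        <disp-formula><tex-math>tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + ((C \cdot mean\ dl) / l_i)\right]</tex-math></disp-formula>
        <disp-formula><tex-math>Inf^2_{ij} = -\log_2\left[1 / (1 + \lambda_j)\right] - tfn_{ij} \cdot \log_2\left[\lambda_j / (1 + \lambda_j)\right]</tex-math></disp-formula>
        <disp-formula><tex-math>\lambda_j = tc_j / n</tex-math></disp-formula>
in which tcj indicates the number of occurrences of term tj in the collection and n the number of documents in
the corpus. In our experiments, the constants b, k1, avdl, pivot, slope, C and mean dl are fixed according to
the values listed in Table 4.</p>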
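      <p>The Okapi length factor K and the Prosit indexing weight defined above translate directly into code. The following Python sketch is only an illustration of these formulas; the function names are hypothetical and the constants used in the example (k1 = 1.2, b = 0.75, C = 1) are placeholders, not the values of Table 4.</p>
      <preformat>
```python
import math


def okapi_k(k1, b, doc_len, avdl):
    """K = k1 * [(1 - b) + b * (l_i / avdl)]."""
    return k1 * ((1 - b) + b * doc_len / avdl)


def prosit_weight(tf, doc_len, tc, n_docs, c, mean_dl):
    """w_ij = (1 - Prob1_ij) * Inf2_ij for one term in one document.

    tf      : occurrences of the term in the document (tf_ij)
    doc_len : document length l_i
    tc      : occurrences of the term in the collection (tc_j)
    n_docs  : number of documents in the corpus (n)
    c, mean_dl : the constants C and mean dl
    """
    tfn = tf * math.log2(1 + (c * mean_dl) / doc_len)
    prob1 = tfn / (tfn + 1)
    lam = tc / n_docs
    inf2 = -math.log2(1 / (1 + lam)) - tfn * math.log2(lam / (1 + lam))
    return (1 - prob1) * inf2


if __name__ == "__main__":
    # Placeholder constants, chosen only to exercise the formulas.
    print(okapi_k(k1=1.2, b=0.75, doc_len=150, avdl=100))
    print(prosit_weight(tf=2, doc_len=150, tc=40, n_docs=1000, c=1, mean_dl=100))
```
      </preformat>
      <p>With a document of average length (li = mean dl) and C = 1, the normalized frequency tfni j equals tfi j, which gives a quick sanity check of an implementation.</p>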
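      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Scheme</th><th>Weighting formula</th></tr>
          </thead>
          <tbody>
            <tr><td>bnn</td><td></td></tr>
            <tr><td>ltn</td><td><inline-formula><tex-math>w_{ij} = (\ln(tf_{ij}) + 1) \cdot idf_j</tex-math></inline-formula></td></tr>
            <tr><td>atn</td><td><inline-formula><tex-math>w_{ij} = idf_j \cdot [0.5 + 0.5 \cdot tf_{ij} / \max tf_{i.}]</tex-math></inline-formula></td></tr>
            <tr><td>dtn</td><td></td></tr>
            <tr><td>npn</td><td><inline-formula><tex-math>w_{ij} = tf_{ij} \cdot \ln[(n - df_j) / df_j]</tex-math></inline-formula></td></tr>
            <tr><td>Okapi</td><td>denominator: <inline-formula><tex-math>K + tf_{ij}</tex-math></inline-formula></td></tr>
            <tr><td>Lnu</td><td><inline-formula><tex-math>w_{ij} = \frac{(1 + \ln(tf_{ij})) / (\ln(mean\ tf) + 1)}{(1 - slope) \cdot pivot + slope \cdot nt_i}</tex-math></inline-formula></td></tr>
            <tr><td>lnc</td><td></td></tr>
            <tr><td>ntc</td><td><inline-formula><tex-math>w_{ij} = \frac{tf_{ij} \cdot idf_j}{\sqrt{\sum_{k=1}^{t} (tf_{ik} \cdot idf_k)^2}}</tex-math></inline-formula></td></tr>
            <tr><td>ltc</td><td></td></tr>
            <tr><td>dtu</td><td><inline-formula><tex-math>w_{ij} = \frac{(\ln(\ln(tf_{ij}) + 1) + 1) \cdot idf_j}{(1 - slope) \cdot pivot + slope \cdot nt_i}</tex-math></inline-formula></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-5-1">
        <title>Language</title>
      </sec>
      <sec id="sec-5-2">
        <title>English</title>
        <table-wrap>
          <label>Table 4</label>
          <caption><p>Languages and indexing units for which the constants b, k1, avdl, pivot, slope, C and mean dl are set</p></caption>
          <table>
            <thead>
              <tr><th>Language</th><th>Index</th></tr>
            </thead>
            <tbody>
              <tr><td>English</td><td>word</td></tr>
              <tr><td>French</td><td>word</td></tr>
              <tr><td>Spanish</td><td>word</td></tr>
              <tr><td>German</td><td>word</td></tr>
              <tr><td>German</td><td>5-gram</td></tr>
              <tr><td>Italian</td><td>word</td></tr>
              <tr><td>Dutch</td><td>word</td></tr>
              <tr><td>Dutch</td><td>5-gram</td></tr>
              <tr><td>Finnish</td><td>word</td></tr>
              <tr><td>Finnish</td><td>5-gram</td></tr>
              <tr><td>Swedish</td><td>word</td></tr>
              <tr><td>Swedish</td><td>4-gram</td></tr>
              <tr><td>Russian</td><td>word</td></tr>
              <tr><td>Russian</td><td>5-gram</td></tr>
              <tr><td>Russian</td><td>4-gram</td></tr>
            </tbody>
          </table>
        </table-wrap>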
        <p>To evaluate our approaches, we used the SMART system as a test bed running on an Intel Pentium III/600
(memory: 1 GB, swap: 2 GB, disk: 6 x 35 GB). To measure the retrieval performance, we adopted the
noninterpolated mean average precision (computed on the basis of 1,000 retrieved items per request by the
TRECEVAL program). We indexed the English, French, Spanish and Italian collections using words as indexing
units. The evaluation of our two probabilistic models and nine vector-space schemes is given in Table 5a.</p>
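        <p>For readers unfamiliar with the evaluation measure, the following Python sketch shows how noninterpolated average precision over the first 1,000 retrieved items, and its mean over queries, can be computed. It is an illustration written for this report, not the TRECEVAL program itself; the function names are hypothetical, and queries without relevant items are skipped, in line with the query counts reported per collection.</p>
        <preformat>
```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Noninterpolated average precision for one request.

    The precision is measured at the rank of each relevant item retrieved
    within the cutoff, and the sum is divided by the total number of
    relevant items (retrieved or not).
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("request has no relevant items")
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, qrels, cutoff=1000):
    """runs: query -> ranked list of ids; qrels: query -> relevant ids."""
    queries = [q for q in qrels if qrels[q]]   # skip requests without relevant items
    scores = [average_precision(runs.get(q, []), qrels[q], cutoff) for q in queries]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    runs = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}, "q3": set()}
    print(mean_average_precision(runs, qrels))
```
        </preformat>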
        <p>In order to represent German, Dutch, Swedish, Finnish and Russian documents and queries, we considered
the n-gram, decompounded and word-based indexing schemes. The resulting mean average precision for these
various indexing approaches is shown in Table 5b (German and Dutch corpora), in Table 5c (Swedish and
Finnish languages) and in Table 5d (Russian collection).</p>
        <p>
          It was observed that pseudo-relevance feedback (blind-query expansion) seems to be a useful technique for
enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [
          <xref ref-type="bibr" rid="ref4">Buckley 1996</xref>
          ] with α = 0.75,
β = 0.75, whereby the system was allowed to add to the original query m terms extracted from the k best-ranked
documents. To evaluate this proposition, we used the Okapi and the Prosit probabilistic models and we
enlarged the query by 10 to 175 terms taken from the 3 to 10 best-retrieved articles.
        </p>
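        <p>A blind-query expansion of this kind can be sketched as follows in Python. This is a minimal sketch, assuming the usual positive-feedback form of Rocchio's formula (new query = α · original query + β · centroid of the k best-ranked documents); the exact variant implemented in SMART is not detailed here, and the function name rocchio_expand is hypothetical.</p>
        <preformat>
```python
def rocchio_expand(query, top_docs, m, alpha=0.75, beta=0.75):
    """Expand a query with the m strongest new terms of the k best documents.

    query    : dict term -> weight (original request)
    top_docs : list of dicts term -> weight, one per best-ranked document
    m        : number of new terms that may be added
    """
    k = len(top_docs)
    centroid = {}
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight / k
    # Candidate terms are those not already in the query; ties broken by name.
    new_terms = sorted((t for t in centroid if t not in query),
                       key=lambda t: (-centroid[t], t))[:m]
    expanded = {t: alpha * w + beta * centroid.get(t, 0.0)
                for t, w in query.items()}
    for term in new_terms:
        expanded[term] = beta * centroid[term]
    return expanded


if __name__ == "__main__":
    query = {"bank": 1.0}
    best = [{"bank": 1.0, "bund": 2.0}, {"bund": 1.0, "zins": 1.0}]
    print(rocchio_expand(query, best, m=1))
```
        </preformat>
        <p>In the experiments reported here, m ranges from 10 to 175 and k from 3 to 10, depending on the collection.</p>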
      </sec>
      <sec id="sec-5-3">
        <title>Mean average precision French Spanish 52 queries 57 queries Query TD</title>
        <p>The results shown in Table 6 (our best results) indicate that the optimal parameter setting seems
to be collection-dependent. Moreover, performance improvement also seems to be collection-dependent (or
language-dependent), with no improvement for the English corpus yet an increase of 8.55% for the Spanish
corpus (from a mean average precision of 51.71 to 56.13), 9.85% for the French corpus (from 48.41 to 53.18),
12.91% for the Italian language (41.05 to 46.35) and 13.26% for the German collection (from 41.25 to 46.72,
combined model, Table 6b).</p>
      </sec>
      <sec id="sec-5-4">
        <title>Mean average precision</title>
        <p>For the English, French, Spanish, Italian and Russian languages, we assumed that the n-gram indexing and
word-based document representation approaches are distinct and independent sources of evidence regarding the
content of documents. For the German, Dutch, Swedish and Finnish languages, we added the decompounding
indexing approach to our document (and query) representation scheme.</p>
        <p>
          In order to combine these two and three indexing schemes respectively, we evaluated various fusion
operators, as suggested by
          <xref ref-type="bibr" rid="ref6">Fox and Shaw [Fox 1994</xref>
          ]. Table 7 gives their precise description. For example,
the combSUM operator indicates that the combined document score (or the final retrieval status value) is simply
the sum of the retrieval status values (RSVk) of the corresponding document Dk computed by each single
indexing scheme. CombNBZ specifies that we multiply the sum of the document scores by the number of
retrieval schemes that are able to retrieve the corresponding document. In Table 7, we can see that both the
combRSV% and combRSVnorm operators apply a normalization procedure when combining document scores. When
combining the retrieval status values (RSVk) for various indexing schemes, we may multiply the document score
by a constant ai (usually equal to 1) in order to favor the ith more efficient retrieval scheme. In addition to using
these data fusion operators, we also considered the round-robin approach, whereby we take one document in turn
from each individual list and remove duplicates, keeping the most highly ranked instance.
        </p>
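        <p>The fusion operators and the round-robin merge can be sketched in Python as follows. This is an illustrative sketch, not the code used for the official runs: the function names fuse and round_robin are hypothetical, only the schemes that actually retrieved a document contribute to its score, and retrieval status values are assumed to be positive (so that the division by the maximum RSV in combRSV% is defined).</p>
        <preformat>
```python
OPERATORS = ("combMAX", "combMIN", "combSUM", "combANZ", "combNBZ",
             "combRSV%", "combRSVnorm")


def fuse(runs, operator="combSUM", weights=None):
    """Merge result lists; runs is a list of dicts doc_id -> RSV (one per scheme)."""
    if operator not in OPERATORS:
        raise ValueError("unknown operator: " + operator)
    weights = weights or [1.0] * len(runs)        # the constants a_i
    per_doc = {}
    for a, run in zip(weights, runs):
        if not run:
            continue
        lo, hi = min(run.values()), max(run.values())
        for doc, rsv in run.items():
            if operator == "combRSV%":
                rsv = rsv / hi
            elif operator == "combRSVnorm":
                rsv = (rsv - lo) / (hi - lo) if hi > lo else 0.0
            per_doc.setdefault(doc, []).append(a * rsv)
    fused = {}
    for doc, vals in per_doc.items():
        nonzero = sum(1 for v in vals if v != 0)
        total = sum(vals)
        if operator == "combMAX":
            fused[doc] = max(vals)
        elif operator == "combMIN":
            fused[doc] = min(vals)
        elif operator == "combANZ":
            fused[doc] = total / nonzero if nonzero else 0.0
        elif operator == "combNBZ":
            fused[doc] = total * nonzero
        else:                                     # combSUM, combRSV%, combRSVnorm
            fused[doc] = total
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


def round_robin(ranked_lists):
    """Take one document in turn from each list, skipping duplicates."""
    merged, seen = [], set()
    depth = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(depth):
        for lst in ranked_lists:
            if len(lst) > rank and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged


if __name__ == "__main__":
    runs = [{"d1": 4.0, "d2": 2.0}, {"d2": 3.0, "d3": 1.0}]
    for op in OPERATORS:
        print(op, fuse(runs, op))
    print(round_robin([["d1", "d2"], ["d2", "d3"]]))
```
        </preformat>
        <p>Because the raw scores of different indexing schemes are not necessarily comparable, the two normalizing operators rescale each list before summing, which is the main difference with combSUM in this sketch.</p>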
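        <table-wrap>
          <label>Table 7</label>
          <table>
            <thead>
              <tr><th>Operator</th><th>Combined score</th></tr>
            </thead>
            <tbody>
              <tr><td>combMAX</td><td>MAX (ai · RSVk)</td></tr>
              <tr><td>combMIN</td><td>MIN (ai · RSVk)</td></tr>
              <tr><td>combSUM</td><td>SUM (ai · RSVk)</td></tr>
              <tr><td>combANZ</td><td>SUM (ai · RSVk) / # of nonzero (RSVk)</td></tr>
              <tr><td>combNBZ</td><td>SUM (ai · RSVk) * (# of nonzero (RSVk))</td></tr>
              <tr><td>combRSV%</td><td>SUM (ai · (RSVk / MAXRSV))</td></tr>
              <tr><td>combRSVnorm</td><td>SUM [ai · ((RSVk - MINRSV) / (MAXRSV - MINRSV))]</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-5">
        <title>Query TD</title>
        <p>Table 8a: Mean average precision using different combination operators (ai = 1, with blind-query expansion)</p>
        <p>Models compared: Okapi expand doc/term, Prosit expand doc/term, combMAX, combMIN, combSUM, combANZ, combNBZ, combRSV%, combRSVnorm, round-robin.</p>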
      </sec>
      <sec id="sec-5-6">
        <title>Run name Language Query</title>
      </sec>
      <sec id="sec-5-7">
        <title>UniNEfr French</title>
      </sec>
      <sec id="sec-5-8">
        <title>UniNEfr2 French</title>
      </sec>
      <sec id="sec-5-9">
        <title>UniNEsp Spanish</title>
      </sec>
      <sec id="sec-5-10">
        <title>UniNEsp2 Spanish</title>
      </sec>
      <sec id="sec-5-11">
        <title>UniNEde German</title>
      </sec>
      <sec id="sec-5-12">
        <title>UniNEde2 German</title>
      </sec>
      <sec id="sec-5-13">
        <title>UniNEit</title>
      </sec>
      <sec id="sec-5-14">
        <title>UniNEit2</title>
      </sec>
      <sec id="sec-5-15">
        <title>UniNEnl Italian</title>
      </sec>
      <sec id="sec-5-16">
        <title>Italian</title>
      </sec>
      <sec id="sec-5-17">
        <title>Dutch</title>
      </sec>
      <sec id="sec-5-18">
        <title>UniNEnl2 Dutch</title>
      </sec>
      <sec id="sec-5-19">
        <title>UniNEsv Swedish</title>
      </sec>
      <sec id="sec-5-20">
        <title>UniNEsv2 Swedish</title>
      </sec>
      <sec id="sec-5-21">
        <title>UniNEfi Finnish</title>
      </sec>
      <sec id="sec-5-22">
        <title>UniNEfi2 Finnish</title>
      </sec>
      <sec id="sec-5-23">
        <title>UniNEru Russian</title>
      </sec>
      <sec id="sec-5-24">
        <title>UniNEru1 Russian</title>
      </sec>
      <sec id="sec-5-25">
        <title>UniNEru2 Russian</title>
      </sec>
      <sec id="sec-5-26">
        <title>UniNEru3 Russian TD TD TD</title>
        <p>TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TD
TDN
TDN
TDN
TDN
word Pro+Oka
decomp. Pro+Oka
5-gram Pro+Oka
word Pro+Oka
decomp. Pro+Oka
4-gram Pro+Oka
word Pro+Oka
decomp. Pro+Oka
4-gram Pro+Oka
Index
word
word
word
word
word
word
word
word
word
decomp.
5-gram
word
word
word
word
word
decomp.
5-gram
word
decomp.
5-gram
word
decomp.
5-gram
word
decomp.
5-gram
word
word
word
word
5-gram
5-gram
4-gram
4-gram
word
word</p>
      </sec>
      <sec id="sec-5-27">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-28">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-29">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-30">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-31">
        <title>Prosit Prosit Prosit</title>
      </sec>
      <sec id="sec-5-32">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-33">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-34">
        <title>Okapi</title>
        <p>Okapi
Prosit</p>
      </sec>
      <sec id="sec-5-35">
        <title>Okapi Okapi Prosit</title>
      </sec>
      <sec id="sec-5-36">
        <title>Prosit</title>
        <p>Prosit
Prosit</p>
      </sec>
      <sec id="sec-5-37">
        <title>Prosit</title>
        <p>Prosit
Prosit</p>
      </sec>
      <sec id="sec-5-38">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-39">
        <title>Okapi</title>
        <p>Prosit</p>
      </sec>
      <sec id="sec-5-40">
        <title>Okapi</title>
        <p>Prosit
Okapi
Prosit</p>
      </sec>
      <sec id="sec-5-41">
        <title>Okapi</title>
        <p>Prosit</p>
        <p>Query expansion
10 best docs / 10 terms
5 best docs / 30 terms
10 best docs / 10 terms
5 best docs / 30 terms
10 best docs / 10 terms
10 best docs / 10 terms
5 best docs / 10 terms
10 best docs / 10 terms
5 best docs / 20 terms
10 best docs / 40 terms
5 best docs / 175 terms
5 best docs / 20 terms
10 best docs / 40 terms
5 best docs / 175 terms
10 best docs / 20 terms
10 best docs / 50 terms
10 best docs / 20 terms
10 best docs / 50 terms
10 best docs / 20 terms
10 best docs / 20 terms
10 best docs / 150 terms
3 best docs / 15 terms
3 best docs / 15 terms
3 best docs / 40 terms
5 best docs / 30 terms
5 best docs / 50 terms
5 best docs / 30 terms
5 best docs / 30 terms
5 best docs / 15 terms
3 best docs / 125 terms
5 best docs / 30 terms
5 best docs / 15 terms
3 best docs / 125 terms
10 best docs / 20 terms
5 best docs / 30 terms
10 best docs / 20 terms
5 best docs / 30 terms
10 best docs / 50 terms
5 best docs / 40 terms
10 best docs / 50 terms
5 best docs / 40 terms</p>
      </sec>
      <sec id="sec-5-42">
        <title>10 best docs / 10 terms 5 best docs / 20 terms combined round-robin</title>
        <p>RSV%</p>
      </sec>
      <sec id="sec-5-43">
        <title>RSVnorm</title>
      </sec>
      <sec id="sec-5-44">
        <title>RSVnorm</title>
      </sec>
      <sec id="sec-5-45">
        <title>RSVnorm</title>
        <p>sumRSV
RSV%
sumRSV
sumRSV
RSV%
RSVnorm
sumRSV
sumRSV
sumRSV
sumRSV
sumRSV
sumRSV
52.61
54.50</p>
      </sec>
      <sec id="sec-5-46">
        <title>Prosit word doc/term</title>
        <p>Prosit decomp doc/term
Prosit n-gram doc/term
combMAX
combMIN
combSUM
combANZ
combNBZ
combRSV%
combRSVnorm
round-robin</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this fourth CLEF evaluation campaign, we proposed a general stopword list and stemming procedure for
eight European languages (excluding English). Currently it is not clear whether a stemming procedure such as the
one suggested, which removes only inflectional suffixes from nouns and adjectives, could produce better retrieval
effectiveness than a stemming approach that takes both inflectional and derivational suffixes into account. We
also suggested a simple decompounding approach for the German, Dutch, Swedish and Finnish languages. In
order to achieve better retrieval performance, we used a data fusion approach, which requires that the document (and
query) representation be based on two or three indexing schemes.</p>
      <p>The author would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART
system. This research was supported by the Swiss National Science Foundation under grant #21-66 742.01.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpineto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Italian monolingual information retrieval with PROSIT</article-title>
          .
          <source>In Proceedings of CLEF-2002</source>
          , (pp.
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          ). Roma.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G</given-names>
          </string-name>
          . &amp; van
          <string-name>
            <surname>Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM TOIS</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ),
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Ripplinger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Stemming and decompounding for German text retrieval</article-title>
          .
          <source>In Proceedings 25th European Conference in IR</source>
          (pp.
          <fpage>177</fpage>
          -
          <lpage>192</lpage>
          ). Berlin: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>New retrieval approaches using SMART</article-title>
          .
          <source>In Proceedings of TREC'4</source>
          , (pp.
          <fpage>25</fpage>
          -
          <lpage>48</lpage>
          ). Gaithersburg: NIST Publication #500-236.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Cross-language retrieval experiments at CLEF-2002</article-title>
          .
          <source>In Proceedings of CLEF-2002</source>
          , (pp.
          <fpage>5</fpage>
          -
          <lpage>20</lpage>
          ). Roma.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Combination of multiple searches</article-title>
          .
          <source>In Proceedings TREC-2</source>
          , (pp.
          <fpage>243</fpage>
          -
          <lpage>249</lpage>
          ). Gaithersburg: NIST Publication #500-215.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Pohlmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>Viewing stemming as recall enhancement</article-title>
          .
          <source>In Proceedings of the ACM-SIGIR'96</source>
          , (pp.
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          ). New York: The ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Lovins</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          (
          <year>1968</year>
          ).
          <article-title>Development of a stemming algorithm</article-title>
          .
          <source>Mechanical Translation and Computational Linguistics</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          (
          <year>1980</year>
          ).
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          ,
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Experimentation as a way of life: Okapi at TREC</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Report on CLEF-2002 experiments: Combining multiple sources of evidence</article-title>
          .
          <source>In Proceedings of CLEF-2002</source>
          , (pp.
          <fpage>31</fpage>
          -
          <lpage>46</lpage>
          ). Roma.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hindle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>AT&amp;T at TREC-7</article-title>
          .
          <source>In Proceedings of TREC-7</source>
          , (pp.
          <fpage>239</fpage>
          -
          <lpage>251</lpage>
          ). Gaithersburg: NIST Publication #500-242.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>