<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Stemming Approaches for East European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ljiljana Dolamic</string-name>
          <email>Ljiljana.Dolamic@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <email>Jacques.Savoy@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Natural Language Processing with East European Languages</institution>
          ,
          <addr-line>Stemmer, Stemming Strategy, Czech Language, Hungarian Language, Bulgarian Language</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Neuchatel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In our participation in this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the Czech language, in order to produce better retrieval effectiveness than that of the language-independent approach (n-gram). Based on the stemming strategy used with other languages, we propose two light stemmers for this Slavic language, and a third one based on a more aggressive suffix-stripping scheme that also removes some derivational suffixes. Our second objective is to obtain a better picture of the relative merit of various search engines in exploring Hungarian and Bulgarian documents. Moreover, for the Bulgarian language we developed a new and more aggressive stemmer. To evaluate these solutions we use various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language model (LM) approaches, together with the classical tf.idf vector-processing approach. Our experiments tend to show that for the Bulgarian language, removing certain frequently used derivational suffixes may improve mean average precision. For the Hungarian corpus, applying an automatic decompounding procedure improves the MAP. For the Czech language, a comparison between a light (inflectional only) stemmer and a more aggressive stemmer that removes both inflectional and some derivational suffixes reveals small performance differences. For this language only, the performance difference between word-based and 4-gram indexing strategies is also rather small, while for the Hungarian and Bulgarian corpora a word-based approach tends to produce better MAP.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        During the last few years, the IR group at the University of Neuchatel has been involved in designing,
implementing and evaluating IR systems for various natural languages, including both European
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2">(Savoy &amp;
Abdou, 2007)</xref>
        and popular Asian
        <xref ref-type="bibr" rid="ref11">(Savoy, 2005)</xref>
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2">(Abdou &amp; Savoy, 2007a)</xref>
        languages (namely Chinese,
Japanese, and Korean). In this context our main objective is to promote effective monolingual IR in these
languages. For our participation in the CLEF 2007 evaluation campaign we decided to revise our stemming
strategy by including some very frequently used derivational suffixes. When defining our stemming rules,
however, we still focus only on nouns and adjectives.
      </p>
      <p>The rest of this paper is organized as follows: Section 2 describes the main characteristics of the CLEF-2007
test-collections. Section 3 outlines the main aspects of our stopword lists and stemming procedures. Section 4
analyses the principal features of different indexing and search strategies, and evaluates their use with the
available corpora. The data fusion approaches adopted in our experiments are explained in Section 5, and
Section 6 presents our official results.</p>
      <p>The corpora used in our experiments include newspaper articles, namely Magyar Hirlap (2002, Hungarian),
Sega (2002, Bulgarian), Standart (2002, Bulgarian), Novinar (2002, a new Bulgarian sub-collection in CLEF
2007), Mladá fronta Dnes (2002, Czech), and Lidove Noviny (2002, Czech). As shown in Table 1, the Bulgarian
corpus is relatively large compared to the others, both in size and in number of documents. As for average
article length, Czech articles are the longest (212.6 indexing terms), while for the Bulgarian (135.9) and
Hungarian (152.3) corpora the lengths are relatively similar. It is interesting to note that even though the
Hungarian collection is the smallest (105 MB), it contains a larger number of distinct indexing terms (191,738,
computed after stemming) than the Bulgarian and Czech corpora.</p>
      <p>During the indexing process we retained only the following logical sections from the original documents:
&lt;TITLE&gt;, &lt;LEAD&gt;, and &lt;TEXT&gt;. From the topic descriptions we automatically removed certain phrases such
as “Relevant document report …”, “Подходящ е всеки документ” or “Keressünk olyan cikkeket, amelyek …”,
etc. All our runs were fully automatic.</p>
      <p>As shown in Appendix 2, the available topics cover various subjects (e.g., Topic #409: “Bali Car
Bombing,” Topic #414: “Beer Festivals,” Topic #436: “VIP Divorces,” or Topic #443: “World Swimming
Records”), including both regional (Topic #445: “Prince Harry and Drugs”) and more international coverage.</p>
      <sec id="sec-1-1">
        <title>Collection Statistics</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Test-collection statistics (several Czech values were not recovered from the source)</p></caption>
          <table>
            <thead>
              <tr><th/><th>Bulgarian</th><th>Hungarian</th><th>Czech</th></tr>
            </thead>
            <tbody>
              <tr><td>Size (in MB)</td><td>261 MB</td><td>105 MB</td><td>178 MB</td></tr>
              <tr><td># of documents</td><td>87,281</td><td>49,530</td><td>81,735</td></tr>
              <tr><td># of distinct terms</td><td>169,394</td><td>191,738</td><td>194,500</td></tr>
              <tr><th colspan="4">Number of distinct indexing terms per document</th></tr>
              <tr><td>Mean</td><td>99.5</td><td>105.4</td><td/></tr>
              <tr><td>Standard deviation</td><td>93.86</td><td>91.08</td><td/></tr>
              <tr><td>Median</td><td>70</td><td>75</td><td/></tr>
              <tr><td>Maximum</td><td>1,193</td><td>1,284</td><td/></tr>
              <tr><td>Minimum</td><td>0</td><td>2</td><td/></tr>
              <tr><th colspan="4">Number of indexing terms per document</th></tr>
              <tr><td>Mean</td><td>135.9</td><td>152.3</td><td>212.6</td></tr>
              <tr><td>Standard deviation</td><td>143.58</td><td>145.86</td><td/></tr>
              <tr><td>Median</td><td>91</td><td>102</td><td/></tr>
              <tr><td>Maximum</td><td>2,837</td><td>6,008</td><td/></tr>
              <tr><td>Minimum</td><td>0</td><td>5</td><td/></tr>
              <tr><td>Number of queries</td><td>50</td><td>50</td><td/></tr>
              <tr><td>Number rel. items</td><td>1,012</td><td>911</td><td/></tr>
              <tr><td>Mean rel. / request</td><td>20.24</td><td>18.22</td><td/></tr>
              <tr><td>Standard deviation</td><td>14.23</td><td>14.08</td><td/></tr>
              <tr><td>Median</td><td>17.5</td><td>14</td><td/></tr>
              <tr><td>Maximum</td><td>62 (T#438)</td><td>66 (T#415)</td><td/></tr>
              <tr><td>Minimum</td><td>2 (T#419)</td><td>1 (T#411)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Stopword Lists and Stemming Procedures</title>
      <p>
        During this evaluation campaign, our stopword list and stemmer for Hungarian were the same as those used in
our CLEF 2006 participation
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2">(Savoy &amp; Abdou, 2007)</xref>
        . For this language our suggested stemmer mainly
includes inflectional removals (gender, number and 23 grammatical cases, as for example in “házakat” → “ház”
(house)) as well as some pronouns (e.g., “házamat” (my house) → “ház”) and a few derivational suffixes (e.g.,
“temetés” (burial) → “temet” (to bury)). See
        <xref ref-type="bibr" rid="ref12 ref13">Savoy (2007)</xref>
        for more information. Moreover, the Hungarian
language uses compound constructions (e.g., “hétvégé” (weekend) = “hét” (week / seven) + “vég” (end)). In
order to increase the matching possibilities between search keywords and document representations, we
automatically decompounded Hungarian words using our decompounding algorithm
        <xref ref-type="bibr" rid="ref11">(Savoy, 2004)</xref>
        , leaving
both compound words and their component parts in the documents and queries. The stopword list retained
contains 737 words. The stemmer and stopword list are freely available at www.unine.ch/info/clef.
      </p>
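      <p>The decompounding algorithm itself is described in (Savoy, 2004). Purely as an illustration of the kind of lexicon-based splitting discussed above, the following sketch greedily segments a word into known parts; the toy lexicon, the recursion and the minimum part length of three characters are our own assumptions, not the actual implementation:</p>

```python
def decompound(word, lexicon, min_len=3):
    """Split a compound into parts found in the lexicon, preferring the
    longest head; return [] when no full segmentation exists."""
    if len(word) < min_len:
        return []
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            if tail in lexicon:
                return [head, tail]
            rest = decompound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []

# toy lexicon (accents dropped for simplicity)
lexicon = {"het", "veg"}
print(decompound("hetveg", lexicon))   # ['het', 'veg']
```

      <p>As described above, both the compound word and its parts would then be kept in the documents and queries.</p>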
      <p>For the Bulgarian language we decided to modify the transliteration procedure we used previously to convert
Cyrillic characters into Latin letters. By correcting an error and adapting it for the new transliteration scheme,
we modified last year’s stemmer and denoted it the light Bulgarian stemmer. In this language, definite articles
and plural forms are represented by suffixes and the general noun pattern is the following:
&lt;stem&gt; &lt;plural&gt; &lt;article&gt;. Our light stemmer contains eight rules for removing plurals and five for removing
articles. Additionally we applied seven grammatical normalization rules plus three others to remove
palatalization (changing a stem's final consonant when followed by a suffix beginning with certain vowels), as is
very common in most Slavic languages (see Appendix 3 for all the rules). We also proposed a new and more
aggressive Bulgarian stemmer that also removes some derivational suffixes (e.g., “страшен” (fearful) →
“страх” (fear)). The stopword list used for this language contains 309 words, somewhat larger than last
year’s (258 items).</p>
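      <p>As a minimal sketch only, the article- and plural-removal steps can be written as below; this covers just a handful of the rules listed in Appendix 3 (the transliteration step and the grammatical and palatalization rules are omitted, and the sample words are our own):</p>

```python
VOWELS = set("аеиоуъюя")

def remove_article(word):
    # Definite-article suffixes (a subset of the five rules in Appendix 3)
    if word.endswith("ът"):                       # masculine
        return word[:-2]
    if word.endswith("ят"):                       # masculine
        if len(word) > 2 and word[-3] in VOWELS:  # vowel + "ят" -> "й"
            return word[:-2] + "й"
        return word[:-2]
    for suf in ("то", "те", "та"):                # neuter / plural / feminine
        if word.endswith(suf):
            return word[:-2]
    return word

def remove_plural(word):
    # Two of the eight plural rules, for illustration
    if word.endswith("ища"):
        return word[:-3]
    if word.endswith("ове"):                      # masculine plural
        return word[:-3]
    return word

def light_stem(word):
    return remove_plural(remove_article(word))

print(light_stem("градът"))   # -> град
```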
      <p>For the Czech language, we proposed a new stopword list containing 467 forms (determinants, prepositions,
conjunctions, pronouns, and some very frequent verb forms). We also designed and implemented three Czech
stemmers. The first one is a light stemmer that removes only those inflectional suffixes attached to nouns or
adjectives in order to conflate to the same stem those morphological variations related to gender (feminine,
neutral vs. masculine), number (plural vs. singular) and various grammatical cases (seven in the Czech
language). For example, the noun “město” (city) appears as such in its singular form (nominative, vocative or
accusative) but varies with other cases, “města” (genitive), “městu” (dative), “městem” (instrumental) or
“městě” (locative). The corresponding plural forms are “města”, “měst”, “městům”, “městy” or “městech”. In
the Czech language all nouns have a gender, and with a few exceptions (indeclinable borrowed words), they are
declined for both number and case. For Czech nouns, the general pattern is the following:
&lt;stem&gt; &lt;possessive&gt; &lt;case&gt; in which &lt;case&gt; ending includes both gender and number. Adjectives are
declined to match the gender, case and number of the nouns to which they are attached. To remove these
various case endings from nouns and adjectives we devised 52 rules, and then before returning the computed
stem, we added five normalization rules in order to control palatalization and certain vowel changes in the basic
stem (see Appendix 4 for all details).</p>
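      <p>To illustrate how such a rule set can be applied, the sketch below strips a possessive and then a case ending, trying longer suffixes first. Only a small subset of the 52 case rules is shown, and the longest-match ordering and minimum-stem-length guard are our own illustrative assumptions (Appendix 4 lists the full rule set):</p>

```python
# A small subset of the case-ending rules; match the longest suffix first
CASE_SUFFIXES = sorted(
    ["atech", "ech", "ich", "ích", "ách", "ami", "ové", "ovi",
     "em", "ům", "mi", "ou", "a", "e", "i", "o", "u", "y", "ě"],
    key=len, reverse=True)
POSSESSIVE_SUFFIXES = ["ov", "in", "ův"]

def czech_light_stem(word, min_stem=2):
    """Remove a possessive suffix, then one case ending."""
    for suf in POSSESSIVE_SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            word = word[:-len(suf)]
            break
    for suf in CASE_SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:-len(suf)]
    return word

for form in ["města", "městu", "městem", "městech"]:
    print(czech_light_stem(form))   # each -> měst
```

      <p>Conflating all these case forms to the single stem “měst” is exactly the behaviour described for “město” above.</p>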
      <p>Our second Czech stemmer denoted “light+” also includes rules for removing comparative forms from
adjectives (e.g., “krásný”, ”krásnější”, ”nejkrásnější” → “krásn” (beautiful, more beautiful, the most beautiful)).
We do not however expect this light stemmer variation to result in any significant changes in retrieval
performance.</p>
      <p>Finally, we designed and implemented a more aggressive stemmer that includes certain rules to remove
frequently used derivational suffixes (e.g., “členství”(membership) → “člen”(member)). In applying this third
more aggressive stemmer (denoted “derivational”) we hope to improve mean average precision (MAP). Finally
and unlike other languages, we do not remove the diacritics when building Czech stemmers.</p>
    </sec>
    <sec id="sec-3">
      <title>4 IR models and Evaluation</title>
      <sec id="sec-3-1">
        <title>4.1. Indexing and Searching Strategies</title>
        <p>In order to obtain high MAP values, we might adopt different weighting schemes applied to terms that
occur in the documents or in the query. This weighting allows us to account for term occurrence
frequency (denoted tfij for indexing term tj in document Di), as well as inverse document frequency
(denoted idfj). Moreover, we might normalize each indexing weight using the cosine, yielding the classical tf.idf
formulation, rather than the more recent normalization approaches that account for document length.</p>
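        <p>As an illustration, one common instantiation of such cosine-normalized tf.idf weighting is sketched below; the exact tf and idf variants shown here are assumptions for the example, since the text does not spell them out:</p>

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Cosine-normalized tf.idf document vector.
    df maps term -> document frequency; n_docs is the collection size."""
    tf = Counter(tokens)
    w = {t: (1 + math.log(f)) * math.log(n_docs / df[t])
         for t, f in tf.items() if df.get(t)}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w
```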
        <p>
          In addition to this vector-space approach, we also considered probabilistic models such as the Okapi (or
BM25) (Robertson et al. 2000). As a second probabilistic approach, we implemented three variants of the DFR
(Divergence from Randomness) family of models suggested by
          <xref ref-type="bibr" rid="ref3">Amati &amp; van Rijsbergen (2002</xref>
          ). In this
framework, the indexing weight wij attached to term tj in document Di combines two information measures as
follows:
        </p>
        <p>wij = Inf1ij · Inf2ij = –log2[Prob1ij(tf)] · (1 – Prob2ij(tf))</p>
        <p>As a first model, we implemented the PB2 scheme, defined by the following equations:</p>
        <p>Inf1ij = –log2[(e^(–λj) · λj^tfij) / tfij!]   with λj = tcj / n   (1)</p>
        <p>Prob2ij = 1 – [(tcj + 1) / (dfj · (tfnij + 1))]   with tfnij = tfij · log2[1 + ((c · mean dl) / li)]   (2)</p>
        <p>where tcj indicates the number of occurrences of term tj in the collection, li the length (number of indexing
terms) of document Di, mean dl the average document length, n the number of documents in the corpus, and c a
constant (the corresponding values are given in Appendix 1).</p>
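        <p>In code, Equations 1 and 2 can be combined as follows; this is a direct sketch of the formulas as written above, with Inf1 computed in log space to avoid numerical underflow (the sample parameter values in the test are ours):</p>

```python
import math

def pb2_weight(tf_ij, tc_j, df_j, n, l_i, mean_dl, c=1.0):
    """DFR PB2 weight w_ij = Inf1_ij * (1 - Prob2_ij), per Equations 1-2.
    tf_ij: term frequency in the document; tc_j: collection frequency;
    df_j: document frequency; n: number of documents; l_i: document
    length; mean_dl: average document length; c: a constant."""
    lam = tc_j / n
    tfn = tf_ij * math.log2(1.0 + (c * mean_dl) / l_i)
    # Inf1 = -log2(Poisson(tf; lambda)), evaluated term by term in log space
    log2_poisson = (-lam * math.log2(math.e)
                    + tf_ij * math.log2(lam)
                    - math.log2(math.factorial(tf_ij)))
    inf1 = -log2_poisson
    prob2 = (tc_j + 1.0) / (df_j * (tfn + 1.0))
    return inf1 * (1.0 - prob2)
```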
        <p>For the second model, called GL2, Prob1ij is given by Equation 3 and Prob2ij by Equation 4, as follows:</p>
        <p>Prob1ij = [1 / (1 + λj)] · [λj / (1 + λj)]^tfnij   (3)</p>
        <p>Prob2ij = tfnij / (tfnij + 1)   (4)</p>
        <p>where λj and tfnij were defined previously.</p>
        <p>For the third model, called IneC2, the implementation is given by the following two equations:</p>
        <p>Inf1ij = tfnij · log2[(n + 1) / (ne + 0.5)]   with ne = n · [1 – [(n – 1) / n]^tcj]   (5)</p>
        <p>Prob2ij = 1 – [(tcj + 1) / (dfj · (tfnij + 1))]   (6)</p>
        <p>where n, tcj and tfnij were defined previously, and dfj indicates the number of documents in which the term tj
occurs.</p>
        <p>
          Finally, we also considered an approach based on a statistical language model (LM)
          <xref ref-type="bibr" rid="ref6">(Hiemstra, 2000; 2002)</xref>
          ,
known as a non-parametric probabilistic model (the Okapi and DFR are viewed as parametric models).
Probability estimates would thus not be based on any known distribution (e.g., as in Equation 1 or 3), but rather
be estimated directly based on occurrence frequencies in document Di or corpus C. Within this language model
paradigm, various implementations and smoothing methods might be considered, although in this study we
adopted a model proposed by
          <xref ref-type="bibr" rid="ref7">Hiemstra (2002)</xref>
          , as described in Equation 7, combining an estimate based on
document (P[tj | Di]) and on corpus (P[tj | C]).
        </p>
        <p>P[Di | Q] = P[Di] · ∏tj∈Q [λj · P[tj | Di] + (1 – λj) · P[tj | C]]   (7)
with P[tj | Di] = tfij / li and P[tj | C] = dfj / lc, with lc = ∑k dfk,
where λj is a smoothing factor (constant for all indexing terms tj, and usually fixed at 0.35) and lc an estimate of
the size of the corpus C.</p>
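        <p>Equation 7 can be sketched as follows; we score in log space to avoid underflow on long queries (a standard trick, not specified in the text), and the uniform document prior is an assumption:</p>

```python
import math

def lm_score(query, doc_tf, doc_len, df, corpus_len, lam=0.35, prior=1.0):
    """Hiemstra's language model (Equation 7), in log space.
    doc_tf maps term -> tf in the document; df maps term -> document
    frequency; corpus_len = sum of all df values (the lc estimate)."""
    score = math.log(prior)
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_corpus = df.get(t, 0) / corpus_len
        p = lam * p_doc + (1.0 - lam) * p_corpus
        if p == 0.0:          # term unseen in the whole corpus: skip it
            continue
        score += math.log(p)
    return score
```

        <p>A document containing the query terms receives a higher score than one that must rely on the corpus estimate alone, which is the intended smoothing behaviour.</p>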
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Overall Evaluation</title>
        <p>To measure retrieval performance, we adopted MAP values computed on the basis of 1,000 retrieved items
per request, as calculated with the new TREC-EVAL program. Using this evaluation tool, some
differences may occur with respect to the values computed according to the official measure (the latter always
takes all 50 queries into account, while in our presentation we do not account for queries having no relevant
items). In the following tables, the best performance under a given condition (with the same indexing scheme
and the same collection) is listed in bold type.</p>
        <sec id="sec-3-2-1">
          <title>Query</title>
          <p>Stemmer / indexing unit
Model \ # of queries
Okapi
DFR GL2
DFR PB2
DFR IneC2
LM (λ=0.35)
tf . idf
Average
% change over TD
% change
Bulgarian Bulgarian</p>
          <p>TD TDN
light / word light / word
50 queries 50 queries
0.3155 0.3462
0.3307 0.3653
0.3266 0.3476
0.3423 0.3696
0.3175 0.3580
0.2103 0.2264
0.3265 0.3573
+9.4%
-5.8%</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Mean average precision</title>
          <p>Bulgarian Bulgarian Bulgarian Bulgarian</p>
          <p>TD TDN TD TDN
deriv./word deriv./word none/4-gram none/4-gram
50 queries 50 queries 50 queries 50 queries
0.3425 0.3720 0.3022 0.3342
0.3541 0.3909 0.3100 0.3250
0.3394 0.3637 0.2960 0.3116
0.3606 0.3862 0.3156 0.3409
0.3368 0.3782 0.2868 0.3294
0.2143 0.2293 0.2105 0.2271
0.3467 0.3782 0.3021 0.3282</p>
          <p>+9.09% +8.6%
shows that the best performing IR model corresponds to the DFR IneC2 model with all stemming approaches or
query sizes.</p>
          <p>In the last rows we report the average MAP over these five IR models, together with the percentage of
variation compared to the medium (TD) query formulation or to the derivational stemmer (TD queries). As
depicted in these last rows, increasing the query size improves the MAP (by around +9%). According to the
average performance, the best indexing approach seems to be a word-based approach using our derivational
stemmer. In this case, the MAP with the TD query formulation is, on average, 0.3467 vs. 0.3021 for the 4-gram
approach, a relative difference of 12.9%. The performance difference with the light stemmer is smaller on
average (0.3467 vs. 0.3265), a relative difference of 5.8%.</p>
          <p>The evaluations done on the Czech language are depicted in Table 4. In this case, we compared three
stemmers and the 4-gram indexing approach (without stemming). The best performing IR models correspond
to either the DFR GL2 or the Okapi probabilistic model. The performance differences between these two IR
models are usually rather small.</p>
          <p>As shown in the last three lines of Table 4, the best indexing strategy seems to be word-based indexing
using the light stemming approach. As expected, performance differences between the “light” and
“light+” stemmers are rather small (2.14% when using the TD query formulation). Moreover, the performance
differences between the 4-gram and the light stemming approach do not seem to be statistically significant (on
average, 0.3068 vs. 0.3057 with the TD query formulation). As for the other corpora, increasing the query size
improves the MAP (around +10%).</p>
          <p>
            An analysis showed that pseudo-relevance feedback (PRF or blind-query expansion) seemed to be a useful
technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach (denoted “Roc”)
            <xref ref-type="bibr" rid="ref4">(Buckley et al., 1996)</xref>
            with α = 0.75, β = 0.75, whereby the system was allowed to add m terms extracted from
the k best-ranked documents to the original query. From our previous experiments we learned that this type
of blind query expansion strategy does not always work well. More particularly, we believe that including terms
occurring frequently in the corpus (because they also appear in the top-ranked documents) may introduce more
noise, and thus be an ineffective means of discriminating between relevant and non-relevant items
            <xref ref-type="bibr" rid="ref9">(Peat &amp;
Willett, 1991)</xref>
            . Consequently we chose to also apply our idf-based query expansion model (denoted “idf” in
Tables 9 and 10)
            <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2">(Abdou &amp; Savoy, 2007b)</xref>
            .
          </p>
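          <p>The Rocchio expansion step described above can be sketched as follows; the vector representation and the centroid computation over the k top-ranked documents are standard, while the sample weights in the test are our own (the α and β values are those given above):</p>

```python
from collections import defaultdict

def rocchio_expand(query_weights, top_doc_vectors, m, alpha=0.75, beta=0.75):
    """Blind query expansion (Rocchio): add the m best terms extracted
    from the top-ranked document vectors to the original query."""
    k = len(top_doc_vectors)
    new_q = defaultdict(float)
    for t, w in query_weights.items():
        new_q[t] += alpha * w
    # centroid of the k best-ranked documents
    centroid = defaultdict(float)
    for vec in top_doc_vectors:
        for t, w in vec.items():
            centroid[t] += w / k
    # keep only the m most heavily weighted expansion terms
    for t, w in sorted(centroid.items(), key=lambda x: -x[1])[:m]:
        new_q[t] += beta * w
    return dict(new_q)
```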
          <p>To evaluate these propositions, we applied certain probabilistic models and enlarged the query by 20 to
150 terms (indexing words or n-grams) extracted from the 3 to 10 best-ranked articles within the Bulgarian
(Table 5), Hungarian (Table 6) and Czech (Table 7) corpora.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Query TD</title>
          <p>PRF using Rocchio</p>
        </sec>
        <sec id="sec-3-2-4">
          <title>IR Model / MAP k doc. / m terms</title>
        </sec>
        <sec id="sec-3-2-5">
          <title>Query TD</title>
          <p>PRF using Rocchio</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>IR Model / MAP k doc. / m terms</title>
        </sec>
        <sec id="sec-3-2-7">
          <title>Query TD</title>
          <p>PRF using Rocchio</p>
        </sec>
        <sec id="sec-3-2-8">
          <title>IR Model / MAP k doc. / m terms</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Data Fusion</title>
      <p>For the Bulgarian corpus (Table 5), improvements ranged from +1.4% (4-gram, Okapi model, 0.3022 vs.
0.3065) to +21.7% (LM model, 0.3368 vs. 0.4098). For the Hungarian collection (Table 6), the percentage
improvement varied from +6.1% (4-gram, Okapi model, 0.3445 vs. 0.3654) to +10.5% (LM model, 0.3913 vs.
0.4323). For the Czech language (Table 7), the percentages of variation ranged from -2.6% (4-gram, Okapi
model, 0.3401 vs. 0.3314) to +21.6% (DFR GL2 model, 0.3437 vs. 0.4179).</p>
      <p>
        It is assumed that combining different search models should improve retrieval effectiveness, due to the fact
that each document representation might not retrieve the same pertinent items and thus increase the overall recall
        <xref ref-type="bibr" rid="ref14">(Vogt &amp; Cottrell, 1999)</xref>
        . In the current study we combined three probabilistic models, representing both the
parametric (Okapi and DFR) and non-parametric (language model, or LM) approaches.
      </p>
      <sec id="sec-4-1">
        <title>Fusion Operators</title>
        <p>
          On the other hand, we
also combined both word-based and n-gram indexing strategies. To perform such combination we evaluated
various fusion operators (see Table 8 for a detailed list of their descriptions). The “Sum RSV” operator for
example indicates that the combined document score (or the final retrieval status value) is simply the sum of the
retrieval status value (RSVk) of the corresponding document Dk computed by each single indexing scheme
          <xref ref-type="bibr" rid="ref5">(Fox
&amp; Shaw, 1994)</xref>
          . Table 8 thus illustrates how both the “Norm Max” and “Norm RSV” apply a normalization
procedure when combining document scores. When combining the retrieval status value (RSVk) for various
indexing schemes and in order to favor certain more efficient retrieval schemes, we could multiply the document
score by a constant αi (usually equal to 1) reflecting the differences in retrieval performance.
        </p>
        <table-wrap id="tab8">
          <label>Table 8</label>
          <caption><p>Data fusion operators</p></caption>
          <table>
            <tbody>
              <tr><td>Sum RSV</td><td>SUM (αi · RSVk)</td></tr>
              <tr><td>Norm Max</td><td>SUM (αi · (RSVk / Maxi))</td></tr>
              <tr><td>Norm RSV</td><td>SUM [αi · ((RSVk – Mini) / (Maxi – Mini))]</td></tr>
              <tr><td>Z-Score</td><td>αi · [((RSVk – Meani) / Stdevi) + δi]   with δi = [(Meani – Mini) / Stdevi]</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In addition to using these data fusion operators, we also considered the round-robin approach, wherein we
took one document in turn from each individual list and removed any duplicates, retaining only the highest
ranking occurrence. Finally we suggest merging the retrieved documents according to the Z-Score, computed
for each result list. Within this scheme, for each ith result list we needed to compute the average RSVk value
(denoted Meani) and the standard deviation (denoted Stdevi). Based on these we could then normalize the
retrieval status value for each document Dk provided by the ith result list by computing the deviation of RSVk
with respect to the mean (Meani). In Table 8, Mini (Maxi) lists the minimal (maximal) RSV value in the ith
result list. Of course, we might also weight the relative contribution of each retrieval scheme by assigning a
different αi value to each retrieval model.</p>
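        <p>The Z-Score operator of Table 8 can be sketched as below; each result list maps a document identifier to its RSV, the αi weights default to 1, and the guard against a zero standard deviation is our own addition for degenerate lists:</p>

```python
import statistics

def zscore_fuse(result_lists, weights=None):
    """Combine several ranked lists with the Z-Score operator (Table 8).
    Each result list maps doc_id -> RSV; returns docs sorted by fused score."""
    weights = weights or [1.0] * len(result_lists)
    fused = {}
    for alpha, rsv in zip(weights, result_lists):
        mean = statistics.mean(rsv.values())
        stdev = statistics.pstdev(rsv.values()) or 1.0   # guard: flat list
        delta = (mean - min(rsv.values())) / stdev       # the delta_i shift
        for doc, score in rsv.items():
            fused[doc] = fused.get(doc, 0.0) + alpha * (
                (score - mean) / stdev + delta)
    return sorted(fused.items(), key=lambda x: -x[1])
```

        <p>The δi shift keeps every normalized score non-negative, so documents missing from one list are not unduly penalized relative to documents that merely scored at that list's minimum.</p>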
      </sec>
      <sec id="sec-4-2">
        <title>Language / Query Model</title>
      </sec>
      <sec id="sec-4-3">
        <title>LM &amp; PRF doc/term</title>
        <p>Okapi &amp; PRF doc/term
DFR &amp; PRF doc/term
Official run name</p>
      </sec>
      <sec id="sec-4-4">
        <title>Round-robin</title>
        <p>Sum RSV
Norm Max
Norm RSV
Z-Score
Bulgarian TD</p>
        <p>50 queries
Roc 10/50 0.4098
Roc 3/150 0.3169
idf 5/60 0.3750</p>
        <p>UniNEbg1
0.3747 (-8.6.%)
0.3841 (-6.3%)
0.4076 (-0.5%)
0.4069 (-0.7%)
0.4128 (+0.7%)</p>
      </sec>
      <sec id="sec-4-5">
        <title>Mean average precision (% of change) Bulgarian TDN Hungarian TD 50 queries 50 queries Roc 10/50 0.4418</title>
        <p>Roc 3/150 0.3406
idf 5/60 0.4038</p>
        <p>UniNEbg4
0.4038 (-8.6%)
0.4171 (-5.6%)
0.4403 (-0.3%)
0.4404 (-0.3%)
0.4422 (+0.1%)
Roc 5/70 0.4315
idf 3/120 0.4233
idf 5/100 0.4376</p>
        <p>UniNEhu2
0.4396 (+0.5%)
0.4677 (+6.9%)
0.4738 (+8.3%)
0.4726 (+8.0%)
0.4716 (+7.8%)</p>
      </sec>
      <sec id="sec-4-6">
        <title>Czech TD</title>
        <p>50 queries
idf 5/20 0.4070
Roc 5/70 0.3672
Roc 5/50 0.4085</p>
        <p>UniNEcz3
0.4136 (+1.2%)
0.3987 (-2.4%)
0.4131 (+1.1%)
0.4139 (+1.3%)
0.4225 (+3.4%)</p>
      </sec>
      <sec id="sec-4-7">
        <title>UniNEbg1 BG</title>
      </sec>
      <sec id="sec-4-8">
        <title>UniNEbg2 BG UniNEbg3 BG</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7 Conclusion</title>
      <p>In this eighth CLEF evaluation campaign we evaluated various probabilistic IR models using three different
test-collections written in three different East European languages, namely the Hungarian, Bulgarian and Czech
languages. We suggested a new stemmer for the Bulgarian language that removed some very frequent
derivational suffixes. For the Czech language, we designed and implemented three different stemmers.</p>
      <p>Our various experiments tend to demonstrate that the Okapi model or the IneC2 model derived from the
Divergence from Randomness (DFR) paradigm produces the best overall retrieval performance (see
Tables 2 to 4). The statistical language model (LM) used in our experiments usually results in retrieval
performance inferior to that obtained with the Okapi or DFR approaches.</p>
      <p>For the Bulgarian language (Table 2), our new and more aggressive stemmer tends to produce a better MAP
than a light stemming approach (a relative difference of 5.8%) and than the 4-gram indexing
scheme (12.9%). For the Hungarian language (Table 3), applying an automatic decompounding procedure
seems to improve the MAP by around 9.4% when compared to a word-based approach, or by around 7.8% when
compared to a 4-gram indexing scheme. For the Czech language, however, performance differences between a
light (inflectional only) stemmer and a more aggressive stemmer removing both inflectional and some
derivational suffixes were rather small (Table 4). Moreover, the performance differences were also small when
compared to those achieved with a 4-gram approach. Pseudo-relevance feedback (Rocchio’s model) improves
the MAP depending on the parameter settings (Tables 5 to 7). A data fusion strategy may clearly enhance the
retrieval performance for the Hungarian language (Table 9) and slightly for the two other languages.</p>
      <p>Acknowledgments</p>
      <p>The authors would like to thank the CLEF-2007 task organizers for their efforts in developing the various
European-language test-collections. The authors would also like to thank Samir Abdou for his help during the
implementation of the different stemmers within the Lucene system. This research was supported in part by the
Swiss National Science Foundation under Grant #200021-113273.</p>
      <sec id="sec-5-1">
        <title>Appendix 1: Parameter Settings</title>
        <sec id="sec-5-1-1">
          <title>Language Czech Bulgarian Hungarian</title>
          <p>b</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Appendix 3: Bulgarian Light Stemmer</title>
        <p>RemoveArticle(word) {
if (word ends with “-ът”) then remove “-ът” return;  # masculine
if (word ends with “-ят”) then  # masculine
if (word ends with “V+ят”) then replace by “-й”  # V – any vowel
else remove “-ят” return;
if (word ends with “-то”) then remove “-то” return;  # neutral
if (word ends with “-те”) then remove “-те” return;  # neutral
if (word ends with “-та”) then remove “-та” return;  # feminine
return;
}</p>
        <p>RemovePlural(word) {
if (word ends with “-ища”) then remove “-ища” return;  # for adjectives
if (word ends with “-ище”) then remove “-ище” return;  # for adjectives
if (word ends with “-овци”) then replace by “-о” return;  # for adjectives
if (word ends with “-евци”) then replace by “-е” return;  # for adjectives
if (word ends with “-ове”) then remove “-ове” return;  # masculine
if (word ends with “-еве”) then  # masculine
if (word ends with “V+еве”) then replace by “-й”
else remove “-еве” return;
if (word ends with “-та”) then remove “-та” return;  # feminine
if (word ends with “-..е.и”) then replace by “-.я.” return;  # rewriting rule, with . any character
return;
}</p>
        <p>Normalize(word) {
if (word ends with “-еи” or “-ии”) then remove “-еи” or “-ии”;  # normalize
if (word ends with “-я”) then
if (word ends with “V+я”) then replace by “-й”
else remove “-я”;
if (word ends with “-[аой]”) then remove “-[аой]”;  # adjectives
if (word ends with “-[еи]”) then remove “-[еи]”;  # adjectives
if (word ends with “-йн”) then replace by “-н” return;  # rewriting rule
if (word ends with “-LеC”) then replace by “-LC”;  # L – any letter, C – any consonant
if (word ends with “-LъL”) then replace by “-LL”;
return;
}</p>
        <p>Palatalization(word) {
if (word ends with “-ц” or “-ч”) then replace by “-к” return;
if (word ends with “-з” or “-ж”) then replace by “-г” return;
if (word ends with “-с” or “-ш”) then replace by “-х” return;
return;
}</p>
        <sec id="sec-5-2-1">
          <title>Appendix 4: Czech Light Stemmer</title>
          <p>RemovePossessives(word) {
if (word ends with “-ov”) then remove “-ov” return;
if (word ends with “-in”) then remove “-in” return;
if (word ends with “-ův”) then remove “-ův” return;
return;
}
Normalize(word) {
if (word ends with “čt”) then replace by “ck” return;
if (word ends with “št”) then replace by “sk” return;
if (word ends with “c” or “č”) then replace by “k” return;
if (word ends with “z” or “ž”) then replace by “h” return;
if (word ends with “.ů.”) then replace by “.o.” return;
return;
}
RemoveCase(word) {
if (word ends with “-atech”) then remove “-atech” return;
if (word ends with “-ětem”) then remove “-ětem” return;
if (word ends with “-etem”) then remove “-etem” return;
if (word ends with “-atům”) then remove “-atům” return;
if (word ends with “-ech”) then remove “-ech” return;
if (word ends with “-ich”) then remove “-ich” return;
if (word ends with “-ích”) then remove “-ích” return;
if (word ends with “-ého”) then remove “-ého” return;
if (word ends with “-ěmi”) then remove “-ěmi” return;
if (word ends with “-emi”) then remove “-emi” return;
if (word ends with “-ému”) then remove “-ému” return;
if (word ends with “-ěte”) then remove “-ěte” return;
if (word ends with “-ete”) then remove “-ete” return;
if (word ends with “-ěti”) then remove “-ěti” return;
if (word ends with “-eti”) then remove “-eti” return;
if (word ends with “-ího”) then remove “-ího” return;
if (word ends with “-iho”) then remove “-iho” return;
if (word ends with “-ími”) then remove “-ími” return;
if (word ends with “-ímu”) then remove “-ímu” return;
if (word ends with “-imu”) then remove “-imu” return;
if (word ends with “-ách”) then remove “-ách” return;
if (word ends with “-ata”) then remove “-ata” return;
if (word ends with “-aty”) then remove “-aty” return;
if (word ends with “-ých”) then remove “-ých” return;
if (word ends with “-ama”) then remove “-ama” return;
if (word ends with “-ami”) then remove “-ami” return;
if (word ends with “-ové”) then remove “-ové” return;
if (word ends with “-ovi”) then remove “-ovi” return;
if (word ends with “-ými”) then remove “-ými” return;
if (word ends with “-em”) then remove “-em” return;
if (word ends with “-es”) then remove “-es” return;
if (word ends with “-ém”) then remove “-ém” return;
if (word ends with “-ím”) then remove “-ím” return;
if (word ends with “-ům”) then remove “-ům” return;
if (word ends with “-at”) then remove “-at” return;
if (word ends with “-ám”) then remove “-ám” return;
if (word ends with “-os”) then remove “-os” return;
if (word ends with “-us”) then remove “-us” return;
if (word ends with “-ým”) then remove “-ým” return;
if (word ends with “-mi”) then remove “-mi” return;
if (word ends with “-ou”) then remove “-ou” return;
if (word ends with “-[aeiouyáéíýě]”) then remove “-[aeiouyáéíýě]” return;
return;
}</p>
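          <p>As an illustration only, the three routines above can be transcribed into a short Python sketch (function and variable names are ours; the pseudocode gives no minimum-stem-length guard, so none is added here). Since each routine applies the first matching rule and stops, the case-ending list must be scanned longest-first, exactly in the order given above.</p>

```python
# Illustrative transcription of the Czech stemmer pseudocode above.
# Each routine applies at most one rule: the first suffix that
# matches is handled and the (possibly shortened) word is returned.

# Case endings, longest first, exactly as listed in the pseudocode.
CASE_SUFFIXES = [
    "atech",
    "ětem", "etem", "atům",
    "ech", "ich", "ích", "ého", "ěmi", "emi", "ému",
    "ěte", "ete", "ěti", "eti", "ího", "iho", "ími",
    "ímu", "imu", "ách", "ata", "aty", "ých", "ama",
    "ami", "ové", "ovi", "ými",
    "em", "es", "ém", "ím", "ům", "at", "ám",
    "os", "us", "ým", "mi", "ou",
]
FINAL_VOWELS = "aeiouyáéíýě"   # the final single-vowel rule

def remove_possessives(word: str) -> str:
    """Strip one possessive suffix, if present."""
    for suffix in ("ov", "in", "ův"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def normalize(word: str) -> str:
    """Rewrite a palatalised final consonant (or penultimate 'ů')."""
    if word.endswith("čt"):
        return word[:-2] + "ck"
    if word.endswith("št"):
        return word[:-2] + "sk"
    if word.endswith(("c", "č")):
        return word[:-1] + "k"
    if word.endswith(("z", "ž")):
        return word[:-1] + "h"
    # the ".ů." rule: an "ů" in penultimate position becomes "o"
    if len(word) >= 3 and word[-2] == "ů":
        return word[:-2] + "o" + word[-1]
    return word

def remove_case(word: str) -> str:
    """Strip one case ending, or a single final vowel."""
    for suffix in CASE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    if word and word[-1] in FINAL_VOWELS:
        return word[:-1]
    return word
```

          <p>For example, remove_case("městech") yields "měst" via the “-ech” rule, and normalize("dům") yields "dom" via the “.ů.” rule.</p>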
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007a</year>
          ).
          <article-title>Monolingual experiments with Far-East Languages in NTCIR-6</article-title>
          .
          <source>In Proceedings NTCIR-6</source>
          , Tokyo: NII publication (National Institute of Informatics),
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Abdou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007b</year>
          ).
          <article-title>Searching in Medline: Stemming, query expansion, and manual indexing evaluation</article-title>
          .
          <source>Information Processing &amp; Management</source>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ),
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>New retrieval approaches using SMART</article-title>
          .
          <source>In Proceedings of TREC-4</source>
          , Gaithersburg: NIST Publication #500-236,
          <fpage>25</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Combination of multiple searches</article-title>
          .
          <source>In Proceedings TREC-2</source>
          , Gaithersburg: NIST Publication #500-215,
          <fpage>243</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Using language models for information retrieval</article-title>
          .
          <source>CTIT Ph.D. Thesis.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Term-specific smoothing for the language modeling approach to information retrieval</article-title>
          .
          <source>In Proceedings of the ACM-SIGIR</source>
          , Tampere: The ACM Press,
          <fpage>35</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Mayfield</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Character n-gram tokenization for European language text retrieval</article-title>
          .
          <source>IR Journal</source>
          ,
          <volume>7</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>73</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Peat</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Willett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>The limitations of term co-occurrence data for query expansion in document retrieval systems</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>5</issue>
          ),
          <fpage>378</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9b">
        <mixed-citation>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Experimentation as a way of life: Okapi at TREC</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Statistical inference in retrieval effectiveness evaluation</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>33</volume>
          (
          <issue>4</issue>
          ),
          <fpage>495</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval</article-title>
          . In C. Peters,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braschler</surname>
          </string-name>
          , M. Kluck (Eds.),
          <source>Comparative Evaluation of Multilingual Information Access Systems. LNCS #3237</source>
          . Berlin: Springer-Verlag,
          <fpage>322</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11b">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Comparative study of monolingual and multilingual search models for use with Asian languages</article-title>
          .
          <source>ACM Transactions on Asian Languages Information Processing</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Searching strategies for the Hungarian language</article-title>
          .
          <source>Information Processing &amp; Management</source>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Abdou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Experiments with monolingual, bilingual, and robust retrieval</article-title>
          . In C. Peters,
          <string-name>
            <given-names>F.C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          (Eds.).
          <source>Lecture Notes in Computer Science</source>
          . Berlin: Springer-Verlag, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Vogt</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Cottrell</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Fusion via a linear combination of scores</article-title>
          .
          <source>IR Journal</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>151</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>