<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain-Specific IR for German, English and Russian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claire Fautsch</string-name>
          <email>Claire.Fautsch@unine.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ljiljana Dolamic</string-name>
          <email>Ljiljana.Dolamic@unine.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samir Abdou</string-name>
          <email>Samir.Abdou@unine.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <email>Jacques.Savoy@unine.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Neuchatel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In participating in this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the Russian language, in order to obtain better retrieval effectiveness than that provided by the language-independent approach (n-gram). Our second objective is to measure more effectively the relative merit of various search engines when used for the German and, to a lesser extent, the English language. To do so we evaluate the GIRT-4 test-collection using the Okapi model, various IR models derived from the Divergence from Randomness (DFR) paradigm, and the statistical language model (LM), together with the classical tf.idf vector-processing scheme. We also evaluate different pseudo-relevance feedback approaches. For the Russian language, we find that word-based indexing with our light stemming procedure results in better retrieval effectiveness than does the 4-gram indexing strategy (relative difference around 30%). Using the GIRT corpora (available in German and English), we examine certain variations in retrieval effectiveness that result from applying the specialized thesaurus to automatically enlarge topic descriptions. In this case, the performance variations were relatively small and usually not significant.</p>
      </abstract>
      <kwd-group kwd-group-type="general-terms">
        <title>General Terms</title>
        <kwd>Experimentation</kwd>
        <kwd>Performance</kwd>
        <kwd>Measurement</kwd>
        <kwd>Algorithms</kwd>
      </kwd-group>
      <kwd-group>
        <kwd>Russian Language</kwd>
        <kwd>Thesaurus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In our domain-specific retrieval task we access the GIRT (German Indexing and Retrieval Test database)
corpus, composed of bibliographic records. These are mainly extracted from two social science sources: SOLIS
(social science literature) and FORIS (current research in social science fields), covering Europe's
German-speaking countries (Germany, Austria, and Switzerland). This collection has grown from 13,000 documents in
1996 to more than 150,000 in 2005, and we are making a continuous effort to enhance the number of documents
available; see
        <xref ref-type="bibr" rid="ref6">Kluck (2004)</xref>
        for a more complete description of this corpus.
      </p>
      <p>The fact that scientific documents may contain manually assigned keywords is of particular interest to us in our
work. These keywords are usually extracted from a controlled vocabulary by librarians who are knowledgeable about the
domain to which the indexed articles belong. Such descriptors should be helpful in improving document
surrogates and thus in extracting more pertinent information while at the same time discarding irrelevant
abstracts. Access to the underlying thesaurus should also improve the retrieval performance.</p>
      <p>The rest of this paper is organized as follows: Section 2 describes the main characteristics of the GIRT-4 and
ISISS test-collections. Section 3 outlines the main aspects of our stopword lists and light stemming procedures.
Section 4 analyses the principal features of various indexing and search strategies, and evaluates their use with the
available corpora. Section 5 presents our official runs and results.</p>
      <preformat>&lt;DOC&gt;
&lt;DOCNO&gt; GIRT-DE19908362
&lt;TITLE-DE&gt; Auswirkungen der Informationstechnologien auf die zukünftigen Beschäftigungs- und
Ausbildungsperspektiven in der EG
&lt;AUTHOR&gt; Riedel, Monika
&lt;AUTHOR&gt; Wagner, Michael
&lt;PUBLICATION-YEAR&gt; 1990
&lt;LANGUAGE-CODE&gt; DE
&lt;CONTROLLED-TERM-DE&gt; EG
&lt;CONTROLLED-TERM-DE&gt; Informationstechnologie
&lt;CONTROLLED-TERM-DE&gt; Beschäftigungsentwicklung
&lt;CONTROLLED-TERM-DE&gt; Berufsbildung
&lt;CONTROLLED-TERM-DE&gt; Qualifikationsanforderungen
&lt;METHOD-TERM-DE&gt; beschreibend
&lt;METHOD-TERM-DE&gt; Aktenanalyse
&lt;METHOD-TERM-DE&gt; Interpretation
&lt;CLASSIFICATION-TEXT-DE&gt; Arbeitsmarkt- und Berufsforschung
&lt;CLASSIFICATION-TEXT-DE&gt; Arbeitsmarktforschung
&lt;CLASSIFICATION-TEXT-DE&gt; Berufsforschung, Berufssoziologie
&lt;CLASSIFICATION-TEXT-DE&gt; Bildungswesen quartärer Bereich
&lt;ABSTRACT-DE&gt; Veränderungen der Qualifikationsanforderungen an Beschäftigte im IT-Sektor und
Zukunftsprojektionen. …</preformat>
    </sec>
    <sec id="sec-2">
      <title>2 Overview of Test-Collections</title>
      <p>
        In the domain-specific retrieval task (called GIRT), the two available corpora are composed of bibliographic
records extracted from various sources in the social sciences domain. Typical records (see Figure 1 for a German
example) in this corpus consist of a title (tag &lt;TITLE-DE&gt;), author name (tag &lt;AUTHOR&gt;), document language
(tag &lt;LANGUAGE-CODE&gt;), publication date (tag &lt;PUBLICATION-YEAR&gt;) and abstract (tag &lt;ABSTRACT-DE&gt;).
Manually assigned descriptors and classifiers are provided for all documents. An inspection of this German
corpus reveals that all bibliographic notices have a title, and that 96.4% of them have an abstract. In addition to
this information provided by the author, a typical record contains on average 10.15 descriptors
(“&lt;CONTROLLED-TERM-DE&gt;”), 2.02 classification terms (“&lt;CLASSIFICATION-TEXT-DE&gt;”), and 2.42
methodological terms (“&lt;METHOD-TEXT-DE&gt;” or “&lt;METHOD-TERM-DE&gt;”). The manually assigned
descriptors are extracted from a controlled list, the “Thesaurus for the Social Sciences” (or GIRT
Thesaurus). Finally, a unique identifier (“&lt;DOCNO&gt;”) is associated with each record.
        <xref ref-type="bibr" rid="ref6">Kluck (2004)</xref>
        provides a
more complete description of this corpus.
      </p>
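      <p>To make the record layout described above concrete, the following minimal sketch reads one record in the line-oriented form shown in Figure 1 and groups the field values by tag. The function name parse_record, the sample text and the assumption that every field starts on its own line with its tag are ours for illustration; the distributed GIRT files may be laid out differently.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: one field per line, written as "<TAG> value" as in Figure 1;
# a line without a leading tag continues the previous field.
TAG_LINE = re.compile(r"^<([A-Z-]+)\s*>\s*(.*)$")

# Shortened copy of the record shown in Figure 1 (illustration only).
SAMPLE_RECORD = """<DOC>
<DOCNO> GIRT-DE19908362
<TITLE-DE> Auswirkungen der Informationstechnologien auf die zukünftigen Beschäftigungs- und
Ausbildungsperspektiven in der EG
<PUBLICATION-YEAR> 1990
<CONTROLLED-TERM-DE> EG
<CONTROLLED-TERM-DE> Informationstechnologie
"""


def parse_record(text):
    """Return a dict mapping each tag to the list of values found for it."""
    fields = {}
    last_tag = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            last_tag, value = match.group(1), match.group(2)
            if value:  # structural tags such as <DOC> carry no value
                fields.setdefault(last_tag, []).append(value)
        elif last_tag in fields:
            # Continuation line: append to the most recent value of that tag.
            fields[last_tag][-1] += " " + line
    return fields


record = parse_record(SAMPLE_RECORD)
descriptor_count = len(record.get("CONTROLLED-TERM-DE", []))
```
]]></preformat>
      <p>Averaging descriptor_count over all records is one way to obtain per-record statistics of the kind reported above, such as the mean number of descriptors.</p>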
      <p>The above-mentioned German collection was translated into British English, mainly by professional
translators who are native English speakers. Included in all English records is a translated title (listed under
“&lt;TITLE-EN&gt;” in Figure 2), manually assigned descriptors (“&lt;CONTROLLED-TERM-EN&gt;”), classification terms
(“&lt;CLASSIFICATION-TEXT-EN&gt;”) and methodological terms (“&lt;METHOD-TEXT-EN&gt;”). Abstracts however
were not always translated (in fact they are available for only around 15% of the English records).</p>
      <p>In addition to this bilingual corpus, we also have access to the GIRT thesaurus. Figure 3 shows four typical
entries in this thesaurus. Each main entry includes the tag &lt;GERMAN&gt; followed by the descriptor
written in the German language. Its corresponding uppercase form without diacritics or “ß” appears under the tag
&lt;GERMAN-CAPS&gt;. The British English translation follows the label &lt;ENGLISH-TRANSLATION&gt;. The
hierarchical relationships between the different descriptors are shown under the labels &lt;BROADER-TERM&gt; (a
term having a broader semantic coverage) and &lt;NARROWER-TERM&gt; (a more specific term). The relationship
&lt;RELATED-TERM&gt; is used to provide additional pertinent descriptors (similar to the relationship “see also …”
often found in many controlled vocabularies). The tag &lt;USE-INSTEAD&gt; is used to redirect readers to another
entry (usually a synonym of an existing entry, or to indicate that an acronym exists). The tag
&lt;USE-COMBINATION&gt; is sometimes used to indicate a possible decompounded or simplified term variant, or
more generally a similar term. Usually, however, &lt;USE-COMBINATION&gt; is used like &lt;USE-INSTEAD&gt; to
refer from a non-descriptor to a descriptor, except that it usually points to more than one descriptor, to be used in
combination.</p>
      <p>The GIRT thesaurus contains 10,623 entries (all with both the tags &lt;GERMAN&gt; and &lt;GERMAN-CAPS&gt;)
together with 9,705 English translations. It also contains 2,947 &lt;BROADER-TERM&gt; relationships and 2,853
&lt;NARROWER-TERM&gt; links. The synonym relationship between terms can be expressed through the
&lt;USE-INSTEAD&gt; (2,153 links), &lt;RELATED-TERM&gt; (1,528) or &lt;USE-COMBINATION&gt; (3,263) tags.</p>
      <preformat>&lt;ENTRY&gt;
&lt;GERMAN&gt; Raumwahrnehmung
&lt;GERMAN-CAPS&gt; RAUMWAHRNEHMUNG
&lt;BROADER-TERM&gt; Wahrnehmung
&lt;RELATED-TERM&gt; Perspektive
&lt;ENGLISH-TRANSLATION&gt; spatial orientation
&lt;/ENTRY&gt;
&lt;ENTRY&gt;
&lt;GERMAN&gt; Volksabstimmung
&lt;GERMAN-CAPS&gt; VOLKSABSTIMMUNG
&lt;BROADER-TERM&gt; direkte Demokratie
&lt;NARROWER-TERM&gt; Volksbegehren
&lt;NARROWER-TERM&gt; Volksentscheid
&lt;ENGLISH-TRANSLATION&gt; plebiscite
&lt;/ENTRY&gt;
…
&lt;ENTRY&gt;
&lt;GERMAN&gt; Volksstamm
&lt;GERMAN-CAPS&gt; VOLKSSTAMM
&lt;USE-INSTEAD&gt; ethnische Gruppe
&lt;ENGLISH-TRANSLATION&gt; tribe
&lt;/ENTRY&gt;
&lt;ENTRY&gt;
&lt;GERMAN&gt; Wachstumspolitik
&lt;GERMAN-CAPS&gt; WACHSTUMSPOLITIK
&lt;USE-COMBINATION&gt; Wirtschaftspolitik
&lt;USE-COMBINATION&gt; Wirtschaftswachstum
&lt;ENGLISH-TRANSLATION&gt; policy of economic growth
&lt;/ENTRY&gt;</preformat>
      <p>During the indexing process, we retained all pertinent sections in order to build document representatives.
Additional information such as author name, publication date and the language in which the bibliographic notice
was written are of less importance, particularly from an IR perspective, and in our experiments they will be
ignored.</p>
      <p>As shown in Appendix 2, the available topics cover various subjects (e.g., Topic #176: “Sibling relations,”
Topic #178: “German-French relations after 1945,” Topic #196: “Tourism industry in Germany,” or Topic #199:
“European climate policy”), and some of them may cover a relatively large domain (e.g., Topic #187: “Migration
pressure”).</p>
    </sec>
    <sec id="sec-3">
      <title>3 Stopword Lists and Stemming Procedures</title>
      <p>
        During this evaluation campaign, we used the same stopword lists and stemmers that we selected for our
previous English and German language CLEF participation
        <xref ref-type="bibr" rid="ref10 ref9">(Savoy, 2004a)</xref>
        . For English we used the SMART
stemmer and stopword list (containing 571 items), while for German we applied our light stemmer (available at
http://www.unine.ch/info/clef/) and stopword list (603 words). For all our German experiments we applied our
decompounding algorithm
        <xref ref-type="bibr" rid="ref10 ref9">(Savoy, 2004b)</xref>
        .
      </p>
      <p>For the Russian language, we designed and implemented a new light stemmer that removes only inflectional
suffixes attached to nouns or adjectives. This stemmer applies 53 rules to remove the final suffix representing
gender (masculine, feminine, and neuter), number (singular, plural) and the six Russian grammatical cases
(nominative, accusative, genitive, dative, instrumental, and locative). The stemmer also applies three
normalization rules in order to correct certain variations that occur when a particular suffix is attached to a noun or
adjective. See Appendix 3 for a list of all of this new stemmer's rules.</p>
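      <p>As an illustration only, the rules listed in Appendix 3 can be sketched in Python as follows. The suffix list is transcribed from the appendix in its original order, so that longer suffixes are tried first. Three points are our assumptions rather than statements from the paper: that normalization is applied after case removal, that words are lowercased first, and that a suffix is never stripped when it makes up the whole word. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
# Suffixes transcribed from the RemoveCase rules of Appendix 3, in the listed
# order. The last rule of the appendix, "-[аеиоу]", is expanded into five
# single vowels, giving 57 strings for 53 rules.
CASE_SUFFIXES = [
    "иями", "оями", "оиев",
    "иях", "иям", "ями", "оям", "оях", "ами", "его", "ему", "ери", "ими",
    "иев", "ого", "ому", "ыми", "оев",
    "яя", "ях", "юю", "ая", "ах", "ею", "их", "ия", "ию", "ие", "ий", "им",
    "ое", "ом", "ой", "ов", "ые", "ый", "ым", "ми", "ою", "ую", "ям", "ых",
    "ея", "ам", "ее", "ей", "ем", "ев",
    "я", "ю", "й", "ы",
    "а", "е", "и", "о", "у",
]


def remove_case(word):
    """Strip the first matching inflectional suffix, if any."""
    for suffix in CASE_SUFFIXES:
        # Assumption: never reduce a word to the empty string.
        if word.endswith(suffix) and word != suffix:
            return word[: len(word) - len(suffix)]
    return word


def normalize(word):
    """Apply the first matching normalization rule of Appendix 3."""
    if word.endswith("ь"):
        return word[:-1]
    if word.endswith("и"):
        return word[:-1]
    if word.endswith("нн"):
        return word[:-1]  # "-нн" is replaced by "-н"
    return word


def light_stem(word):
    """Assumed pipeline: lowercase, remove the case suffix, then normalize."""
    return normalize(remove_case(word.lower()))


stems = [light_stem(w) for w in ("книга", "книгами", "новый", "нового")]
```
]]></preformat>
      <p>Under these assumptions the inflected forms of the same noun or adjective in the example are conflated to a common stem, which is the purpose of a light stemmer.</p>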
    </sec>
    <sec id="sec-4">
      <title>4 IR Models and Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Indexing and Search Strategies</title>
        <p>In order to obtain a broader view of the relative merit of various retrieval models, we may first adopt the
classical tf idf indexing scheme. In this case, the weight attached to each indexing term in a document surrogate
(or in a query) is composed of the term occurrence frequency (denoted <italic>tf<sub>ij</sub></italic> for indexing term <italic>t<sub>j</sub></italic> in document <italic>D<sub>i</sub></italic>) and
the inverse document frequency (denoted <italic>idf<sub>j</sub></italic>).</p>
        <p>
          In addition to this vector-processing model, we may also consider probabilistic models such as the Okapi
model (or BM25)
          <xref ref-type="bibr" rid="ref8">(Robertson et al., 2000)</xref>
          . As a second probabilistic approach, we may implement five variants
of the DFR (Divergence from Randomness) family suggested by
          <xref ref-type="bibr" rid="ref2">Amati &amp; van Rijsbergen (2002)</xref>
          . In this
framework, the indexing weight <italic>w<sub>ij</sub></italic> attached to term <italic>t<sub>j</sub></italic> in document <italic>D<sub>i</sub></italic> combines two information measures as
follows:
        </p>
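        <disp-formula>
          <tex-math>w_{ij} = \mathit{Inf}^{1}_{ij} \cdot \mathit{Inf}^{2}_{ij} = -\log_2\left[\mathit{Prob}^{1}_{ij}(tf)\right] \cdot \left(1 - \mathit{Prob}^{2}_{ij}(tf)\right)</tex-math>
        </disp-formula>
        <p>The first model called GL2 was based on the following equations:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathit{Prob}^{2}_{ij} = \frac{tfn_{ij}}{tfn_{ij} + 1} \quad \text{with} \quad tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + \frac{c \cdot \mathit{mean\ dl}}{l_i}\right]</tex-math>
        </disp-formula>
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathit{Prob}^{1}_{ij} = \left[\frac{1}{1 + \lambda_j}\right] \cdot \left[\frac{\lambda_j}{1 + \lambda_j}\right]^{tfn_{ij}} \quad \text{with} \quad \lambda_j = \frac{tc_j}{n}</tex-math>
        </disp-formula>
        <p>For the second model called PL2, <italic>Prob<sup>2</sup><sub>ij</sub></italic> was obtained from Equation 1, and <italic>Prob<sup>1</sup><sub>ij</sub></italic> was modified as:</p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>\mathit{Prob}^{1}_{ij} = \frac{e^{-\lambda_j} \cdot \lambda_j^{tf_{ij}}}{tf_{ij}!}</tex-math>
        </disp-formula>
        <p>For the third model called I(n)L2, we still used Equation 1 to compute <italic>Prob<sup>2</sup><sub>ij</sub></italic> but the implementation of <italic>Inf<sup>1</sup><sub>ij</sub></italic>
was modified as:</p>
        <disp-formula>
          <label>(4)</label>
          <tex-math>\mathit{Inf}^{1}_{ij} = tfn_{ij} \cdot \log_2\left[\frac{n + 1}{df_j + 0.5}\right]</tex-math>
        </disp-formula>
        <p>For the fourth model called PB2, the implementation of <italic>Prob<sup>1</sup><sub>ij</sub></italic> was obtained by Equation 3, and for evaluating
<italic>Prob<sup>2</sup><sub>ij</sub></italic> we used:</p>
        <disp-formula>
          <label>(5)</label>
          <tex-math>\mathit{Prob}^{2}_{ij} = 1 - \frac{tc_j + 1}{df_j \cdot (tf_{ij} + 1)}</tex-math>
        </disp-formula>
        <p>where <italic>tc<sub>j</sub></italic> represents the number of occurrences of term <italic>t<sub>j</sub></italic> in the collection, <italic>df<sub>j</sub></italic> the number of documents in which
the term <italic>t<sub>j</sub></italic> appears, and <italic>n</italic> the number of documents in the corpus. In our experiments, we fixed the constant
values according to the values given in Appendix 1.</p>
        <p>For the fifth model called I(n)B2, the implementation of <italic>Inf<sup>1</sup><sub>ij</sub></italic> was obtained from Equation 4 while <italic>Prob<sup>2</sup><sub>ij</sub></italic> was
provided by Equation 5.</p>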
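        <p>The way the five DFR variants combine Equations 1 to 5 can be summarized in a short sketch. This is our own illustrative reading of the formulas, not the code used in the experiments. The function and parameter names are hypothetical, and the numeric values in the example are invented for illustration.</p>
        <preformat><![CDATA[
```python
import math


def normalized_tf(tf, doc_len, mean_dl, c):
    """tfn_ij = tf_ij * log2(1 + (c * mean dl) / l_i), as in Equation 1."""
    return tf * math.log2(1 + (c * mean_dl) / doc_len)


def dfr_weight(model, tf, doc_len, tc, df, n, mean_dl, c):
    """w_ij = Inf1_ij * (1 - Prob2_ij) for one of the five DFR variants.

    tf: occurrences of the term in the document (integer)
    tc: occurrences of the term in the collection
    df: number of documents containing the term
    n:  number of documents in the corpus
    """
    tfn = normalized_tf(tf, doc_len, mean_dl, c)
    lam = tc / n

    # First information measure, Inf1 = -log2(Prob1) or Equation 4.
    if model == "GL2":                      # Equation 2
        prob1 = (1 / (1 + lam)) * (lam / (1 + lam)) ** tfn
        inf1 = -math.log2(prob1)
    elif model in ("PL2", "PB2"):           # Equation 3
        prob1 = math.exp(-lam) * lam ** tf / math.factorial(tf)
        inf1 = -math.log2(prob1)
    elif model in ("I(n)L2", "I(n)B2"):     # Equation 4
        inf1 = tfn * math.log2((n + 1) / (df + 0.5))
    else:
        raise ValueError("unknown DFR model: " + model)

    # Second measure: Equation 1 for the "...L2" models, Equation 5 otherwise.
    if model.endswith("L2"):
        prob2 = tfn / (tfn + 1)
    else:
        prob2 = 1 - (tc + 1) / (df * (tf + 1))

    return inf1 * (1 - prob2)


# Invented example values, for illustration only.
example = dict(tf=2, doc_len=100, tc=50, df=25, n=100, mean_dl=100, c=1)
weights = {m: dfr_weight(m, **example)
           for m in ("GL2", "PL2", "I(n)L2", "PB2", "I(n)B2")}
```
]]></preformat>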
        <p>
          Finally, we also considered an approach based on a statistical language model (LM)
          <xref ref-type="bibr" rid="ref4">(Hiemstra 2000; 2002)</xref>
          ,
known as a non-parametric probabilistic model (both Okapi and DFR are viewed as parametric models). Thus
probability estimates are not based on any known distribution (as in Equations 2 or 3), but are rather
estimated directly from the occurrence frequencies in document D or corpus C. Within this language model (LM)
paradigm, various implementations and smoothing methods might be considered, and in this study we adopted a
model proposed by
          <xref ref-type="bibr" rid="ref5">Hiemstra (2002)</xref>
          as described in Equation 6, which combines an estimate based on document
(P[tj | Di]) and on corpus (P[tj | C]).
        </p>
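        <disp-formula>
          <label>(6)</label>
          <tex-math>P[D_i \mid Q] = P[D_i] \cdot \prod_{t_j \in Q} \left[\lambda_j \cdot P[t_j \mid D_i] + (1 - \lambda_j) \cdot P[t_j \mid C]\right] \quad \text{with} \quad P[t_j \mid D_i] = \frac{tf_{ij}}{l_i}, \quad P[t_j \mid C] = \frac{df_j}{lc}, \quad lc = \sum_k df_k</tex-math>
        </disp-formula>
        <p>where <italic>λ<sub>j</sub></italic> is a smoothing factor (constant for all indexing terms <italic>t<sub>j</sub></italic>, and usually fixed at 0.35) and <italic>lc</italic> an estimate of the
size of the corpus C.</p>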
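        <p>A minimal sketch of Equation 6 follows, assuming a uniform document prior and simple dictionaries for the term statistics. The function name lm_score, its parameters and the toy numbers are hypothetical and serve only to show how the document and corpus estimates are mixed.</p>
        <preformat><![CDATA[
```python
def lm_score(query_terms, doc_tf, doc_len, df, lc, lam=0.35, prior=1.0):
    """Score one document for a query with the smoothed language model.

    doc_tf:  term -> occurrence frequency in the document (tf_ij)
    doc_len: length of the document (l_i)
    df:      term -> document frequency in the corpus (df_j)
    lc:      corpus size estimate, the sum of all document frequencies
    lam:     smoothing factor, the same for every term
    prior:   P[D_i]; a constant prior is an assumption of this sketch
    """
    score = prior
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len
        p_corpus = df.get(term, 0) / lc
        score *= lam * p_doc + (1 - lam) * p_corpus
    return score


# Toy statistics, invented for illustration.
df = {"a": 2, "b": 3}
lc = sum(df.values())
score = lm_score(["a", "b"], {"a": 1}, 4, df, lc)
```
]]></preformat>
        <p>In this toy case the term absent from the document still contributes through the corpus estimate, so the product does not collapse to zero.</p>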
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overall Evaluation</title>
        <p>To measure the retrieval performance, we adopted the mean average precision (MAP), computed on the basis
of 1,000 retrieved items per request by the new TREC-EVAL program. In the following tables, the best
performance under the given conditions (with the same indexing scheme and the same collection) is listed in bold
type. For the English corpus, our evaluation measures are lower than expected because our IR system
does not take the CSA collection into account.</p>
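        <p>For readers unfamiliar with the measure, the sketch below shows how average precision and MAP are commonly computed from a ranked list truncated at 1,000 items. It is not the TREC-EVAL implementation. The function names and the toy ranking are hypothetical.</p>
        <preformat><![CDATA[
```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision of one ranked list, truncated at `cutoff` items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two of three relevant documents are retrieved.
ap = average_precision(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
```
]]></preformat>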
        <p>From this table, we can see that the best performing model when using a word-based indexing strategy tends to
be the DFR I(n)B2 or the DFR GL2 model. With the 4-gram indexing approach, we may also include the LM
model in the set of the best performing schemes. The improvement over the medium query formulation (TD) is
greater than 25%, a clear and important enhancement. As shown in the last line, when comparing the word-based and
4-gram indexing systems, we can see that the relative difference is rather large (around 30%) and favors the
word-based approach.</p>
        <p>[Table residue: mean average precision by query formulation and number of queries for the models DFR PB2, DFR PL2, DFR GL2, DFR I(n)B2, DFR I(n)L2, LM (λ=0.35), Okapi and tf idf, followed by the mean of the top-7 best models and the % change over TD queries. The cell values were not recovered.]</p>
        <p>Our evaluation values differ from those computed according to the official measure, because the latter
always takes 25 queries into account.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Blind-Query Expansion</title>
        <p>[Table 3, partially recovered. Mean average precision for the Russian collection (22 queries): TD queries with word-based indexing and light stemming, and TDN queries with 4-gram indexing and no stemming. Surviving cells: 0.1498, 0.1433, 0.1672, 0.1277, 0.1229, 0.1470; relative changes: +31.3% and -28.7%.]</p>
        <p>[Table 5, partially recovered. Mean average precision for the English collection (25 queries) with Rocchio's blind query expansion, given as k documents / m terms. DFR PL2: 0.2978 without expansion, 0.3426 (10/50), 0.3390 (10/100), 0.3237 (10/200).]</p>
        <p>
          In an effort to improve search performance we examined pseudo-relevance feedback using Rocchio’s
formulation (denoted “Roc”)
          <xref ref-type="bibr" rid="ref3">(Buckley et al., 1996)</xref>
          with α = 0.75, β = 0.75, whereby the system was allowed to
add to the original query m terms extracted from the k best ranked documents. For the German corpus (Table 4),
the improvement ranged from +9.8% (Okapi, 0.2616 vs. 0.2872) to +21.8% (LM model, 0.2526 vs. 0.3076). For
the English collection (Table 5), Rocchio’s blind query expansion improved the MAP by between +3.6% (Okapi, 0.2549
vs. 0.2640) and +18.2% (LM model, 0.2603 vs. 0.3077). For the Russian language (Table 6), blind query expansion
may hurt the MAP (e.g., -21.3% with the DFR InB2 model, 0.1775 vs. 0.1397) or improve the retrieval
effectiveness (e.g., +8.9% with the LM model, 0.1511 vs. 0.1645). As another pseudo-relevance feedback
technique we applied our idf-based approach (denoted “idf” in Table 8)
          <xref ref-type="bibr" rid="ref1">(Abdou &amp; Savoy, 2007)</xref>
          .
        </p>
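        <p>Rocchio-style blind query expansion with α = β = 0.75 can be sketched as follows. The text does not state how the m added terms are selected or weighted, so this sketch assumes that candidate terms are ranked by their mean weight in the k top-ranked documents. The function name rocchio_expand and the toy weights are hypothetical.</p>
        <preformat><![CDATA[
```python
def rocchio_expand(query_weights, top_docs, m, alpha=0.75, beta=0.75):
    """Return the expanded query as a dict term -> weight.

    query_weights: term -> weight in the original query
    top_docs:      list of dicts (term -> weight), the k best-ranked documents
    m:             number of terms taken from the top documents
    """
    if not top_docs:
        return {t: alpha * w for t, w in query_weights.items()}

    # Mean weight of every term over the k top-ranked documents.
    k = len(top_docs)
    centroid = {}
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight / k

    # Assumption: keep the m terms with the largest mean weight,
    # breaking ties alphabetically so the result is deterministic.
    best = sorted(centroid.items(), key=lambda item: (-item[1], item[0]))[:m]

    expanded = {t: alpha * w for t, w in query_weights.items()}
    for term, weight in best:
        expanded[term] = expanded.get(term, 0.0) + beta * weight
    return expanded


# Toy example with k = 2 documents and m = 2 added terms.
expanded = rocchio_expand(
    {"a": 1.0},
    [{"a": 1.0, "b": 2.0}, {"b": 2.0, "c": 1.0}],
    m=2,
)
```
]]></preformat>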
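        <table-wrap>
          <label>Table 6</label>
          <caption>
            <p>Mean average precision for the Russian collection (22 queries, TD queries) using Rocchio’s pseudo-relevance feedback (PRF) with k documents / m terms</p>
          </caption>
          <table>
            <thead>
              <tr><th>IR model</th><th>k doc. / m terms</th><th>MAP</th></tr>
            </thead>
            <tbody>
              <tr><td>Okapi</td><td>without PRF</td><td>0.1630</td></tr>
              <tr><td>Okapi</td><td>5/50</td><td>0.1709</td></tr>
              <tr><td>Okapi</td><td>10/20</td><td>0.1712</td></tr>
              <tr><td>Okapi</td><td>10/60</td><td>0.1709</td></tr>
              <tr><td>DFR InB2</td><td>without PRF</td><td>0.1775</td></tr>
              <tr><td>DFR InB2</td><td>5/50</td><td>0.1397</td></tr>
              <tr><td>DFR InB2</td><td>10/20</td><td>0.1462</td></tr>
              <tr><td>DFR InB2</td><td>10/60</td><td>0.1477</td></tr>
              <tr><td>LM</td><td>without PRF</td><td>0.1511</td></tr>
              <tr><td>LM</td><td>5/50</td><td>0.1515</td></tr>
              <tr><td>LM</td><td>10/20</td><td>0.1614</td></tr>
              <tr><td>LM</td><td>10/10</td><td>0.1645</td></tr>
            </tbody>
          </table>
        </table-wrap>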
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Query Expansion Using a Specialized Thesaurus</title>
        <p>The GIRT collection has certain interesting aspects from an IR perspective. Each record has manually
assigned descriptors (see examples given in Figures 1 and 2) in order to provide more information on the semantic
contents of each bibliographic record. Additionally, the specialized thesaurus from which these descriptors are drawn is accessible (see
entry examples depicted in Figure 3).</p>
        <p>In an effort to improve the mean average precision, we used the GIRT thesaurus to automatically enlarge the
query. To achieve this, we considered each entry in the thesaurus as a document and then indexed it. We then took
each query in turn and used it to retrieve the thesaurus entries. Since the number of retrieved thesaurus entries was
relatively small, we simply added all these thesaurus entries to the query, forming a new and enlarged one.
Although certain terms occurring in the original query were repeated, in other cases this procedure added related
terms. If for example the topic included the country name “Deutschland”, our thesaurus-based query expansion
procedure might add the related terms “BRD” and “Bundesrepublik”. These two terms would usually be
helpful in retrieving more pertinent articles.</p>
        <p>Using the TD query formulation, MAP differences were relatively small (around -3.1% on average). We
believe that one possible explanation for this relatively small difference is that a query might be expanded with
frequently used terms that are not really effective in discriminating between relevant and irrelevant
items.</p>
        <p>[Table 7, partially recovered. Mean average precision for the German collection (TD queries, 25 queries), without and with the thesaurus, for the models DFR PB2, DFR PL2, DFR GL2, DFR I(n)B2, DFR I(n)L2, LM (λ=0.35), Okapi and tf idf, followed by the mean of the top-7 models and the % change. Surviving cells: 0.2558 and -3.1%.]</p>
        <p>[Table 8, partially recovered. Description of the official runs UniNEde1 to UniNEde4 (German), UniNEen1 to UniNEen4 (English) and UniNEru1 to UniNEru4 (Russian). Columns: run name, language, query formulation (TD or TDN), indexing unit (dec, word, wd/light or 4-gram), IR model (PL2, InB2, PB2, InL2, GL2, LM, Okapi) and query expansion (Roc or idf, with 5 or 10 documents and between 20 and 230 terms). The assignment of the individual cells to the runs was not recovered.]</p>
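        <p>The thesaurus-based query enlargement described in this section can be sketched as follows. Each thesaurus entry is treated as a small document, the entries sharing at least one term with the query are retrieved, and all their terms are appended to the query. Retrieval by simple term overlap is our simplifying assumption, since the text does not say which model was used for this step. The function names and the two sample entries are hypothetical.</p>
        <preformat><![CDATA[
```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def expand_query(query, thesaurus_entries):
    """Append the terms of every retrieved thesaurus entry to the query.

    thesaurus_entries: list of strings, one per entry, each holding the
    descriptor and its related terms as plain text.
    """
    query_terms = tokenize(query)
    expanded = list(query_terms)
    for entry in thesaurus_entries:
        entry_terms = tokenize(entry)
        # Assumption: an entry is "retrieved" when it shares a query term.
        if not set(query_terms).isdisjoint(entry_terms):
            expanded.extend(entry_terms)
    return expanded


# Hypothetical entries, flattened to plain text for illustration.
entries = [
    "Deutschland BRD Bundesrepublik",
    "Volksabstimmung Volksbegehren Volksentscheid",
]
enlarged = expand_query("Deutschland Tourismus", entries)
```
]]></preformat>
        <p>As in the procedure described above, terms already present in the query are repeated in the enlarged query, which raises their weight, while the other terms of the retrieved entries are new.</p>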
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Official Results</title>
      <p>… GIRT-4 collection and thus we have ignored the CSA corpus. The MAP values achieved for this language are
therefore clearly below the expected performance. Finally, for the Russian collection, Table 8 depicts the MAP
achieved when considering 22 queries and, in parentheses, the official MAP computed with 25 queries.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>For our participation in this domain-specific evaluation campaign, we propose a new light stemmer for the
Russian language. The resulting MAP (see Table 3) shows that for this Slavic language our approach may
produce a better MAP than a 4-gram approach (relative difference around 30%). For the German corpus, we try to
exploit the specialized thesaurus in order to improve the resulting MAP. The retrieval effectiveness difference is
rather small and we still need to analyze the reasons for obtaining so little difference (see Table 7). We believe that
a more specific query enrichment procedure is needed, one able to take the various term-term
relationships into account, along with the occurrence frequencies of the potential new search terms.</p>
      <p>When comparing the various IR models (see Table 2), we found that the I(n)B2 model derived from the
Divergence from Randomness (DFR) paradigm usually tends to result in the best performance. When analyzing
blind query expansion approaches (see Tables 4 to 6), we find that this type of automatic query expansion can
enhance the MAP, and that the improvement is clearly larger when using the LM model. Finally, for the Russian corpus,
this search strategy produces less improvement than for the English or German collections.</p>
      <p>Acknowledgments</p>
      <p>The authors would also like to thank the GIRT - CLEF-2007 task organizers for their efforts in developing
domain-specific test-collections. This research was supported in part by the Swiss National Science Foundation
under Grant #200021-113273.</p>
      <sec id="sec-6-1">
        <title>Appendix 1: Parameter Settings</title>
        <p>[Table residue: parameter settings by language and collection (German GIRT, English GIRT, Russian ISISS), including the constant b. The parameter values were not recovered.]</p>
        <sec id="sec-6-1-5">
          <title>Normalize(word) {</title>
          <p>if (word ends with “-ь”) then remove “-ь” return;
if (word ends with “-и”) then remove “-и”return;
if (word ends with “-нн”) then replace by “-н” return;
return;
RemoveCase (word) {
if (word ends with “-иями”) then remove “-иями” return;
if (word ends with “-оями”) then remove “-оями” return;
if (word ends with “-оиев”) then remove “-оиев” return;
if (word ends with “-иях”) then remove “-иях” return;
if (word ends with “-иям”) then remove “-иям” return;
if (word ends with “-ями”) then remove “-ями” return;
if (word ends with “-оям”) then remove “-оям” return;
if (word ends with “-оях”) then remove “-оях” return;
if (word ends with “-ами”) then remove “-ами” return;
if (word ends with “-его”) then remove “-его” return;
if (word ends with “-ему”) then remove “-ему” return;
if (word ends with “-ери”) then remove “-ери” return;
if (word ends with “-ими”) then remove “-ими” return;
if (word ends with “-иев”) then remove “-иев” return;
if (word ends with “-ого”) then remove “-ого” return;
if (word ends with “-ому”) then remove “-ому” return;
if (word ends with “-ыми”) then remove “-ыми” return;
if (word ends with “-оев”) then remove “-оев” return;
if (word ends with “-яя”) then remove “-яя” return;
if (word ends with “-ях”) then remove “-ях” return;
if (word ends with “-юю”) then remove “-юю” return;
if (word ends with “-ая”) then remove “-ая” return;
if (word ends with “-ах”) then remove “-ах” return;
if (word ends with “-ею”) then remove “-ею” return;
if (word ends with “-их”) then remove “-их” return;
if (word ends with “-ия”) then remove “-ия” return;
if (word ends with “-ию”) then remove “-ию” return;
if (word ends with “-ие”) then remove “-ие” return;
if (word ends with “-ий”) then remove “-ий” return;
if (word ends with “-им”) then remove “-им” return;
if (word ends with “-ое”) then remove “-ое” return;
if (word ends with “-ом”) then remove “-ом” return;
if (word ends with “-ой”) then remove “-ой” return;
if (word ends with “-ов”) then remove “-ов” return;
if (word ends with “-ые”) then remove “-ые” return;
if (word ends with “-ый”) then remove “-ый” return;
if (word ends with “-ым”) then remove “-ым” return;
if (word ends with “-ми”) then remove “-ми” return;
if (word ends with “-ою”) then remove “-ою” return;
if (word ends with “-ую”) then remove “-ую” return;
if (word ends with “-ям”) then remove “-ям” return;
if (word ends with “-ых”) then remove “-ых” return;
if (word ends with “-ея”) then remove “-ея” return;
if (word ends with “-ам”) then remove “-ам” return;
if (word ends with “-ее”) then remove “-ее” return;
if (word ends with “-ей”) then remove “-ей” return;
if (word ends with “-ем”) then remove “-ем” return;
if (word ends with “-ев”) then remove “-ев” return;
if (word ends with “-я”) then remove “-я” return;
if (word ends with “-ю”) then remove “-ю” return;
if (word ends with “-й”) then remove “-й” return;
if (word ends with “-ы”) then remove “-ы” return;
if (word ends with “-[аеиоу]”) then remove “-[аеиоу]” return;
}</p>
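          <p>The RemoveCase procedure above amounts to an ordered suffix lookup in which the first matching ending is removed and processing stops. The following Python sketch illustrates that reading. The names CASE_SUFFIXES and remove_case are hypothetical and do not come from the original implementation. The sketch assumes, as in the listing, that no minimum stem length is enforced, so a word that consists only of a listed ending is reduced to the empty string.</p>
          <preformat>
```python
# Endings in the order the RemoveCase listing tests them; the first match wins.
# Longer endings come first so that, e.g., "иями" is tried before "ями" or "ми".
CASE_SUFFIXES = (
    "иями", "оями", "оиев",
    "иях", "иям", "ями", "оям", "оях", "ами", "его", "ему", "ери",
    "ими", "иев", "ого", "ому", "ыми", "оев",
    "яя", "ях", "юю", "ая", "ах", "ею", "их", "ия", "ию", "ие",
    "ий", "им", "ое", "ом", "ой", "ов", "ые", "ый", "ым", "ми",
    "ою", "ую", "ям", "ых", "ея", "ам", "ее", "ей", "ем", "ев",
    "я", "ю", "й", "ы",
    # The final rule of the listing, "-[аеиоу]", expanded into single vowels.
    "а", "е", "и", "о", "у",
)


def remove_case(word: str) -> str:
    """Remove the first matching case ending; return the word unchanged if none matches."""
    for suffix in CASE_SUFFIXES:
        if word.endswith(suffix):
            # Assumption: no minimum stem length is checked, mirroring the listing.
            return word[: -len(suffix)]
    return word
```
          </preformat>
          <p>Under these assumptions, remove_case applied to “книгами” removes “-ами”, while a word with no listed ending, such as “стол”, is returned unchanged.</p>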
        </sec>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>