<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at CLEF 2008: TEL, Persian and Robust IR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ljiljana Dolamic</string-name>
          <email>Ljiljana.Dolamic@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire Fautsch</string-name>
          <email>Claire.Fautsch@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <email>Jacques.Savoy@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Experimentation</institution>
          ,
          <addr-line>Performance, Measurement, Algorithms</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Natural Language Processing</institution>
          ,
          <addr-line>Stemmer, Digital Libraries, Persian Language (Farsi), Robust Retrieval</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Neuchatel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <abstract>
        <p>In participating in this evaluation campaign, our first objective is to analyze the retrieval effectiveness when using TEL (The European Library) corpora composed of very short descriptions (library catalogue records) and to evaluate the retrieval effectiveness of several IR models. As a second objective we want to design and evaluate a stopword list and a light stemming strategy for the Persian language, a language belonging to the Indo-European family and having a relatively simple morphology. Finally, we participated in the robust track in an attempt to understand the difficulty involved in retrieving pertinent documents, even when the query and document representations share many common terms. Moreover, we made use of word sense disambiguation (WSD) information in order to reduce problems related to polysemy when matching topic and document representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>During the last few years, the IR group at the University of Neuchatel has been involved in designing,
implementing and evaluating IR systems for various natural languages, including both European and popular
Asian languages (namely, Chinese, Japanese, and Korean). Our main objective in this context is to promote
effective monolingual IR in those languages.</p>
      <p>The rest of this paper is organized as follows: Section 2 describes the main characteristics of the TEL corpus
used in the CLEF-2008 ad hoc track. Section 3 outlines the main aspects of different IR models used with TEL
collections together with the evaluation of our official runs and certain related experiments. Section 4 presents
the principal features of the Persian (Farsi) language, presents the stopword list and stemming strategy we
developed for this language and describes our official runs and results for this task. Our participation and results
concerning the robust task are outlined in Section 5, and Section 6 presents our main conclusions.</p>
      <p>With the TEL corpus, the challenge was to retrieve pertinent records, each composed of a very short description of the
referenced information item. Many records contain only a title, an author, and manually assigned
subject headings.</p>
      <p>Typical documents are shown in the tables below. Table 1a (British Library), Table 1b (Austrian National
Library), and Table 1c (Bibliothèque nationale de France) show descriptions that appear in different
languages. Table 1a shows a British Library record with a title (tag &lt;dc:title&gt;) in German and a subject in
English (tag &lt;dc:subject&gt;). Table 1c illustrates another example with the title (tag &lt;dc:title&gt;) and a part of the
description (tag &lt;dc:description&gt;) written in Latin.</p>
      <p>&lt;dc:identifier
xsi:type="dcterms:URI"&gt;http://catalogue.bl.uk/F/-?func=direct-docset&amp;amp;l_base=BLL01&amp;amp;from=TELgateway&amp;amp;doc_number=010624878&lt;/dc:identifier&gt;
&lt;mods:location&gt; British Library HMNTS YA.1992.b.771 &lt;/mods:location&gt;
&lt;/oai_dc:dc&gt; &lt;/document&gt; &lt;/record&gt;
&lt;record&gt; &lt;set&gt; TEL_BnF_opac &lt;/set&gt;
&lt;id&gt;oai:bnf.fr:catalogue/ark:/12148/cb30000394c/description&lt;/id&gt;
&lt;document format="index"&gt; &lt;index&gt; &lt;topic&gt;BnF_opac&lt;/topic&gt; &lt;/index&gt; &lt;/document&gt;
&lt;document format="dcx"&gt; &lt;oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
&lt;dc:identifier&gt;http://catalogue.bnf.fr/ark:/12148/cb30000394c/description&lt;/dc:identifier&gt;
&lt;dc:title&gt; Codex canonum vetus ecclesiae romanae a Francisco Pithoeo restitutus..&lt;/dc:title&gt;
&lt;dc:date&gt; 1687 &lt;/dc:date&gt;
&lt;dc:description&gt; Comprend : Apologeticus et epistolae &lt;/dc:description&gt;
&lt;dc:language&gt; lat &lt;/dc:language&gt;
&lt;dc:type xml:lang="fre"&gt; texte imprimé &lt;/dc:type&gt;
&lt;dc:type xml:lang="eng"&gt; printed text &lt;/dc:type&gt;
&lt;dc:type xml:lang="eng"&gt; text &lt;/dc:type&gt;
&lt;dc:rights xml:lang="fre"&gt; Catalogue en ligne de la Bibliothèque nationale de France &lt;/dc:rights&gt;
&lt;dc:rights xml:lang="eng"&gt; French National Library online Catalog &lt;/dc:rights&gt;
&lt;/oai_dc:dc&gt; &lt;/document&gt; &lt;/record&gt;
...
&lt;record&gt; &lt;set&gt; TEL_BnF_opac &lt;/set&gt;
&lt;id&gt;oai:bnf.fr:catalogue/ark:/12148/cb319212546/description&lt;/id&gt;
&lt;document format="index"&gt; &lt;index&gt; &lt;topic&gt;BnF_opac&lt;/topic&gt; &lt;/index&gt; &lt;/document&gt;
&lt;document format="dcx"&gt; &lt;oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
&lt;dc:identifier&gt;http://catalogue.bnf.fr/ark:/12148/cb319212546/description&lt;/dc:identifier&gt;
&lt;dc:title&gt; Ingénieux Hidalgo Don Quichotte de la Manche. Traduction nouvelle précédée d'une introduction
par Jean Babelon &lt;/dc:title&gt;
&lt;dc:creator&gt; Cervantes Saavedra, Miguel de (1547-1616) &lt;/dc:creator&gt;
&lt;dc:date&gt; 1929 &lt;/dc:date&gt;
&lt;dc:description&gt; Comprend : T. I. - Paris, A la Cité des Livres, 27, rue Saint-Sulpice. 1929. (16 mars.) In-8,
XXIX-...55 p. [5224] ; T. 3. - 1929, 422 p. ; T. 4. - 1929, 423 p. &lt;/dc:description&gt;
&lt;dc:language&gt; fre &lt;/dc:language&gt;
&lt;dc:type xml:lang="fre"&gt; texte imprimé &lt;/dc:type&gt;
&lt;dc:type xml:lang="eng"&gt; printed text &lt;/dc:type&gt;
&lt;dc:type xml:lang="eng"&gt; text &lt;/dc:type&gt;
&lt;dc:rights xml:lang="fre"&gt; Catalogue en ligne de la Bibliothèque nationale de France &lt;/dc:rights&gt;
&lt;dc:rights xml:lang="eng"&gt; French National Library online Catalog &lt;/dc:rights&gt;
&lt;/oai_dc:dc&gt; &lt;/document&gt;
&lt;/record&gt;</p>
      <p>TEL collection statistics are shown below in Table 2. The average size of each descriptor is relatively short
(between 10 and 16 distinct indexing terms), and similar across all three languages (perhaps a bit longer for the French corpus). During
the indexing process we retained only the following logical sections from the original documents: &lt;dc:title&gt;,
&lt;dc:description&gt;, &lt;dc:subject&gt;, and &lt;dcterms:alternative&gt;. From the topic descriptions we automatically
removed certain phrases such as “Relevant document report …” or “Relevante Dokumente berichten …”, etc.
All our runs were fully automatic.</p>
      <p>As shown in Appendix 2, the available topics cover various subjects (e.g., Topic #452: “Celtic Art,”
Topic #500: “Gauguin and Tahiti,” Topic #470: “Car Industry in Europe,” or Topic #498: “World War I
Aviation”). We were surprised to see that the topic descriptions do not contain many proper names (creators and
their works or geographical names). We found two topics with personal names (“Henry VIII” and “Gauguin”)
but 23 with geographical names (e.g., “Europe,” “Eastern,” “Bordeaux” or “Greek”). The expression used to
refer to a given location is not standardized, with various expressions being used to refer to a similar location
(e.g., “USA,” “North America,” or “America”). Also, time periods are infrequently used (7 topics) and many
include expressions having rather broad (e.g., “Modern,” “Ancient,” or “Roman”) or more precise (“World War
I”) interpretations.</p>
      <sec id="sec-1-1">
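        <title>Table 2: TEL collection statistics</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th /><th>English</th><th>French</th></tr>
            </thead>
            <tbody>
              <tr><td>Size</td><td>1.2 GB</td><td>1.3 GB</td></tr>
              <tr><td># of documents</td><td>1,000,100</td><td>1,000,100</td></tr>
              <tr><td># of distinct terms</td><td>9,087,132</td><td>15,189,862</td></tr>
              <tr><td colspan="3">Number of distinct indexing terms per document</td></tr>
              <tr><td>Mean</td><td>10</td><td>16</td></tr>
              <tr><td>Standard deviation</td><td>6</td><td>11</td></tr>
              <tr><td>Median</td><td>8</td><td>13</td></tr>
              <tr><td>Maximum</td><td>168</td><td>618</td></tr>
              <tr><td>Minimum</td><td>0</td><td>0</td></tr>
              <tr><td colspan="3">Number of indexing terms per document</td></tr>
              <tr><td>Mean</td><td>12</td><td>19</td></tr>
              <tr><td>Standard deviation</td><td>8</td><td>17</td></tr>
              <tr><td>Median</td><td>9</td><td>15</td></tr>
              <tr><td>Maximum</td><td>330</td><td>1,004</td></tr>
              <tr><td>Minimum</td><td>0</td><td>0</td></tr>
              <tr><td>Number of queries</td><td>50</td><td>50</td></tr>
              <tr><td>Number rel. items</td><td>2,533</td><td>1,339</td></tr>
              <tr><td>Mean rel. / request</td><td>50.66</td><td>26.78</td></tr>
              <tr><td>Standard deviation</td><td>44.85</td><td>33.77</td></tr>
              <tr><td>Median</td><td>32</td><td>16.5</td></tr>
              <tr><td>Maximum</td><td>190 (T #472)</td><td>224 (T #465)</td></tr>
              <tr><td>Minimum</td><td>7 (T #473)</td><td>3 (T #451)</td></tr>
            </tbody>
          </table>
        </table-wrap>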
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 IR models and Evaluation</title>
      <sec id="sec-2-1">
        <title>3.1 Indexing Approaches</title>
        <p>
          In defining our indexing strategies, we used a stopword list to denote very frequent forms having no important
impact on sense-matching between topic and document representatives (e.g., “the,” “in,” “or,” “has,” etc.). In
our experiments, the stopword lists contain 589 English, 484 French and 578 German terms. Diacritics were
replaced by their non-accented equivalents. We reused the light stemmers we developed for the
French and German languages, because removing the inflectional suffixes attached only to nouns and adjectives
tends to result in better retrieval effectiveness than more aggressive stemmers that also remove derivational
suffixes
          <xref ref-type="bibr" rid="ref19">(Savoy, 2006)</xref>
          . These stemmers and stopword lists are freely available at the Web site
www.unine.ch/info/clef. For the English language we tried both a light stemmer (the S-stemmer
proposed by
          <xref ref-type="bibr" rid="ref8">Harman (1991)</xref>
          that removes only the plural form '-s') and a more aggressive one
          <xref ref-type="bibr" rid="ref14">(Porter, 1980)</xref>
          based on a list of around 60 suffixes.
        </p>
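        <p>To illustrate how light the S-stemmer is, the following Python sketch applies the three plural rules usually attributed to Harman (1991). The function name s_stem and the exact exception lists are assumptions made for illustration, and the code is not necessarily the implementation used in these experiments.</p>
        <preformatted>
```python
def s_stem(word):
    """Apply the first matching plural rule; other words are returned unchanged."""
    w = word.lower()
    # Rule 1: "-ies" becomes "-y", except after "e" or "a" (assumed exception list).
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "-es" becomes "-e", except for "-aes", "-ees" and "-oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "-s", except for "-us" and "-ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w
```
</preformatted>
        <p>Under these rules "libraries" becomes "library" and "wars" becomes "war", while "glass" and "corpus" are left unchanged. Derivational suffixes such as "-ation" are never touched, which is what distinguishes this approach from Porter's stemmer.</p>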
        <p>
          In the German language, compound words are widely used. For example, a life insurance company employee
would be “Lebensversicherungsgesellschaftsangestellter” (“Leben” + 's' + “Versicherung” + 's' + “Gesellschaft”
+ 's' + “Angestellter” for life + insurance + company + employee). The augment (i.e. the letter 's' in our previous
example) is not always present (e.g., “Bankangestelltenlohn” combines “Bank” + “Angestellten” + “Lohn”
(salary)). Since compound construction is so widely used and written in many different forms, it is almost
impossible to compile a dictionary providing quasi-total coverage of the German language. Thus an effective IR
system including an automatic decompounding procedure for German had to be developed
          <xref ref-type="bibr" rid="ref3">(Braschler &amp;
Ripplinger, 2004)</xref>
          . In our experiments, we used our own automatic decompounding procedure
          <xref ref-type="bibr" rid="ref16">(Savoy, 2004)</xref>
          leaving both the compounds and their composite parts in the topic and document representatives.
        </p>
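        <p>The idea behind automatic decompounding can be sketched as follows. This is a minimal illustration that assumes a small word list (the hypothetical lexicon argument) and treats only the letter 's' as a possible augment; it is not the procedure described in Savoy (2004), whose details are not given here. The helper index_terms keeps both the compound and its parts, as described above.</p>
        <preformatted>
```python
def decompound(word, lexicon, min_len=3):
    """Return the component words of a compound, or [] if no full split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, then recurse on the remainder.
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:cut], word[cut:]
        if head not in lexicon:
            continue
        # The remainder may start with the augment 's', which is skipped.
        candidates = [tail]
        if tail.startswith("s"):
            candidates.append(tail[1:])
        for rest in candidates:
            parts = decompound(rest, lexicon, min_len)
            if parts:
                return [head] + parts
    return []


def index_terms(word, lexicon):
    """Keep the compound itself and add its parts when a split exists."""
    parts = decompound(word, lexicon)
    if len(parts) == 1:
        return [word.lower()]
    return [word.lower()] + parts
```
</preformatted>
        <p>With a toy lexicon containing "bank", "angestellten" and "lohn", the compound "Bankangestelltenlohn" is indexed under the full form plus its three parts. A realistic lexicon would be much larger and would need rules for the other German linking elements.</p>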
      </sec>
      <sec id="sec-2-2">
        <title>3.2 IR Models</title>
        <p>In order to obtain high mean average precision (MAP) values, we considered adopting different weighting schemes for the terms
included in documents or queries. This would allow us to account for term occurrence frequencies (denoted tf<sub>ij</sub>
for indexing term t<sub>j</sub> in document D<sub>i</sub>), as well as their inverse document frequency (denoted idf<sub>j</sub>). Moreover, we
considered normalizing each indexing weight using the cosine to obtain the classical tf.idf formulation.</p>
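        <p>
          In addition to this classical vector-space approach, we also considered probabilistic models such as the Okapi
(or BM25)
          <xref ref-type="bibr" rid="ref15">(Robertson et al., 2000)</xref>
          that also take document length into account. As a second probabilistic
approach, we implemented three variants of the DFR (Divergence from Randomness) family of models suggested
by
          <xref ref-type="bibr" rid="ref2">Amati &amp; van Rijsbergen (2002)</xref>
          . In this framework, the indexing weight w<sub>ij</sub> attached to term t<sub>j</sub> in document D<sub>i</sub>
combines two information measures as follows:
          <disp-formula><tex-math>w_{ij} = \mathrm{Inf}^{1}_{ij} \cdot \mathrm{Inf}^{2}_{ij} = -\log_2\left[\mathrm{Prob}^{1}_{ij}(tf)\right] \cdot \left(1 - \mathrm{Prob}^{2}_{ij}(tf)\right)</tex-math></disp-formula>
        </p>
        <p>As a first model, we implemented the PB2 scheme, defined by the following equations:
          <disp-formula><label>(1)</label><tex-math>\mathrm{Prob}^{1}_{ij} = \frac{e^{-\lambda_j} \cdot \lambda_j^{tf_{ij}}}{tf_{ij}!} \quad \text{with } \lambda_j = \frac{tc_j}{n}</tex-math></disp-formula>
          <disp-formula><label>(2)</label><tex-math>\mathrm{Prob}^{2}_{ij} = 1 - \frac{tc_j + 1}{df_j \cdot (tfn_{ij} + 1)} \quad \text{with } tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + \frac{c \cdot \mathrm{mean\ dl}}{l_i}\right]</tex-math></disp-formula>
where tc<sub>j</sub> indicates the number of occurrences of term t<sub>j</sub> in the collection, l<sub>i</sub> the length (number of indexing terms)
of document D<sub>i</sub>, mean dl the average document length, n the number of documents in the corpus, and c a constant
(the corresponding values are given in Appendix 1).</p>
        <p>For the second model called GL2, the implementation of Prob<sup>1</sup><sub>ij</sub> is given by Equation 3, and Prob<sup>2</sup><sub>ij</sub> is given
by Equation 4, as follows:
          <disp-formula><label>(3)</label><tex-math>\mathrm{Prob}^{1}_{ij} = \frac{1}{1 + \lambda_j} \cdot \left[\frac{\lambda_j}{1 + \lambda_j}\right]^{tfn_{ij}}</tex-math></disp-formula>
          <disp-formula><label>(4)</label><tex-math>\mathrm{Prob}^{2}_{ij} = \frac{tfn_{ij}}{tfn_{ij} + 1}</tex-math></disp-formula>
where λ<sub>j</sub> and tfn<sub>ij</sub> were defined previously.</p>
        <p>For the third model called I(n<sub>e</sub>)B2, the implementation was applied using the following two equations:
          <disp-formula><label>(5)</label><tex-math>\mathrm{Inf}^{1}_{ij} = tfn_{ij} \cdot \log_2\left[\frac{n + 1}{n_e + 0.5}\right] \quad \text{with } n_e = n \cdot \left[1 - \left(\frac{n - 1}{n}\right)^{tc_j}\right]</tex-math></disp-formula>
          <disp-formula><label>(6)</label><tex-math>\mathrm{Prob}^{2}_{ij} = 1 - \frac{tc_j + 1}{df_j \cdot (tfn_{ij} + 1)} \quad \text{with } tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + \frac{c \cdot \mathrm{mean\ dl}}{l_i}\right]</tex-math></disp-formula>
where n, tc<sub>j</sub> and tfn<sub>ij</sub> were defined previously, and df<sub>j</sub> indicates the number of documents in which the term t<sub>j</sub>
occurs.</p>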
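        <p>As a worked illustration of the DFR framework, the GL2 weight (Equations 3 and 4) could be computed as in the following Python sketch. The function and argument names are assumptions made for this example, and the constant c must be supplied by the caller since its values are only given in Appendix 1.</p>
        <preformatted>
```python
import math


def normalized_tf(tf, doc_len, mean_dl, c):
    """tfn = tf * log2(1 + (c * mean dl) / l_i)."""
    return tf * math.log2(1 + (c * mean_dl) / doc_len)


def gl2_weight(tf, doc_len, mean_dl, tc, n, c):
    """w = -log2(Prob1) * (1 - Prob2) with the GL2 definitions of Prob1 and Prob2."""
    lam = tc / n                      # lambda_j = tc_j / n
    tfn = normalized_tf(tf, doc_len, mean_dl, c)
    prob1 = (1 / (1 + lam)) * (lam / (1 + lam)) ** tfn
    prob2 = tfn / (tfn + 1)
    return -math.log2(prob1) * (1 - prob2)
```
</preformatted>
        <p>For a term occurring twice in a document of average length, with c = 1 and tc equal to n, we obtain tfn = 2, Prob1 = 0.125 and Prob2 = 2/3, hence a weight of 3 · (1/3) = 1.</p>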
        <p>
          Finally, we also considered an approach based on a statistical language model (LM)
          <xref ref-type="bibr" rid="ref9">(Hiemstra, 2000; 2002)</xref>
          ,
known as a non-parametric probabilistic model (the Okapi and DFR are viewed as parametric models).
Probability estimates would thus not be based on any known distribution (e.g., as in Equation 1 or 3), but rather
be directly estimated based on the term occurrence frequencies in document Di or corpus C. Within this
language model paradigm, various implementations and smoothing methods might be considered, although in
this study we adopted a model proposed by
          <xref ref-type="bibr" rid="ref10">Hiemstra (2002)</xref>
          , as described in Equation 7, combining an estimate
based on document (P[tj | Di]) and on corpus (P[tj | C]) corresponding to the Jelinek-Mercer smoothing approach.
        </p>
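        <p>
          <disp-formula><label>(7)</label><tex-math>P[D_i \mid Q] = P[D_i] \cdot \prod_{t_j \in Q} \left[\lambda_j \cdot P[t_j \mid D_i] + (1 - \lambda_j) \cdot P[t_j \mid C]\right]</tex-math></disp-formula>
with P[t<sub>j</sub> | D<sub>i</sub>] = tf<sub>ij</sub> / l<sub>i</sub>, P[t<sub>j</sub> | C] = df<sub>j</sub> / lc, and lc = ∑<sub>k</sub> df<sub>k</sub>,
where λ<sub>j</sub> is a smoothing factor (constant for all indexing terms t<sub>j</sub>, and usually fixed at 0.35) and lc an estimate of
the size of the corpus C.</p>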
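        <p>A direct transcription of Equation 7 might look like the following Python sketch. The data structures (dictionaries for term and document frequencies) and the uniform document prior are assumptions made for illustration only.</p>
        <preformatted>
```python
def lm_score(query_terms, doc_tf, doc_len, df, lam=0.35, prior=1.0):
    """Score one document for a query with Jelinek-Mercer smoothing (Equation 7).

    doc_tf maps terms to their frequency in the document, df maps terms to
    their document frequency in the corpus, and prior stands for P[D_i].
    """
    lc = sum(df.values())             # estimate of the corpus size
    score = prior
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len
        p_corpus = df.get(term, 0) / lc
        score *= lam * p_doc + (1 - lam) * p_corpus
    return score
```
</preformatted>
        <p>Because the corpus estimate is mixed in, a document that lacks one query term still receives a non-zero score. In practice the product is often replaced by a sum of logarithms to avoid numerical underflow with long queries.</p>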
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Overall Evaluation</title>
        <p>To measure retrieval performance, we adopted MAP values computed on the basis of 1,000 retrieved items
per request as calculated with the TREC_EVAL program. Using this evaluation tool, some evaluation differences
may occur in the values computed according to the official measure (the latter always takes 50 queries into
account while in our presentation we do not account for queries having no relevant items). In the following
tables, the best performance under the given conditions (with the same indexing scheme and the same collection)
is listed in bold type.</p>
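        <p>For readers unfamiliar with the measure, the following Python sketch shows how MAP could be computed under the convention used in our presentation (queries having no relevant items are ignored). It is an illustration with assumed function names and data structures, not the TREC_EVAL code.</p>
        <preformatted>
```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average of the precision values at the rank of each relevant item retrieved."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the average precision over the queries that have relevant items."""
    values = [
        average_precision(runs.get(query, []), relevant)
        for query, relevant in qrels.items()
        if relevant                    # skip queries having no relevant items
    ]
    return sum(values) / len(values)
```
</preformatted>
        <p>For a result list in which the two relevant items appear at ranks 1 and 3, the average precision is (1/1 + 2/3) / 2, i.e. about 0.83.</p>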
        <p>In the last lines we reported the average MAP over the 5 best IR models, together with the percentage variation
relative to the short (T) query formulation and relative to the S-stemmer (last line).
As depicted in the last lines, increasing the query size improves the MAP (by around +12.4% to
+14.7%). According to the average performance, the best indexing approach seemed to be
Porter's stemmer. In this case, the MAP with the TD query formulation was 0.3559 on average,
versus 0.3416 for the S-stemmer, a relative difference of 4.2%.</p>
        <sec id="sec-2-3-1">
          <title>Query</title>
          <p>[Table: MAP by query formulation and stemmer for the Okapi, DFR PB2, DFR GL2, DFR I(ne)B2, LM (λ=0.35) and tf.idf models, with the average over the 5 best IR models, the % change over T, and the % change over the S-stemmer.]</p>
          <p>In Table 4 we reported the MAP achieved by probabilistic models using the German collection with two
query formulations (T or TD), comparing the performance with and without our automatic decompounding
approach. The best IR model seemed to be the DFR PB2 (without decompounding) or the LM model when
applying our decompounding scheme. By adding terms to the topic descriptions, we were also able to improve
retrieval performance (by between 17.4% and 31.0%). Comparing the average performances shows that
applying an automatic decompounding approach improves retrieval effectiveness (see last line of Table 4, with
an average improvement of +46.8% for short query formulations, or +31.5% when considering TD queries).</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Query</title>
          <p>[Table 4: MAP for the German collection by query formulation (T or TD), with and without decompounding, for the Okapi, DFR PB2, DFR GL2, DFR I(ne)B2, LM (λ = 0.35) and tf.idf models, with the average and % change lines.]</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>German</title>
          <p>
            An analysis showed that pseudo-relevance feedback (PRF, also known as blind-query expansion) seemed to be a
useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach (denoted
“Roc” in the following tables)
            <xref ref-type="bibr" rid="ref5">(Buckley et al., 1996)</xref>
            with α = 0.75, β = 0.75, whereby the system was allowed to
add m terms, extracted from the k best-ranked documents, to the original query. From our previous experiments
we learned that this type of blind query expansion strategy does not always work well. More particularly, we
believe that including terms occurring frequently in the corpus (because they also appear in the top-ranked
documents) may introduce more noise, and thus be an ineffective means of discriminating between relevant and
non-relevant items
            <xref ref-type="bibr" rid="ref13">(Peat &amp; Willett, 1991)</xref>
            . Consequently we also chose to apply our idf-based query expansion
model (denoted “idf” in following tables)
            <xref ref-type="bibr" rid="ref1 ref6">(Abdou &amp; Savoy, 2008)</xref>
            .
          </p>
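          <p>The Rocchio expansion step can be sketched in Python as follows. The sketch assumes that the query and the documents are represented as dictionaries of term weights, and that the m terms are those having the highest weight in the centroid of the k best-ranked documents; the exact selection rule used in our runs is not reproduced here.</p>
          <preformatted>
```python
from collections import Counter


def rocchio_expand(query, top_docs, m, alpha=0.75, beta=0.75):
    """Return the expanded query: alpha * query + beta * centroid of top_docs.

    Only the m highest-weighted centroid terms are allowed to enter the
    query; original query terms are always kept.
    """
    k = len(top_docs)
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / k
    selected = [term for term, _ in centroid.most_common(m)]
    expanded = {}
    for term in list(query) + selected:
        expanded[term] = alpha * query.get(term, 0.0) + beta * centroid.get(term, 0.0)
    return expanded
```
</preformatted>
          <p>With k = 2 documents and m = 1, a query containing only "art" would gain the single centroid term with the largest weight, while lower-weighted terms from the same documents are ignored.</p>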
          <p>To evaluate these propositions, we applied certain probabilistic models and enlarged the query by adding
20 to 150 terms extracted from the 3 to 10 best-ranked articles contained in the English collection (Table 5), and
in both the French and German corpora (Table 6).</p>
        </sec>
        <sec id="sec-2-3-7">
          <title>Table 5: MAP using blind-query expansion (English collection)</title>
          <p>
            It is usually assumed that combining different search models may improve retrieval effectiveness
            <xref ref-type="bibr" rid="ref21">(Vogt &amp;
Cottrell, 1999)</xref>
            , for three main reasons. First there is a skimming process in which only the k top-ranked
retrieved items from each ranked list are considered. In this case, we would combine the best answers obtained
from various document representations (which would retrieve various pertinent items). Second we would count
on the chorus effect, by which different retrieval schemes would retrieve the same item, and as such provide
stronger evidence that the corresponding document was indeed relevant. Third, an opposite or dark horse effect
may also play a role, whereby a given retrieval model may provide unusually high (low) and accurate estimates
regarding a document's relevance. Thus, a combined system could possibly return more pertinent items by
accounting for documents having a relatively high (low) score, or when a relatively short (long) result list
occurs. Such a data fusion approach however requires more storage space and processing time. In the trade-off
between the advantages and drawbacks, it is unclear whether such approaches might be of any real commercial
interest.
          </p>
          <p>
            In this current study we combined three probabilistic models representing both the parametric (Okapi and
DFR) and non-parametric (language model or LM) approaches. To produce such a combination we evaluated
various fusion operators (see Table 7 for a detailed list of their descriptions). The “Sum RSV” operator for
example indicates that the combined document score (or the final retrieval status value) is simply the sum of the
retrieval status value (RSVk) of the corresponding document Dk computed by each single indexing scheme
            <xref ref-type="bibr" rid="ref7">(Fox
&amp; Shaw, 1994)</xref>
            . Table 7 thus illustrates how both the “Norm Max” and “Norm RSV” operators apply a normalization
procedure when combining document scores. When combining the retrieval status values (RSV<sub>k</sub>) of various
indexing schemes, and in order to favor the more effective retrieval schemes, we could multiply the document
score by a constant α<sub>i</sub> (usually equal to 1), reflecting the differences in retrieval performance.
          </p>
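          <table-wrap>
            <label>Table 7</label>
            <caption><p>Data fusion operators</p></caption>
            <table>
              <tbody>
                <tr><td>Sum RSV</td><td>SUM (α<sub>i</sub> · RSV<sub>k</sub>)</td></tr>
                <tr><td>Norm Max</td><td>SUM (α<sub>i</sub> · (RSV<sub>k</sub> / Max<sub>i</sub>))</td></tr>
                <tr><td>Norm RSV</td><td>SUM [α<sub>i</sub> · ((RSV<sub>k</sub> - Min<sub>i</sub>) / (Max<sub>i</sub> - Min<sub>i</sub>))]</td></tr>
                <tr><td>Z-Score</td><td>α<sub>i</sub> · [((RSV<sub>k</sub> - Mean<sub>i</sub>) / Stdev<sub>i</sub>) + δ<sub>i</sub>] with δ<sub>i</sub> = [(Mean<sub>i</sub> - Min<sub>i</sub>) / Stdev<sub>i</sub>]</td></tr>
              </tbody>
            </table>
          </table-wrap>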
          <p>
            In addition to using these data fusion operators, we also considered the round-robin approach, wherein we
took one document in turn from each individual list and removed any duplicates, retaining only the highest
ranking occurrence. Finally we suggested merging the retrieved documents according to the Z-Score, computed
for each result list. More details can be found in
            <xref ref-type="bibr" rid="ref17 ref18">Savoy &amp; Berger (2005)</xref>
            . In Table 7, Mini (Maxi) lists the
minimal (maximal) RSV value in the ith result list. Of course, we might also weight the relative contribution of
each retrieval scheme by assigning a different αi value to each retrieval model (fixed to 1 in all our experiments).
          </p>
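          <p>The operators of Table 7 can be sketched in Python as follows. The sketch assumes that each result list is a dictionary mapping document identifiers to RSV values, that the Z-Score contributions are summed over the lists like the other operators, and that the population standard deviation is used; these are choices made for illustration. Lists in which all scores are equal are not handled.</p>
          <preformatted>
```python
from statistics import mean, pstdev


def normalize(run, method):
    """Rescale the RSV values of one result list according to the operator."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if method == "sum_rsv":
        return dict(run)
    if method == "norm_max":
        return {d: s / hi for d, s in run.items()}
    if method == "norm_rsv":
        return {d: (s - lo) / (hi - lo) for d, s in run.items()}
    if method == "z_score":
        mu, sd = mean(scores), pstdev(scores)
        delta = (mu - lo) / sd
        return {d: (s - mu) / sd + delta for d, s in run.items()}
    raise ValueError("unknown fusion operator: " + method)


def fuse(runs, method, alphas=None):
    """Combine several result lists and return document ids by decreasing score."""
    alphas = alphas or [1.0] * len(runs)   # alpha_i fixed to 1 by default
    combined = {}
    for alpha, run in zip(alphas, runs):
        for doc_id, score in normalize(run, method).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + alpha * score
    return sorted(combined, key=combined.get, reverse=True)
```
</preformatted>
          <p>The example below shows why normalization matters: when one model produces scores on a larger scale, “Sum RSV” lets that model dominate the merged ranking, whereas “Norm RSV” gives each list the same weight. A call such as fuse([okapi_run, dfr_run, lm_run], "z_score") would merge three lists, where the three run variables are hypothetical.</p>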
        </sec>
        <sec id="sec-2-3-8">
          <title>Language / Query Model</title>
        </sec>
        <sec id="sec-2-3-9">
          <title>Okapi &amp; PRF doc/term DFR GL2</title>
          <p>DFR I(ne)B2
Official run name
Round-robin
Sum RSV
Norm Max
Norm RSV
Z-Score
English TD
50 queries</p>
        </sec>
        <sec id="sec-2-3-10">
          <title>Roc 5 docs / 10 terms</title>
          <p>idf 10 docs / 20 terms
idf 10 docs / 50 terms
Roc 10 docs / 10 terms
idf 10 docs / 20 terms
idf 10 docs / 50 terms
Roc 10 docs / 10 terms
Roc 5 docs / 50 terms
idf 5 docs / 50 terms</p>
        </sec>
        <sec id="sec-2-3-11">
          <title>Roc 5 docs / 10 terms</title>
          <p>idf 10 docs / 20 terms
idf 10 docs / 50 terms
Roc 10 docs / 10 terms
idf 5 docs / 10 terms
Roc 5 docs / 20 terms
Roc 5 docs / 50 terms</p>
        </sec>
        <sec id="sec-2-3-12">
          <title>Roc 5 docs / 20 terms</title>
          <p>idf 5 docs / 50 terms
idf 5 docs / 50 terms
idf 5 docs / 10 terms
idf 5 docs / 10 terms
Roc 5 docs / 20 terms
Roc 5 docs / 50 terms</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.5 Official Results</title>
        <p>The Persian (or Farsi) language is a member of the Indo-European family with relatively few morphological
variations. This year we used a corpus extracted from the newspaper Hamshahri, made available through the
efforts of the University of Tehran (http://ece.ut.ac.ir/dbrg/hamshahri/). As usual in various
evaluation campaigns, the corpus contains news articles (611 MB, for the years 1996 to 2002). This corpus
contains exactly 166,774 documents on a variety of subjects (politics, literature, art, economy, etc.) and
includes about 448,100 different words. Hamshahri articles vary between 1 KB and 140 KB in size, comprising
on average about 202 tokens (or 127 if we only count the number of word types). The corpus was encoded in
UTF-8 and written using the 28 Arabic letters plus an additional 4 letters for the Persian language.</p>
        <p>For the Persian language we first built a stopword list containing 884 terms. Unlike most other lists, this one
contains words most frequently occurring in the collection (determinants, prepositions, conjunctions, pronouns or
some auxiliary verb forms), plus a large number of suffixes already separated from word stems in the collection
(see examples given below).</p>
        <p>
          As a stemming strategy, we can use a morphological analysis
          <xref ref-type="bibr" rid="ref12">(Miangah, 2006)</xref>
          or our simple, fast and light
stemming approach that attempts to remove only noun and adjective inflections. In the Persian language, the
general pattern for inflectional suffixes is as follows: &lt;possessive&gt; &lt;plural&gt; &lt;other-suffix&gt; &lt;stem&gt;. In our light
stemming strategy, we usually removed possessive, plural and some of the suffixes marked as others. The
following examples of our light stemmer illustrate the relatively simple Persian morphology. From the plural
form درختان (“trees”), we can obtain درخت (“tree”). For the possessive form دستم (“my hand”), our stemmer will
return دست (“hand”), and for the form ايرانيان (“Iranians”) we obtain ايران (“Iran”). In this corpus we saw that
in some circumstances the suffixes might be written together or separated from the word, as in ڪشتيها and
ڪشتي ها (“boats”), or منزلها and منزل ها (“houses”). Adjectives are usually indeclinable whether used
attributively or as a predicate. When used as substantives, adjectives take the normal plural endings, while
comparative and superlative forms use the endings تر and ترين.
        </p>
        <p>The Persian language uses few case markers (the accusative case and certain specific genitive cases), unlike
the Latin, German or Hungarian languages. The accusative for the definite noun is marked by را, which can be
joined to the noun or written separately (e.g., مرد را for the noun “man”). The genitive case is expressed by
coupling two nouns by means of the particle known as ezafe (e.g., پسرِ مرد “man’s son”). As is usually
done in the English language, other relations are expressed by means of prepositions (e.g., in, with, etc.). Both
the stopword list and our light stemmer are freely available at http://www.unine.ch/info/clef/.</p>
        <sec id="sec-2-4-1">
          <title>Query</title>
          <p>[Table: MAP for the Persian collection by query formulation and stemmer for the Okapi, DFR PL2, DFR I(ne)C2, LM (λ=0.35) and tf.idf models, with the average over 4 IR models, the % change over T, and the % change over "none".]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 Robust Retrieval</title>
      <p>
        In the robust task
        <xref ref-type="bibr" rid="ref22">(Voorhees, 2006)</xref>
        , we were interested in learning why retrieving relevant items for a given
topic could be hard, even if the query contains certain common terms found in the relevant documents. In order
to evaluate various search techniques, we used a corpus created during recent CLEF evaluation campaigns. This
collection consists of articles published in 1994 in the newspaper Los Angeles Times, as well as articles extracted
from the Glasgow Herald and published in 1995. This collection contains a total of 169,477 documents (or
about 579 MB of data). On average each article contains about 250 (median: 191) content-bearing terms (not
counting commonly occurring words such as “the,” “of” or “in”). Typically, documents in this collection are
represented by a short title plus one to four paragraphs of text, and both American and British English spellings
can be found in the corpus. To compile the test set, we used the topics created during the CLEF 2003 campaign
(Topics #141 - #200) as well as queries from the 2005 (Topics #251 - #300) and 2006 (Topics #301 - #350)
evaluation campaigns. In this test set we found 153 queries able to return at least one relevant item from the
collection.
      </p>
      <p>This year we were interested in verifying whether word-sense disambiguation (WSD) might improve retrieval
effectiveness. For this reason the organizers provided us with a new version of both the document and topic
descriptions containing the correct lemma (entry in the dictionary) and SYNSET number(s) of the corresponding
entry in the WordNet thesaurus (version 1.6). Table 13 lists an example for the title of Topic #47. Under the
attribute LEMA the corresponding English dictionary entry is shown (therefore a stemming procedure is no longer
needed) and under the tag SYNSET, we can find both the score and the SYNSET number. The surface form is
indicated under the label &lt;WF&gt; and the Part-of-Speech (POS) tag is also available for each word.</p>
      <p>
        Various possibilities have been put forward to explain why certain successful IR systems may fail for some
queries
        <xref ref-type="bibr" rid="ref20 ref4">(Buckley, 2004; Savoy, 2007)</xref>
        . The organizers thought that polysemy (already known as a problem in
finding pertinent matches between query and document surrogates) could be partially resolved
by using the SYNSET information.
      </p>
      <p>
        Based on past experiments
        <xref ref-type="bibr" rid="ref1 ref6">(Dolamic &amp; Savoy, 2008)</xref>
        with this corpus and using the TD queries and Porter's
stemmer
        <xref ref-type="bibr" rid="ref14">(Porter, 1980)</xref>
        , we achieved MAP values ranging from 0.2216 with the tf · idf IR model to 0.4070 with the Okapi model
        <xref ref-type="bibr" rid="ref15">(Robertson et al., 2000)</xref>
        . With this last IR model, the set of hardest topics (defined as queries returning no relevant
items in the top 20) comprised seven topics, namely Topic #153 (“Olympic Games and Peace”), Topic
#301 (“Nestlé Brands”), Topic #320 (“Energy Crises”), Topic #188 (“German Spelling Reform”), Topic #258
(“Brain-Drain Impact”), Topic #309 (“Hard Drugs”), and Topic #322 (“Atomic Energy”).
      </p>
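      <p>The hard-topic criterion used above (no relevant item among the top 20 retrieved documents) can be checked mechanically for any run. The following is a minimal sketch in Python, not the evaluation code used for these experiments: the function names, the dictionary-based layout of the run and of the relevance judgments, and the toy identifiers are all assumptions made for illustration.</p>
      <preformat>
```python
def is_hard_topic(ranked_docs, relevant_docs, cutoff=20):
    """Return True when no relevant document appears in the first `cutoff` ranks."""
    return not any(doc in relevant_docs for doc in ranked_docs[:cutoff])


def hard_topics(run, qrels, cutoff=20):
    """List the topics of `run` that are hard under the top-`cutoff` criterion.

    run   : dict mapping a topic id to its ranked list of document ids (hypothetical layout)
    qrels : dict mapping a topic id to the set of relevant document ids (hypothetical layout)
    """
    return sorted(
        topic
        for topic, ranked_docs in run.items()
        if is_hard_topic(ranked_docs, qrels.get(topic, set()), cutoff)
    )


# Toy data (invented): topic "A" has its only relevant item at rank 44,
# topic "B" has a relevant item at rank 2.
toy_run = {
    "A": ["a%d" % i for i in range(1, 101)],
    "B": ["b1", "b2", "b3"],
}
toy_qrels = {"A": {"a44"}, "B": {"b2"}}

print(hard_topics(toy_run, toy_qrels))  # ['A']
```
      </preformat>
      <p>Counting the returned topics for two runs gives the kind of comparison reported here (for example, seven hard topics out of 153), provided both runs are evaluated against the same relevance judgments and the same cutoff.</p>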
      <sec id="sec-3-1">
        <title>Table 14: Description of our official robust runs</title>
        <p>Columns: Query, Index, Model, Query expansion, Single MAP and Comb MAP. The Index column takes the values POS, WSD, or WSD &amp; POS; the Model column takes the values I(ne)C2, Okapi or LM.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Official Results</title>
        <p>
In the current experiments, we generated six different runs using word-sense disambiguation information. As
shown in Table 14 above, we followed our combination strategy, taking into account the various probabilistic
models using different blind query expansion approaches. Our best results were achieved in the UniNERobust4
run with a MAP of 0.4515. Moreover, if we compare runs with and without word-sense disambiguation (WSD)
information (lemma, POS tags and SYNSET), we see no substantial differences (e.g., UniNERobust1 vs.
UniNERobust2, and UniNERobust4 vs. UniNERobust3).</p>
        <p>For our official runs (Table 14), we defined a hard topic as one for which the query resulted in a low average precision.
Using this definition, Table 16 lists the 10 topics having the lowest mean average precision. When all six runs
are considered we obtain: Topic #153 (“Olympic Games and Peace”), followed by Topic #343 (“South African
National Party”), Topic #313 (“Centenary Celebrations”), Topic #320 (“Energy Crises”), Topic #286 (“Football
Injuries”). In an attempt to explain why a topic was difficult, we might mention that for Topics #343 and #153
only one relevant document was retrieved. Based on our best run (UniNERobust4), this item was ranked low on
the retrieved list (44th for Topic #343 and 382nd for Topic #153) even though it contained a large number
of search terms.</p>
        <p>[Table 16: the 10 topics with the lowest average precision in each official run, e.g., UniNERobust2 and UniNERobust3.]</p>
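        <p>The average-precision view of hard topics discussed above can be illustrated with a short Python sketch. It is not the evaluation tool used for the official runs: the function names and the data layout are assumptions, and the worked values further assume that the single retrieved relevant item is also the only relevant item in the judgments, which the text does not state.</p>
        <preformat>
```python
def average_precision(ranked_docs, relevant_docs):
    """Non-interpolated average precision of one ranked list."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_docs)


def hardest_topics(run, qrels, n=10):
    """Return the n topics with the lowest average precision, hardest first."""
    scores = {t: average_precision(docs, qrels.get(t, set())) for t, docs in run.items()}
    return sorted(scores, key=lambda t: (scores[t], t))[:n]


# Toy ranked list of 1,000 invented document ids.
ranked = ["d%d" % i for i in range(1, 1001)]

# One relevant item at rank 44 or at rank 382 (assumed to be the only relevant item).
ap_44 = average_precision(ranked, {"d44"})
ap_382 = average_precision(ranked, {"d382"})
print(round(ap_44, 4), round(ap_382, 4))  # 0.0227 0.0026

toy_run = {"X": ranked, "Y": ranked}
toy_qrels = {"X": {"d44"}, "Y": {"d382"}}
print(hardest_topics(toy_run, toy_qrels, n=2))  # ['Y', 'X']
```
        </preformat>
        <p>Under these assumptions a single relevant item at rank 44 or 382 yields an average precision of 1/44 or 1/382, which shows why such topics end up at the bottom of a per-topic ranking.</p>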
        <p>In this ninth CLEF campaign we evaluated various probabilistic IR models using two different
test-collections, the first composed of short bibliographic notices extracted from the TEL corpora (written in the English,
German and French languages), and the second of newspaper articles written in the Persian language. For the
latter we also suggested a stopword list and a light stemming strategy.</p>
        <p>The results of our various experiments demonstrate that the I(ne)B2 or PB2 models (or I(ne)C2 for the Persian
language) derived from the Divergence from Randomness (DFR) paradigm and the LM model seem to provide
the best overall retrieval performances (see Tables 3, 4 and 11). The Okapi model used in our experiments
usually results in retrieval performances inferior to those obtained with the DFR or LM approaches.</p>
        <p>For the Persian language (Tables 11 and 12), our light stemmer tends to produce a better MAP than does the
4-gram indexing scheme (relative difference of 5.5%). On the other hand, the performance difference with an
approach ignoring a stemming stage is rather small.</p>
        <p>Using the TEL corpora, the pseudo-relevance feedback (Rocchio’s model) tends to hurt the retrieval
effectiveness (see Tables 5 or 6). A data fusion strategy may enhance the retrieval performance for the French
and German (Table 8) or Persian languages (Table 12), but not with the English corpus.</p>
        <p>In the robust track, using the blind query expansion and data fusion approaches (combining three different
probabilistic models), we were able to improve the MAP from 0.4086 (Okapi) to 0.4515. However, if we define
hard topics as queries for which we cannot find any relevant items listed in the top 20, then these two runs
produce the same number of hard topics (7 out of 153). Finally, the performance differences with and without
word-sense disambiguation (WSD) information are rather small.</p>
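        <p>To put the robust-track figures quoted above on a common scale, the following small Python sketch derives the relative MAP gain and the share of hard topics from the reported numbers. The helper name is an assumption; only the values 0.4086, 0.4515, 7 and 153 come from the text.</p>
        <preformat>
```python
def relative_change(baseline, value):
    """Relative difference of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# MAP of the Okapi run versus the best combined run, as reported in the text.
gain = relative_change(0.4086, 0.4515)

# Seven hard topics among the 153 queries having at least one relevant item.
share = 7 / 153

print("relative MAP gain: %.1f%%" % (100 * gain))      # relative MAP gain: 10.5%
print("share of hard topics: %.1f%%" % (100 * share))  # share of hard topics: 4.6%
```
        </preformat>
        <p>The mean improves by roughly one tenth while the number of hard topics stays unchanged, which is consistent with the gain coming mainly from topics that were already handled reasonably well.</p>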
        <p>Acknowledgments</p>
        <p>The authors would also like to thank the CLEF-2008 task organizers for their efforts in developing various
European language test-collections. This research was supported in part by the Swiss National Science
Foundation under Grant #200021-113273.</p>
        <sec id="sec-3-2-1">
          <title>Appendix 1: Parameter Settings</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Language</title>
        <p>English TEL
French TEL
German TEL
Persian word
Persian 4-gram
English Robust</p>
      </sec>
      <sec id="sec-3-4">
        <title>Topic Titles: TEL (C451 - C500)</title>
        <p>C451 Roman Military in Britain
C452 Celtic Art
C453 Bombing of Japanese Cities
C454 The Inquisition in Italy
C455 Irish Emigration to North America
C456 Women's Vote in the USA
C457 Big Game Hunting in Africa
C458 The Wives of Henry VIII
C459 Gardening for Children
C460 Scary Movies
C461 Ancient Greek Coins
C462 Israeli Secret Service
C463 Churches in France
C464 Piano Lessons
C465 Trade Unions
C466 Gay Fiction
C467 Formula One Drivers
C468 Modern Japanese Culture
C469 Scottish Music
C470 Car Industry in Europe
C471 Watchmaking
C472 Man in Space
C473 British Women Authors
C474 Journeys to Antarctica
C475 Eastern philosophy
C476 Contrastive Analysis of Electoral Systems
C477 Web Advertising
C478 Multilingual Upbringing
C479 Food Allergies
C480 Pilgrimage to Santiago de Compostela
C481 Famous Jazz Musicians
C482 Vegetarianism
C483 Solar Energy
C484 Soap-making
C485 Counterfeiting Money
C486 Pictures of Vintage Cars
C487 Jousting in the Middle Ages
C488 African Americans and the American Civil War
C489 Graphics Programming
C490 Bordeaux Wine Guides
C491 Salary Inequality between Sexes
C492 Homeopathic Cures for Children
C493 Recipes for Chocolate Desserts
C494 Youth Employment in Europe
C495 Women in the French Revolution
C496 Gods in Greek Mythology
C497 20th Century S. American Authors
C498 World War I Aviation
C499 Wonders of the Ancient World
C500 Gauguin and Tahiti</p>
      </sec>
      <sec id="sec-3-5">
        <title>Topic Titles: Persian (C551 - C600)</title>
        <p>C551 Wimbledon tennis cup
C552 Tehran’s stock market
C553 2002 world cup
C554 Stress and Health
C555 Road casualty statistics
C556 Nuclear energy regulations
C557 Iran football coaches
C558 Danger of solid oil
C559 Best Fajr film
C560 Iran economic sanction
C561 Gardening handbooks
C562 Reconstruction of Kandovan tunnel
C563 Mad cow disease
C564 Sport blood pressure
C565 Drought losses
C566 Prevention detection kidney diseases
C567 Population growth control
C568 Cell phone expansion
C569 Cases of economic corruption
C570 Iran dam construction
C571 Global oil economy
C572 Shajarian Concert
C573 Gross amount film cinema
C574 Champion team Iran first league
C575 PersPolis Club establishment date
C576 Iran Khodro company
C577 Anti-Cancer Drugs
C578 Traffic Congestion in Tehran
C579 Tehran International book festival
C580 Iranian presidential election
C581 Plane crashes
C582 Water shortage in Tehran
C583 Earthquake damages
C584 Oil price changes
C585 Air pollution control
C586 European football champion league final
C587 Development of Iranian software industry
C588 Chemical attacks
C589 Iranian carpet export
C590 Merchandise smuggling
C591 Global warming
C592 Widely used narcotics in Iran
C593 Masouleh (Masooleh) Province
C594 Aircraft ticket prices
C595 World cup South Korea Japan
C596 Iraqi weapons of mass destruction
C597 Tehran murders
C598 Serial Killings
C599 2nd of Khordad election
C600 Inflation in Iran</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Searching in Medline: Stemming, query expansion, and manual indexing evaluation</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ), p.
          <fpage>781</fpage>
          -
          <lpage>789</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp; van
          <string-name>
            <surname>Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ), p.
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ripplinger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>How effective is stemming and decompounding for German text retrieval</article-title>
          ?
          <source>IR Journal</source>
          ,
          <volume>7</volume>
          , p.
          <fpage>291</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Why current IR engines fail</article-title>
          .
          <source>Proceedings ACM-SIGIR'</source>
          <year>2004</year>
          , The ACM Press, p.
          <fpage>584</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>New retrieval approaches using SMART</article-title>
          .
          <source>In Proceedings of TREC-4</source>
          , Gaithersburg: NIST Publication #500-236, p.
          <fpage>25</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dolamic</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Monolingual and Bilingual Searches: Evaluation, Challenges and Failure Analysis</article-title>
          . Submitted.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Combination of multiple searches</article-title>
          .
          <source>In Proceedings TREC-2</source>
          , Gaithersburg: NIST Publication #500-215, p.
          <fpage>243</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>How effective is suffixing</article-title>
          ?
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Using language models for information retrieval</article-title>
          .
          <source>CTIT Ph.D. Thesis.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Term-specific smoothing for the language modeling approach to information retrieval</article-title>
          .
          <source>In Proceedings of the ACM-SIGIR</source>
          , The ACM Press, p.
          <fpage>35</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Mayfield</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Character n-gram tokenization for European language text retrieval</article-title>
          .
          <source>IR Journal</source>
          ,
          <volume>7</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>73</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Miangah</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Automatic lemmatization of Persian words</article-title>
          .
          <source>Journal of Quantitative Linguistics</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Peat</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Willett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>The limitations of term co-occurrence data for query expansion in document retrieval systems</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>378</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          (
          <year>1980</year>
          ).
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          , p.
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Experimentation as a way of life: Okapi at TREC</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Combining multiple strategies for effective monolingual and cross-lingual retrieval</article-title>
          .
          <source>IR Journal</source>
          ,
          <volume>7</volume>
          , p.
          <fpage>121</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>P.-Y.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Selection and merging strategies for multilingual information retrieval</article-title>
          . In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (Eds.):
          <source>Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science</source>
          : Vol.
          <volume>3491</volume>
          . Springer, Heidelberg, p.
          <fpage>27</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Bibliographic database access using free-text and controlled vocabulary: An evaluation</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>41</volume>
          (
          <issue>4</issue>
          ),
          <fpage>873</fpage>
          -
          <lpage>890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Light stemming approaches for the French, Portuguese, German and Hungarian languages</article-title>
          .
          <source>Proceedings ACM-SAC</source>
          , The ACM Press, p.
          <fpage>1031</fpage>
          -
          <lpage>1035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Why do successful search systems fail for some topics</article-title>
          ?
          <source>Proceedings ACM-SAC</source>
          , The ACM Press, p.
          <fpage>872</fpage>
          -
          <lpage>877</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Vogt</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cottrell</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Fusion via a linear combination of scores</article-title>
          .
          <source>IR Journal</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>151</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>The TREC 2005 robust track</article-title>
          .
          <source>ACM SIGIR Forum</source>
          ,
          <volume>40</volume>
          ,
          <year>2006</year>
          , p.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>