
UniNE at CLEF 2008: TEL, Persian and Robust IR

Ljiljana Dolamic

Ljiljana.Dolamic@unine.ch

Claire Fautsch

Claire.Fautsch@unine.ch

Jacques Savoy

Jacques.Savoy@unine.ch

Computer Science Department, University of Neuchatel, Switzerland

General Terms: Experimentation, Performance, Measurement, Algorithms

Keywords: Natural Language Processing, Stemmer, Digital Libraries, Persian Language (Farsi), Robust Retrieval

2008

In participating in this evaluation campaign, our first objective is to analyze retrieval effectiveness when using the TEL (The European Library) corpora, composed of very short descriptions (library catalogue records), and to evaluate the retrieval effectiveness of several IR models. As a second objective we want to design and evaluate a stopword list and a light stemming strategy for the Persian language, a language belonging to the Indo-European family and having a relatively simple morphology. Finally, we participated in the robust track in an attempt to understand the difficulty involved in retrieving pertinent documents, even when the query and document representations share many common terms. Moreover, we made use of word sense disambiguation (WSD) information in order to reduce problems related to polysemy when matching topic and document representations.

1 Introduction

During the last few years, the IR group at the University of Neuchatel has been involved in designing, implementing and evaluating IR systems for various natural languages, including both European and popular Asian languages (namely, Chinese, Japanese, and Korean). Our main objective in this context is to promote effective monolingual IR in those languages.

The rest of this paper is organized as follows: Section 2 describes the main characteristics of the TEL corpus used in the CLEF-2008 ad hoc track. Section 3 outlines the main aspects of the different IR models used with the TEL collections, together with the evaluation of our official runs and certain related experiments. Section 4 presents the principal features of the Persian (Farsi) language, introduces the stopword list and stemming strategy we developed for this language, and describes our official runs and results for this task. Our participation and results concerning the robust task are outlined in Section 5, and Section 6 presents our main conclusions.

With the TEL corpus, the challenge was to retrieve pertinent records composed of a very short description of the referred information item. Many records contain only a title, an author, and manually assigned subject headings.

Typical records are shown in Table 1a (British Library), Table 1b (Austrian National Library), and Table 1c (Bibliothèque nationale de France), which show that the descriptions may appear in different languages. Table 1a shows a British Library record with a title (tag <dc:title>) in German and a subject (tag <dc:subject>) in English. Table 1c illustrates another example in which the title (tag <dc:title>) and part of the description (tag <dc:description>) are written in Latin.

      <dc:identifier xsi:type="dcterms:URI">http://catalogue.bl.uk/F/-?func=direct-docset&amp;l_base=BLL01&from=TELgateway&doc_number=010624878</dc:identifier>
      <mods:location> British Library HMNTS YA.1992.b.771 </mods:location>
    </oai_dc:dc>
  </document>
</record>

<record>
  <set> TEL_BnF_opac </set>
  <id>oai:bnf.fr:catalogue/ark:/12148/cb30000394c/description</id>
  <document format="index">
    <index> <topic>BnF_opac</topic> </index>
  </document>
  <document format="dcx">
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://catalogue.bnf.fr/ark:/12148/cb30000394c/description</dc:identifier>
      <dc:title> Codex canonum vetus ecclesiae romanae a Francisco Pithoeo restitutus..</dc:title>
      <dc:date> 1687 </dc:date>
      <dc:description> Comprend : Apologeticus et epistolae </dc:description>
      <dc:language> lat </dc:language>
      <dc:type xml:lang="fre"> texte imprimé </dc:type>
      <dc:type xml:lang="eng"> printed text </dc:type>
      <dc:type xml:lang="eng"> text </dc:type>
      <dc:rights xml:lang="fre"> Catalogue en ligne de la Bibliothèque nationale de France </dc:rights>
      <dc:rights xml:lang="eng"> French National Library online Catalog </dc:rights>
    </oai_dc:dc>
  </document>
</record>

...

<record>
  <set> TEL_BnF_opac </set>
  <id>oai:bnf.fr:catalogue/ark:/12148/cb319212546/description</id>
  <document format="index">
    <index> <topic>BnF_opac</topic> </index>
  </document>
  <document format="dcx">
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://catalogue.bnf.fr/ark:/12148/cb319212546/description</dc:identifier>
      <dc:title> Ingénieux Hidalgo Don Quichotte de la Manche. Traduction nouvelle précédée d'une introduction par Jean Babelon </dc:title>
      <dc:creator> Cervantes Saavedra, Miguel de (1547-1616) </dc:creator>
      <dc:date> 1929 </dc:date>
      <dc:description> Comprend : T. I. - Paris, A la Cité des Livres, 27, rue Saint-Sulpice. 1929. (16 mars.) In-8, XXIX-...55 p. [5224] ; T. 3. - 1929, 422 p. ; T. 4. - 1929, 423 p. </dc:description>
      <dc:language> fre </dc:language>
      <dc:type xml:lang="fre"> texte imprimé </dc:type>
      <dc:type xml:lang="eng"> printed text </dc:type>
      <dc:type xml:lang="eng"> text </dc:type>
      <dc:rights xml:lang="fre"> Catalogue en ligne de la Bibliothèque nationale de France </dc:rights>
      <dc:rights xml:lang="eng"> French National Library online Catalog </dc:rights>
    </oai_dc:dc>
  </document>
</record>

Statistics for the TEL collections are shown in Table 2. Records are short on average (between 10 and 16 distinct indexing terms) and of similar size across all three languages (perhaps slightly longer in the French corpus). During the indexing process we retained only the following logical sections from the original documents: <dc:title>, <dc:description>, <dc:subject>, and <dcterms:alternative>. From the topic descriptions we automatically removed certain phrases such as “Relevant document report …” or “Relevante Dokumente berichten …”. All our runs were fully automatic.
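The field selection described above can be sketched as follows. This is a minimal illustration, not our indexing code: the function name `indexable_text` and the shortened sample record are hypothetical, and the `dcterms` namespace URI is assumed to be the standard Dublin Core terms namespace (only the `dc` URI appears in the sample records).

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. The "dc" URI is taken from the sample records; the
# "dcterms" URI is an assumption (standard Dublin Core terms namespace).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Logical sections retained for indexing.
KEPT_FIELDS = ("dc:title", "dc:description", "dc:subject", "dcterms:alternative")


def indexable_text(record_xml: str) -> str:
    """Concatenate the text of the retained fields of one catalogue record."""
    root = ET.fromstring(record_xml)
    parts = []
    for field in KEPT_FIELDS:
        for element in root.iterfind(f".//{field}", NS):
            if element.text and element.text.strip():
                parts.append(element.text.strip())
    return " ".join(parts)


# Shortened, hypothetical record built from the fields shown in Table 1c.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title> Codex canonum vetus ecclesiae romanae </dc:title>
  <dc:date> 1687 </dc:date>
  <dc:description> Comprend : Apologeticus et epistolae </dc:description>
  <dc:language> lat </dc:language>
</record>"""

print(indexable_text(SAMPLE))
```

Fields such as <dc:date> or <dc:language> are simply ignored, so the text passed to the indexer is limited to the four retained sections.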

As shown in Appendix 2, the available topics cover various subjects (e.g., Topic #452: “Celtic Art,” Topic #500: “Gauguin and Tahiti,” Topic #470: “Car Industry in Europe,” or Topic #498: “World War I Aviation”). We were surprised to see that the topic descriptions do not contain many proper names (creators and their works or geographical names). We found two topics with personal names (“Henry VIII” and “Gauguin”) but 23 with geographical names (e.g., “Europe,” “Eastern,” “Bordeaux” or “Greek”). The expression used to refer to a given location is not standardized, with various expressions being used to refer to a similar location (e.g., “USA,” “North America,” or “America”). Also, time periods are infrequently used (7 topics) and many include expressions having rather broad (e.g., “Modern,” “Ancient,” or “Roman”) or more precise (“World War I”) interpretations.

Table 2: TEL collection statistics

                                      English            French
Size                                  1.2 GB             1.3 GB
# of documents                        1,000,100          1,000,100
# of distinct terms                   9,087,132          15,189,862

Number of distinct indexing terms per document
  Mean                                10                 16
  Standard deviation                  6                  11
  Median                              8                  13
  Maximum                             168                618
  Minimum                             0                  0

Number of indexing terms per document
  Mean                                12                 19
  Standard deviation                  8                  17
  Median                              9                  15
  Maximum                             330                1004
  Minimum                             0                  0

Number of queries                     50                 50
Number rel. items                     2,533              1,339
Mean rel. / request                   50.66              26.78
  Standard deviation                  44.85              33.77
  Median                              32                 16.5
  Maximum                             190 (T #472)       224 (T #465)
  Minimum                             7 (T #473)         3 (T #451)

3 IR models and Evaluation

3.1 Indexing Approaches

In defining our indexing strategies, we used a stopword list to denote very frequent forms having no important impact on sense-matching between topic and document representatives (e.g., “the,” “in,” “or,” “has,” etc.). In our experiments, the stopword list contains 589 English, 484 French and 578 German terms. Diacritics were replaced by their corresponding non-accented equivalents. We reused the light stemmers we developed for the French and German languages, because removing only the inflectional suffixes attached to nouns and adjectives tends to result in better retrieval effectiveness than more aggressive stemmers that also remove derivational suffixes (Savoy, 2006). These stemmers and stopword lists are freely available at the Web site www.unine.ch/info/clef. For the English language we tried both a light stemmer (the S-stemmer proposed by Harman (1991), which removes only the plural form '-s') and a more aggressive one (Porter, 1980) based on a list of around 60 suffixes.
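As an illustration of this indexing chain (stopword removal, diacritic replacement, light stemming), the following sketch may help. It rests on several assumptions: the stopword set is a tiny illustrative sample rather than the 589-term English list, the three rules are a common rendering of Harman's S-stemmer, and the names `strip_diacritics`, `s_stem` and `index_terms` are hypothetical.

```python
import re
import unicodedata

# Tiny illustrative stopword sample; the real lists are far larger.
STOPWORDS = {"the", "in", "or", "has", "of", "and"}


def strip_diacritics(text: str) -> str:
    """Replace accented letters by their non-accented equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def s_stem(word: str) -> str:
    """Plural removal in the spirit of Harman's S-stemmer (assumed rule set)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


def index_terms(text: str) -> list:
    """Lowercase, remove diacritics and stopwords, then apply the light stemmer."""
    tokens = re.findall(r"[a-z0-9]+", strip_diacritics(text.lower()))
    return [s_stem(token) for token in tokens if token not in STOPWORDS]


print(index_terms("Bibliothèques in the cities"))
```

Because only plural endings are touched, derivationally related forms (e.g., "gardening" and "garden") stay distinct, which is the intended behaviour of a light stemmer.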

In the German language, compound words are widely used. For example, a life insurance company employee would be “Lebensversicherungsgesellschaftsangestellter” (“Leben” + 's' + “Versicherung” + 's' + “Gesellschaft” + 's' + “Angestellter” for life + insurance + company + employee). The augment (i.e. the letter 's' in our previous example) is not always present (e.g., “Bankangestelltenlohn” combines “Bank” + “Angestellten” + “Lohn” (salary)). Since compound construction is so widely used and written in many different forms, it is almost impossible to compile a dictionary providing quasi-total coverage of the German language. Thus an effective IR system including an automatic decompounding procedure for German had to be developed (Braschler & Ripplinger, 2004) . In our experiments, we used our own automatic decompounding procedure (Savoy, 2004) leaving both the compounds and their composite parts in the topic and document representatives.
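To make the idea of decompounding concrete, here is a minimal dictionary-based sketch. It is not the procedure of Savoy (2004): the greedy longest-prefix strategy, the handling of the linking 's' only, and the toy lexicon `LEXICON` are all assumptions made for illustration.

```python
# Toy lexicon (hypothetical); a real system would use a large word list.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter",
           "bank", "angestellten", "lohn"}


def decompound(word, lexicon, min_len=3):
    """Split a compound into lexicon words, allowing a linking 's'.

    Returns a list of parts, or None when no complete split exists.
    """
    def split(rest):
        if not rest:
            return []
        # Try the longest candidate head first.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head not in lexicon:
                continue
            tail = rest[end:]
            candidates = [tail]
            if tail.startswith("s") and len(tail) > 1:
                candidates.append(tail[1:])  # skip the augment 's'
            for candidate in candidates:
                parts = split(candidate)
                if parts is not None:
                    return [head] + parts
        return None

    return split(word.lower())


def index_forms(word, lexicon):
    """Keep both the compound and its components, as done in our indexing."""
    parts = decompound(word, lexicon)
    if parts and len(parts) > 1:
        return [word.lower()] + parts
    return [word.lower()]


print(decompound("Lebensversicherungsgesellschaftsangestellter", LEXICON))
print(index_forms("Bankangestelltenlohn", LEXICON))
```

Keeping the original compound next to its parts lets a query on “Bank” match the document while an exact query on the full compound still scores on the unsplit form.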

3.2 IR Models

In order to obtain high MAP values, we considered adopting different weighting schemes for the terms included in documents or queries. This would allow us to account for term occurrence frequencies (denoted tfij for indexing term tj in document Di), as well as their inverse document frequency (denoted idfj). Moreover, we considered normalizing each indexing weight using the cosine to obtain the classical tf.idf formulation.
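A small sketch of the classical tf.idf weighting with cosine normalization follows. The exact idf variant is not stated in this paper, so idf = ln(n / df) is an assumption, as are the function name and the toy frequencies.

```python
import math


def tfidf_cosine(doc_tf, df, n_docs):
    """Return cosine-normalized tf.idf weights for one document.

    doc_tf: term -> occurrence frequency in the document
    df:     term -> number of documents containing the term
    idf is assumed to be ln(n_docs / df).
    """
    raw = {term: tf * math.log(n_docs / df[term]) for term, tf in doc_tf.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: (weight / norm if norm else 0.0) for term, weight in raw.items()}


# Hypothetical frequencies for a two-term record.
weights = tfidf_cosine({"celtic": 2, "art": 1}, {"celtic": 10, "art": 100}, 1000)
print(weights)
```

After normalization the weight vector has unit length, so long and short records become comparable under the inner product.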

In addition to this classical vector-space approach, we also considered probabilistic models such as the Okapi (or BM25) (Robertson et al., 2000) that also take document length into account. As a second probabilistic approach, we implemented three variants of the DFR (Divergence from Randomness) family of models suggested by Amati & van Rijsbergen (2002). In this framework, the indexing weight $w_{ij}$ attached to term $t_j$ in document $D_i$ combines two information measures as follows:

\[ w_{ij} = \mathrm{Inf}^1_{ij} \cdot \mathrm{Inf}^2_{ij} = -\log_2\left[\mathrm{Prob}^1_{ij}(tf)\right] \cdot \left(1 - \mathrm{Prob}^2_{ij}(tf)\right) \]

As a first model, we implemented the PB2 scheme, defined by the following equations:

\[ \mathrm{Prob}^1_{ij} = \frac{e^{-\lambda_j} \cdot \lambda_j^{\mathrm{tf}_{ij}}}{\mathrm{tf}_{ij}!} \quad \text{with } \lambda_j = \frac{\mathrm{tc}_j}{n} \tag{1} \]

\[ \mathrm{Prob}^2_{ij} = 1 - \frac{\mathrm{tc}_j + 1}{\mathrm{df}_j \cdot (\mathrm{tfn}_{ij} + 1)} \quad \text{with } \mathrm{tfn}_{ij} = \mathrm{tf}_{ij} \cdot \log_2\left[1 + \frac{c \cdot \mathrm{mean\ dl}}{l_i}\right] \tag{2} \]

where $\mathrm{tc}_j$ indicates the number of occurrences of term $t_j$ in the collection, $l_i$ the length (number of indexing terms) of document $D_i$, mean dl the average document length, $n$ the number of documents in the corpus, and $c$ a constant (the corresponding values are given in Appendix 1).

For the second model called GL2, the implementation of $\mathrm{Prob}^1_{ij}$ is given by Equation 3, and $\mathrm{Prob}^2_{ij}$ is given by Equation 4, as follows:

\[ \mathrm{Prob}^1_{ij} = \frac{1}{1 + \lambda_j} \cdot \left[\frac{\lambda_j}{1 + \lambda_j}\right]^{\mathrm{tfn}_{ij}} \tag{3} \]

\[ \mathrm{Prob}^2_{ij} = \frac{\mathrm{tfn}_{ij}}{\mathrm{tfn}_{ij} + 1} \tag{4} \]

where $\lambda_j$ and $\mathrm{tfn}_{ij}$ were defined previously.

For the third model called I(ne)B2, the implementation was applied using the following two equations:

\[ \mathrm{Inf}^1_{ij} = \mathrm{tfn}_{ij} \cdot \log_2\left[\frac{n + 1}{n_e + 0.5}\right] \quad \text{with } n_e = n \cdot \left[1 - \left(\frac{n - 1}{n}\right)^{\mathrm{tc}_j}\right] \tag{5} \]

\[ \mathrm{Prob}^2_{ij} = 1 - \frac{\mathrm{tc}_j + 1}{\mathrm{df}_j \cdot (\mathrm{tfn}_{ij} + 1)} \quad \text{with } \mathrm{tfn}_{ij} = \mathrm{tf}_{ij} \cdot \log_2\left[1 + \frac{c \cdot \mathrm{mean\ dl}}{l_i}\right] \tag{6} \]

where $n$, $\mathrm{tc}_j$ and $\mathrm{tfn}_{ij}$ were defined previously, and $\mathrm{df}_j$ indicates the number of documents in which the term $t_j$ occurs.
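The three DFR variants described above can be sketched in a few lines. This is an illustrative transcription of the formulas as stated in this section, not our production code; the function names and the toy statistics in the usage lines are hypothetical.

```python
import math


def tfn(tf, doc_len, mean_dl, c):
    """Normalized term frequency: tf * log2(1 + c * mean_dl / doc_len)."""
    return tf * math.log2(1.0 + (c * mean_dl) / doc_len)


def inf2_b(tf_n, tc, df):
    """Second information measure (1 - Prob2) shared by PB2 and I(ne)B2."""
    return (tc + 1.0) / (df * (tf_n + 1.0))


def pb2(tf, tc, df, n, doc_len, mean_dl, c):
    """PB2: Poisson model for Prob1, computed in log space."""
    lam = tc / n
    # ln(tf!) is obtained through lgamma(tf + 1) to avoid overflow.
    log2_prob1 = (-lam + tf * math.log(lam) - math.lgamma(tf + 1.0)) / math.log(2.0)
    return -log2_prob1 * inf2_b(tfn(tf, doc_len, mean_dl, c), tc, df)


def gl2(tf, tc, n, doc_len, mean_dl, c):
    """GL2: geometric model for Prob1, with Prob2 = tfn / (tfn + 1)."""
    lam = tc / n
    tf_n = tfn(tf, doc_len, mean_dl, c)
    log2_prob1 = math.log2(1.0 / (1.0 + lam)) + tf_n * math.log2(lam / (1.0 + lam))
    return -log2_prob1 * (1.0 / (tf_n + 1.0))


def ineb2(tf, tc, df, n, doc_len, mean_dl, c):
    """I(ne)B2: Inf1 based on the expected number of documents n_e."""
    tf_n = tfn(tf, doc_len, mean_dl, c)
    n_e = n * (1.0 - ((n - 1.0) / n) ** tc)
    inf1 = tf_n * math.log2((n + 1.0) / (n_e + 0.5))
    return inf1 * inf2_b(tf_n, tc, df)


# Toy statistics (hypothetical): a term occurring once in an average-length record.
stats = dict(tf=1, tc=1000, n=1000, doc_len=10, mean_dl=10, c=1.0)
print(pb2(df=500, **stats), gl2(**stats), ineb2(df=500, **stats))
```

Working with logarithms of the probabilities keeps the Poisson and geometric terms numerically stable for large term frequencies.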

Finally, we also considered an approach based on a statistical language model (LM) (Hiemstra, 2000; 2002) , known as a non-parametric probabilistic model (the Okapi and DFR are viewed as parametric models). Probability estimates would thus not be based on any known distribution (e.g., as in Equation 1 or 3), but rather be directly estimated based on the term occurrence frequencies in document Di or corpus C. Within this language model paradigm, various implementations and smoothing methods might be considered, although in this study we adopted a model proposed by Hiemstra (2002) , as described in Equation 7, combining an estimate based on document (P[tj | Di]) and on corpus (P[tj | C]) corresponding to the Jelinek-Mercer smoothing approach.

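\[ P[D_i \mid Q] = P[D_i] \cdot \prod_{t_j \in Q} \left[\lambda_j \cdot P[t_j \mid D_i] + (1 - \lambda_j) \cdot P[t_j \mid C]\right] \tag{7} \]

with $P[t_j \mid D_i] = \mathrm{tf}_{ij} / l_i$ and $P[t_j \mid C] = \mathrm{df}_j / lc$ with $lc = \sum_k \mathrm{df}_k$, where $\lambda_j$ is a smoothing factor (constant for all indexing terms $t_j$, and usually fixed at 0.35) and $lc$ an estimate of the size of the corpus C.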
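Equation 7 can be evaluated in log space as sketched below. The uniform document prior (`log_prior=0.0`), the function name and the toy counts are assumptions made for this illustration.

```python
import math


def lm_log_score(query, doc_tf, doc_len, df, lc, lam=0.35, log_prior=0.0):
    """Log of Equation 7 with Jelinek-Mercer smoothing.

    query:   list of query terms
    doc_tf:  term -> frequency in the document
    doc_len: number of indexing terms in the document (l_i)
    df:      term -> document frequency
    lc:      sum of all document frequencies (estimate of corpus size)
    """
    score = log_prior  # a constant prior P[D_i] is assumed here
    for term in query:
        p_doc = doc_tf.get(term, 0) / doc_len
        p_corpus = df.get(term, 0) / lc
        mixed = lam * p_doc + (1.0 - lam) * p_corpus
        if mixed == 0.0:
            return float("-inf")  # term unseen in both document and corpus
        score += math.log(mixed)
    return score


# Hypothetical counts: 0.35 * (2/3) + 0.65 * (1/3) = 0.45
score = lm_log_score(["a"], {"a": 2, "b": 1}, 3, {"a": 1, "b": 2}, 3)
print(math.exp(score))
```

Summing logarithms instead of multiplying probabilities avoids underflow for long queries and leaves the ranking unchanged.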

3.3 Overall Evaluation

To measure retrieval performance, we adopted MAP values computed on the basis of 1,000 retrieved items per request as calculated with the TREC_EVAL program. Using this evaluation tool, some evaluation differences may occur in the values computed according to the official measure (the latter always takes 50 queries into account while in our presentation we do not account for queries having no relevant items). In the following tables, the best performance under the given conditions (with the same indexing scheme and the same collection) is listed in bold type.
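The difference between the two ways of averaging mentioned above can be made explicit with a short sketch. This is not TREC_EVAL itself; the function names, the `count_empty` switch and the toy runs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision of one ranked list, using the first `cutoff` items."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels, count_empty=False):
    """MAP over queries.

    count_empty=False skips queries without relevant items (our presentation);
    count_empty=True counts them with AP = 0, so every query enters the mean.
    """
    values = []
    for query_id, ranked in runs.items():
        relevant = qrels.get(query_id, set())
        if relevant:
            values.append(average_precision(ranked, relevant))
        elif count_empty:
            values.append(0.0)
    return sum(values) / len(values) if values else 0.0


# Hypothetical two-query example; q2 has no relevant item.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d9"]}
qrels = {"q1": {"d1", "d3"}, "q2": set()}
print(mean_average_precision(runs, qrels))
print(mean_average_precision(runs, qrels, count_empty=True))
```

With a query lacking relevant items, the second figure is necessarily lower, which explains small gaps between the values we report and the official ones.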

In the last lines we report the average MAP over these 5 IR models, together with the percentage change relative to the short (T) query formulation and, in the last line, relative to the S-stemmer. As these lines show, increasing the query size improves the MAP (by around +12.4% to +14.7%). According to the average performance, the best indexing approach seems to be Porter's stemmer. In this case, the MAP with the TD query formulation was 0.3559 on average, versus 0.3416 for the S-stemmer, a relative difference of 4.2%.

[Table 3: MAP of various IR models and stemmers (English collection). Only the labels survive: columns by query formulation and stemmer; rows Okapi, DFR PB2, DFR GL2, DFR I(ne)B2, LM (λ=0.35), tf.idf, average over the 5 best IR models, % change over T, % change over S-stemmer.]

In Table 4 we report the MAP achieved by probabilistic models on the German collection with two query formulations (T or TD), comparing the performance with and without our automatic decompounding approach. The best IR model seems to be the DFR PB2 (without decompounding) or the LM model (when applying our decompounding scheme). By adding terms to the topic descriptions, we were also able to improve retrieval performance (by between 17.4% and 31.0%). Comparing the average performances shows that applying an automatic decompounding approach improves retrieval effectiveness (see the last line of Table 4, with an average improvement of 46.8% for short query formulations, or +31.5% when considering TD queries).

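[Table 4: MAP of various IR models with and without decompounding (German collection). Only the labels survive: columns by query formulation and decompounding; rows Okapi, DFR PB2, DFR GL2, DFR I(ne)B2, LM (λ=0.35), tf.idf, average, % change over T, % change.]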

An analysis showed that pseudo-relevance feedback (PRF, also called blind query expansion) seemed to be a useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach (denoted “Roc” in the following tables) (Buckley et al., 1996) with α = 0.75, β = 0.75, whereby the system was allowed to add to the original query m terms extracted from the k best-ranked documents. From our previous experiments we learned that this type of blind query expansion strategy does not always work well. More particularly, we believe that including terms occurring frequently in the corpus (because they also appear in the top-ranked documents) may introduce more noise, and thus be an ineffective means of discriminating between relevant and non-relevant items (Peat & Willett, 1991). Consequently we also chose to apply our idf-based query expansion model (denoted “idf” in the following tables) (Abdou & Savoy, 2008).
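A minimal sketch of Rocchio-style blind query expansion with α = β = 0.75 is given below. It only uses positive feedback from the k top-ranked documents; the way candidate terms are ranked (by their mean weight in those documents), the function name and the toy weights are assumptions, not a description of our system.

```python
from collections import Counter


def rocchio_expand(query_weights, top_docs, m, alpha=0.75, beta=0.75):
    """Expand a query with m new terms taken from the k best-ranked documents.

    query_weights: term -> weight in the original query
    top_docs:      list of k dicts (term -> weight), one per top-ranked document
    """
    k = len(top_docs)
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / k

    # Candidate expansion terms: highest mean weight first, ties by term.
    candidates = sorted(
        (term for term in centroid if term not in query_weights),
        key=lambda term: (-centroid[term], term),
    )[:m]

    expanded = {
        term: alpha * weight + beta * centroid.get(term, 0.0)
        for term, weight in query_weights.items()
    }
    for term in candidates:
        expanded[term] = beta * centroid[term]
    return expanded


# Hypothetical weights for a one-term query and k = 2 documents.
docs = [{"celtic": 1.0, "art": 2.0}, {"art": 1.0, "ireland": 1.0}]
print(rocchio_expand({"celtic": 1.0}, docs, m=1))
```

An idf-based variant would change only the ranking key of the candidate terms, favouring terms that are rare in the corpus over terms that are merely frequent in the top-ranked documents.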

To evaluate these propositions, we applied certain probabilistic models and enlarged the query by adding the 20 to 150 terms retrieved from the 3 to 10 best-ranked articles contained in the English collection (Table 5), and both the French and German corpora (Table 6).

Table 5: MAP using blind-query expansion (English collection)

[Only the column labels survive: Query TD; PRF; IR Model / MAP; k doc. / m terms.]

It is usually assumed that combining different search models may improve retrieval effectiveness (Vogt & Cottrell, 1999), for three main reasons. First, there is a skimming process in which only the k top-ranked retrieved items from each ranked list are considered. In this case, we would combine the best answers obtained from various document representations (which would retrieve various pertinent items). Second, we would count on the chorus effect, by which different retrieval schemes retrieve the same item, and as such provide stronger evidence that the corresponding document is indeed relevant. Third, an opposite or dark horse effect may also play a role, whereby a given retrieval model may provide unusually high (low) and accurate estimates regarding a document's relevance. Thus, a combined system could possibly return more pertinent items by accounting for documents having a relatively high (low) score, or when a relatively short (long) result list occurs. Such a data fusion approach however requires more storage space and processing time. Given this trade-off between advantages and drawbacks, it is unclear whether such approaches might be of any real commercial interest.

In this current study we combined three probabilistic models representing both the parametric (Okapi and DFR) and non-parametric (language model or LM) approaches. To produce such a combination we evaluated various fusion operators (see Table 7 for a detailed list of their descriptions). The “Sum RSV” operator for example indicates that the combined document score (or the final retrieval status value) is simply the sum of the retrieval status value (RSVk) of the corresponding document Dk computed by each single indexing scheme (Fox & Shaw, 1994) . Table 7 thus illustrates how both the “Norm Max” and “Norm RSV” apply a normalization procedure when combining document scores. When combining the retrieval status value (RSVk) for various indexing schemes and in order to favor certain more efficient retrieval schemes, we could multiply the document score by a constant αi (usually equal to 1), reflecting the differences in retrieval performance.

Table 7: Data fusion combination operators

Sum RSV      SUM (αi . RSVk)
Norm Max     SUM (αi . (RSVk / Maxi))
Norm RSV     SUM [αi . ((RSVk - Mini) / (Maxi - Mini))]
Z-Score      αi . [((RSVk - Meani) / Stdevi) + δi]  with δi = [(Meani - Mini) / Stdevi]

In addition to using these data fusion operators, we also considered the round-robin approach, wherein we took one document in turn from each individual list and removed any duplicates, retaining only the highest ranking occurrence. Finally we suggested merging the retrieved documents according to the Z-Score, computed for each result list. More details can be found in Savoy & Berger (2005) . In Table 7, Mini (Maxi) lists the minimal (maximal) RSV value in the ith result list. Of course, we might also weight the relative contribution of each retrieval scheme by assigning a different αi value to each retrieval model (fixed to 1 in all our experiments).
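The fusion operators of Table 7 and the round-robin strategy can be sketched as follows. Several points are assumptions of this illustration: the Z-Score values are summed across result lists, the population standard deviation is used, each result list contains at least two distinct scores, and ties are broken by document identifier.

```python
import statistics


def normalise(scores, method):
    """Apply one of the Table 7 normalizations to a result list (doc -> RSV)."""
    values = list(scores.values())
    low, high = min(values), max(values)
    if method == "sum_rsv":
        return dict(scores)
    if method == "norm_max":
        return {doc: rsv / high for doc, rsv in scores.items()}
    if method == "norm_rsv":
        return {doc: (rsv - low) / (high - low) for doc, rsv in scores.items()}
    if method == "z_score":
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)  # population stdev (assumption)
        delta = (mean - low) / stdev
        return {doc: (rsv - mean) / stdev + delta for doc, rsv in scores.items()}
    raise ValueError(f"unknown fusion operator: {method}")


def fuse(result_lists, method, alphas=None):
    """Combine several result lists and return documents by decreasing score."""
    alphas = alphas or [1.0] * len(result_lists)
    combined = {}
    for alpha, scores in zip(alphas, result_lists):
        for doc, value in normalise(scores, method).items():
            combined[doc] = combined.get(doc, 0.0) + alpha * value
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


def round_robin(ranked_lists):
    """Take one document in turn from each list, skipping duplicates."""
    merged, seen = [], set()
    for position in range(max(len(ranked) for ranked in ranked_lists)):
        for ranked in ranked_lists:
            if position < len(ranked) and ranked[position] not in seen:
                seen.add(ranked[position])
                merged.append(ranked[position])
    return merged


# Hypothetical result lists from two retrieval models.
okapi = {"d1": 3.0, "d2": 1.0}
lm = {"d2": 4.0, "d3": 2.0}
print(fuse([okapi, lm], "sum_rsv"))
print(fuse([okapi, lm], "z_score"))
print(round_robin([["d1", "d2"], ["d2", "d3"]]))
```

The normalized operators matter when the models produce scores on different scales (e.g., Okapi sums versus LM log-probabilities), since a plain sum would then be dominated by one model.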

[Table of data fusion results and official runs; the cell layout is not recoverable. Column labels: Language / Query; Model (Okapi, DFR GL2, DFR I(ne)B2) with PRF settings (docs / terms); Official run name. Row labels: Round-robin, Sum RSV, Norm Max, Norm RSV, Z-Score. First block: English, TD, 50 queries. The PRF settings survive only as three unaligned lists:]

Roc 5 docs / 10 terms; idf 10 docs / 20 terms; idf 10 docs / 50 terms; Roc 10 docs / 10 terms; idf 10 docs / 20 terms; idf 10 docs / 50 terms; Roc 10 docs / 10 terms; Roc 5 docs / 50 terms; idf 5 docs / 50 terms

Roc 5 docs / 10 terms; idf 10 docs / 20 terms; idf 10 docs / 50 terms; Roc 10 docs / 10 terms; idf 5 docs / 10 terms; Roc 5 docs / 20 terms; Roc 5 docs / 50 terms

Roc 5 docs / 20 terms; idf 5 docs / 50 terms; idf 5 docs / 50 terms; idf 5 docs / 10 terms; idf 5 docs / 10 terms; Roc 5 docs / 20 terms; Roc 5 docs / 50 terms

3.5 Official Results

The Persian (or Farsi) language is a member of the Indo-European family with relatively few morphological variations. This year we used a corpus extracted from the newspaper Hamshahri, made available through the efforts of the University of Tehran (http://ece.ut.ac.ir/dbrg/hamshahri/). As usual in various evaluation campaigns, the corpus contains news articles (611 MB, for the years 1996 to 2002). This corpus contains exactly 166,774 documents on a variety of subjects (politics, literature, art, economy, etc.) and includes about 448,100 different words. Hamshahri articles vary between 1 KB and 140 KB in size, comprising on average about 202 tokens (or 127 if we only count the number of word types). The corpus was coded in UTF-8 and written using the 28 Arabic letters plus an additional 4 letters for the Persian language.

For the Persian language we first built a stopword list containing 884 terms. Unlike most other lists, this one contains the words occurring most frequently in the collection (determiners, prepositions, conjunctions, pronouns or some auxiliary verb forms), plus a large number of suffixes already separated from word stems in the collection (see examples given below).

As a stemming strategy, we can use a morphological analysis (Miangah, 2006) or our simple, fast and light stemming approach that attempts to remove only noun and adjective inflections. In the Persian language, the general pattern for inflectional suffixes is as follows (read from right to left, as Persian is written): <possessive> <plural> <other-suffix> <stem>. In our light stemming strategy, we usually removed possessive, plural and some of the suffixes marked as others. The following examples of our light stemmer illustrate the relatively simple Persian morphology. From the plural form درختان (“trees”), we can obtain درخت (“tree”). For the possessive form دستم (“my hand”), our stemmer will return دست (“hand”), and for the form ايرانيان (“Iranians”) we obtain ايران (“Iran”). In this corpus we saw that in some circumstances the suffixes might be written together with or separated from the word, as in ڪشتيها and ڪشتي ها (“boats”), or منزلها and منزل ها (“houses”). The adjectives are usually indeclinable whether used attributively or as a predicate. When used as substantives, adjectives take the normal plural endings, while comparative and superlative forms use the endings تر and ترين.

The Persian language uses few case markers (the accusative case and certain specific genitive cases), unlike the Latin, German or Hungarian languages. The accusative of a definite noun is marked by a following را, which can be joined to the noun or written separately (e.g., مرد را for the noun “man”). The genitive case is expressed by coupling two nouns by means of the particle known as ezafe (e.g., پسرِ مرد “man’s son”). As is usually done in the English language, other relations are expressed by means of prepositions (e.g., in, with, etc.). Both the stopword list and our light stemmer are freely available at http://www.unine.ch/info/clef/.
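The light stemming idea described above can be illustrated with a short suffix-stripping sketch. It is not the stemmer distributed on our Web site: the four suffixes, the single-pass strategy and the minimum stem length are assumptions chosen only to reproduce the examples given in this section.

```python
# Illustrative suffixes only (assumption): "-yan" and "-an" plurals,
# the "-ha" plural, and the first person singular possessive.
SUFFIXES = ("يان", "ها", "ان", "م")


def light_stem(word, min_stem_len=3):
    """Remove at most one suffix, keeping a stem of at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for form in ("درختان", "دستم", "ايرانيان"):
    print(form, "->", light_stem(form))
```

Such a purely orthographic rule set over-stems words whose final letters merely look like a suffix (with these settings the word ايران itself would be shortened), which is why a usable stemmer needs a richer rule set and exception handling.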

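[Table 11: MAP of various IR models and stemmers (Persian collection). Only the labels survive: columns by query formulation and stemmer; rows Okapi, DFR PL2, DFR I(ne)C2, LM (λ=0.35), tf.idf, average (4 IR models), % change over T, % change over "none".]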

5 Robust Retrieval

In the robust task (Voorhees, 2006) , we were interested in learning why retrieving relevant items for a given topic could be hard, even if the query contains certain common terms found in the relevant documents. In order to evaluate various search techniques, we used a corpus created during recent CLEF evaluation campaigns. This collection consists of articles published in 1994 in the newspaper Los Angeles Times, as well as articles extracted from the Glasgow Herald and published in 1995. This collection contains a total of 169,477 documents (or about 579 MB of data). On average each article contains about 250 (median: 191) content-bearing terms (not counting commonly occurring words such as “the,” “of” or “in”). Typically, documents in this collection are represented by a short title plus one to four paragraphs of text, and both American and British English spellings can be found in the corpus. To compile the test set, we used the topics created during the CLEF 2003 campaign (Topics #141 - #200) as well as queries from the 2005 (Topics #251 - #300) and 2006 (Topics #301 - #350) evaluation campaign. In this test set we found 153 queries able to return at least one relevant item from the collection.

This year we were interested in verifying whether word-sense disambiguation (WSD) might improve retrieval effectiveness. For this reason the organizers provided us with a new version of both the document and topic descriptions containing the correct lemma (entry in the dictionary) and the SYNSET number(s) of the corresponding entry in the WordNet thesaurus (version 1.6). Table 13 lists an example for the title of Topic #47. Under the attribute LEMA the corresponding English dictionary entry is shown (a stemming procedure is therefore no longer needed), and under the tag SYNSET we can find both the score and the SYNSET number. The surface form is indicated under the label <WF> and the Part-of-Speech (POS) tag is also available for each word.

Various possibilities have been put forward to explain why certain successful IR systems may fail for some queries (Buckley, 2004; Savoy, 2007). The organizers thought that polysemy (already known to be a problem in finding pertinent matches between query and document surrogates) could be partially resolved by using the SYNSET information.

Based on past experiments (Dolamic & Savoy, 2008) with this corpus, using TD queries and Porter's stemmer (Porter, 1980), we achieved MAP values ranging from 0.2216 with the tf.idf model to 0.4070 with the Okapi model (Robertson et al., 2000). With the latter model, the set of hardest topics (defined as queries listing no relevant items in the top 20) comprised seven topics, namely Topic #153 (“Olympic Games and Peace”), Topic #301 (“Nestlé Brands”), Topic #320 (“Energy Crises”), Topic #188 (“German Spelling Reform”), Topic #258 (“Brain-Drain Impact”), Topic #309 (“Hard Drugs”), and Topic #322 (“Atomic Energy”).

[Table 14: description and MAP of our official robust runs. Only the labels survive: Query; Index (WSD, POS); Model (I(ne)C2, Okapi, LM); Query expansion; Single MAP; Comb MAP.]

In the current experiments, we generated six different runs using word-sense disambiguation information. As shown in Table 14, we followed our combination strategy, taking into account the various probabilistic models using different blind query expansion approaches. Our best results were achieved in the UniNERobust4 run, with a MAP of 0.4515. Moreover, if we compare runs with or without word sense disambiguation (WSD) information (lemma, POS tags and SYNSET), we see no substantial differences (e.g., UniNERobust1 vs. UniNERobust2, and UniNERobust4 vs. UniNERobust3).

For our official runs (Table 14), we defined a hard topic as one for which the query resulted in a low average precision. Using this definition, Table 16 lists the 10 topics having the lowest average precision. Considering all six runs, we obtain Topic #153 (“Olympic Games and Peace”), followed by Topic #343 (“South African National Party”), Topic #313 (“Centenary Celebrations”), Topic #320 (“Energy Crises”), and Topic #286 (“Football Injuries”). In an attempt to explain why a topic was difficult, we might mention that for Topics #343 and #153 only one relevant document was retrieved. Based on our best run (UniNERobust4), this item was ranked low in the retrieved list (44th for Topic #343, and 382nd for Topic #153) even though it contained a large number of search terms.

[Table 16: the 10 hardest topics for each official run; the cell values are not recoverable.]
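Both definitions of a hard topic used in this section (no relevant item in the top 20, or lowest average precision over the runs) are easy to operationalize, as the following sketch shows. The function names, run names and all values in the usage lines are hypothetical.

```python
def hard_topics_top_k(runs, qrels, k=20):
    """Topics for which no relevant item appears among the first k results.

    runs:  topic -> ranked list of document ids
    qrels: topic -> set of relevant document ids
    """
    return sorted(
        topic
        for topic, ranked in runs.items()
        if qrels.get(topic) and not (set(ranked[:k]) & qrels[topic])
    )


def lowest_mean_ap(ap_by_run, n=10):
    """The n topics with the lowest average precision, averaged over all runs.

    ap_by_run: run name -> {topic -> average precision}
    """
    topics = set.intersection(*(set(aps) for aps in ap_by_run.values()))
    mean_ap = {
        topic: sum(aps[topic] for aps in ap_by_run.values()) / len(ap_by_run)
        for topic in topics
    }
    return sorted(mean_ap, key=lambda topic: (mean_ap[topic], topic))[:n]


# Hypothetical data for two topics and two runs.
runs = {"153": ["a", "b"], "301": ["c", "d"]}
qrels = {"153": {"z"}, "301": {"d"}}
print(hard_topics_top_k(runs, qrels))

ap_by_run = {
    "runA": {"153": 0.00, "343": 0.10, "200": 0.90},
    "runB": {"153": 0.02, "343": 0.05, "200": 0.70},
}
print(lowest_mean_ap(ap_by_run, n=2))
```

The two definitions need not agree: a topic with a single relevant item ranked 21st is hard under the first definition but may not be among the lowest average precisions.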

6 Conclusion

In this ninth CLEF campaign we evaluated various probabilistic IR models using two different test collections, the first composed of short bibliographic notices extracted from the TEL corpora (written in the English, German and French languages), and the second of newspaper articles written in the Persian language. For the latter we also suggested a stopword list and a light stemming strategy.

The results of our various experiments demonstrate that the I(ne)B2 or PB2 models (or I(ne)C2 for the Persian language) derived from the Divergence from Randomness (DFR) paradigm and the LM model seem to provide the best overall retrieval performances (see Tables 3, 4 and 11). The Okapi model used in our experiments usually results in retrieval performances inferior to those obtained with the DFR or LM approaches.

For the Persian language (Tables 11 and 12), our light stemmer tends to produce better MAP than does the 4gram indexing scheme (relative difference of 5.5%). On the other hand, the performance difference with an approach ignoring a stemming stage is rather small.

Using the TEL corpora, the pseudo-relevance feedback (Rocchio’s model) tends to hurt the retrieval effectiveness (see Tables 5 or 6). A data fusion strategy may enhance the retrieval performance for the French and German (Table 8) or Persian languages (Table 12), but not with the English corpus.

In the robust track, using the blind query expansion and data fusion approaches (combining three different probabilistic models), we are able to improve the MAP from 0.4086 (Okapi) to 0.4515. However, if we define hard topics as queries for which we cannot find any relevant items listed in the top-20, then these two runs produce the same number of hard topics (7 over 153). Finally the performance differences with and without word sense disambiguation (WSD) information are rather small.

Acknowledgments

The authors would also like to thank the CLEF-2008 task organizers for their efforts in developing various European language test collections. This research was supported in part by the Swiss National Science Foundation under Grant #200021-113273.

Appendix 1: Parameter Settings

[Table of parameter settings; only the row labels survive: English TEL, French TEL, German TEL, Persian word, Persian 4-gram, English Robust.]

Appendix 2: Topic Titles

TEL topics

C451 Roman Military in Britain
C452 Celtic Art
C453 Bombing of Japanese Cities
C454 The Inquisition in Italy
C455 Irish Emigration to North America
C456 Women's Vote in the USA
C457 Big Game Hunting in Africa
C458 The Wives of Henry VIII
C459 Gardening for Children
C460 Scary Movies
C461 Ancient Greek Coins
C462 Israeli Secret Service
C463 Churches in France
C464 Piano Lessons
C465 Trade Unions
C466 Gay Fiction
C467 Formula One Drivers
C468 Modern Japanese Culture
C469 Scottish Music
C470 Car Industry in Europe
C471 Watchmaking
C472 Man in Space
C473 British Women Authors
C474 Journeys to Antarctica
C475 Eastern philosophy
C476 Contrastive Analysis of Electoral Systems
C477 Web Advertising
C478 Multilingual Upbringing
C479 Food Allergies
C480 Pilgrimage to Santiago de Compostela
C481 Famous Jazz Musicians
C482 Vegetarianism
C483 Solar Energy
C484 Soap-making
C485 Counterfeiting Money
C486 Pictures of Vintage Cars
C487 Jousting in the Middle Ages
C488 African Americans and the American Civil War
C489 Graphics Programming
C490 Bordeaux Wine Guides
C491 Salary Inequality between Sexes
C492 Homeopathic Cures for Children
C493 Recipes for Chocolate Desserts
C494 Youth Employment in Europe
C495 Women in the French Revolution
C496 Gods in Greek Mythology
C497 20th Century S. American Authors
C498 World War I Aviation
C499 Wonders of the Ancient World
C500 Gauguin and Tahiti

Persian topics

C551 Wimbledon tennis cup
C552 Tehran’s stock market
C553 2002 world cup
C554 Stress and Health
C555 Road casualty statistics
C556 Nuclear energy regulations
C557 Iran football coaches
C558 Danger of solid oil
C559 Best Fajr film
C560 Iran economic sanction
C561 Gardening handbooks
C562 Reconstruction of Kandovan tunnel
C563 Mad cow disease
C564 Sport blood pressure
C565 Drought losses
C566 Prevention detection kidney diseases
C567 Population growth control
C568 Cell phone expansion
C569 Cases of economic corruption
C570 Iran dam construction
C571 Global oil economy
C572 Shajarian Concert
C573 Gross amount film cinema
C574 Champion team Iran first league
C575 PersPolis Club establishment date
C576 Iran Khodro company
C577 Anti-Cancer Drugs
C578 Traffic Congestion in Tehran
C579 Tehran International book festival
C580 Iranian presidential election
C581 Plane crashes
C582 Water shortage in Tehran
C583 Earthquake damages
C584 Oil price changes
C585 Air pollution control
C586 European football champion league final
C587 Development of Iranian software industry
C588 Chemical attacks
C589 Iranian carpet export
C590 Merchandise smuggling
C591 Global warming
C592 Widely used narcotics in Iran
C593 Masouleh (Masooleh) Province
C594 Aircraft ticket prices
C595 World cup South Korea Japan
C596 Iraqi weapons of mass destruction
C597 Tehran murders
C598 Serial Killings
C599 2nd of Khordad election
C600 Inflation in Iran

Abdou, S., & Savoy, J. (2008). Searching in Medline: Stemming, query expansion, and manual indexing evaluation. Information Processing & Management, 44(2), p. 781-789.

Amati, G., & van Rijsbergen, C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), p. 357-389.

Braschler, M., & Ripplinger, B. (2004). How effective is stemming and decompounding for German text retrieval? IR Journal, 7, p. 291-316.

Buckley, C. (2004). Why current IR engines fail. Proceedings ACM-SIGIR'2004, The ACM Press, p. 584-585.

Buckley, C., Singhal, A., Mitra, M., & Salton, G. (1996). New retrieval approaches using SMART. In Proceedings of TREC-4, Gaithersburg: NIST Publication #500-236, p. 25-48.

Dolamic, L., & Savoy, J. (2008). Monolingual and Bilingual Searches: Evaluation, Challenges and Failure Analysis. Submitted.

Fox, E.A., & Shaw, J.A. (1994). Combination of multiple searches. In Proceedings TREC-2, Gaithersburg: NIST Publication #500-215, p. 243-249.

Harman, D.K. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1), p. 7-15.

Hiemstra, D. (2000). Using language models for information retrieval. CTIT Ph.D. Thesis.

Hiemstra, D. (2002). Term-specific smoothing for the language modeling approach to information retrieval. In Proceedings of the ACM-SIGIR, The ACM Press, p. 35-41.

McNamee, P., & Mayfield, J. (2004). Character n-gram tokenization for European language text retrieval. IR Journal, 7(1-2), p. 73-97.

Miangah, T.M. (2006). Automatic lemmatization of Persian words. Journal of Quantitative Linguistics, 13(1), p. 1-15.

Peat, H.J., & Willett, P. (1991). The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42(5), p. 378-383.

Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, p. 130-137.

Robertson, S.E., Walker, S., & Beaulieu, M. (2000). Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), p. 95-108.

Savoy, J. (2004). Combining multiple strategies for effective monolingual and cross-lingual retrieval. IR Journal, 7, p. 121-148.

Savoy, J., & Berger, P.-Y. (2005). Selection and merging strategies for multilingual information retrieval. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (Eds.): Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science: Vol. 3491. Springer, Heidelberg, p. 27-37.

Savoy, J. (2005). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management, 41(4), p. 873-890.

Savoy, J. (2006). Light stemming approaches for the French, Portuguese, German and Hungarian languages. Proceedings ACM-SAC, The ACM Press, p. 1031-1035.

Savoy, J. (2007). Why do successful search systems fail for some topics? Proceedings ACM-SAC, The ACM Press, p. 872-877.

Vogt, C.C., & Cottrell, G.W. (1999). Fusion via a linear combination of scores. IR Journal, 1(3), p. 151-173.

Voorhees, E.M. (2006). The TREC 2005 robust track. ACM SIGIR Forum, 40, p. 41-48.