
Report on CLEF-2005 Evaluation Campaign: Monolingual, Bilingual, and GIRT Information Retrieval

Jacques Savoy (Jacques.Savoy@unine.ch), Pierre-Yves Berger

University of Neuchatel, Switzerland

Keywords: Natural Language Processing with European Languages, Bilingual Information Retrieval, Digital Libraries, Hungarian Language, Bulgarian Language, Portuguese Language, French Language

For our fifth participation in the CLEF evaluation campaigns, the first objective was to propose an effective and general stopword list along with a light stemming procedure for the Hungarian, Bulgarian and Portuguese (Brazilian) languages. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in those languages. To do so we evaluated our scheme using two probabilistic models and nine vector-processing approaches. In the bilingual track, we evaluated both the machine translation and bilingual dictionary approaches to automatically translate a query submitted in English into various target languages. This year we explored new freely available translation sources, together with a combined query translation approach in order to obtain a better translation of the user's information need. Finally, using the GIRT corpora (available in English, German and Russian), we investigated variations in retrieval effectiveness when including or excluding manually assigned keywords attached to bibliographic records (mainly comprising a title and an abstract).

1 Introduction

Since 2001 our research group has been investigating effective information retrieval (IR) techniques when handling a variety of natural languages (Savoy 2004a; 2005a) in order to improve both monolingual and bilingual searches. Continuing along this same stream, our participation in the CLEF 2005 evaluation campaign will target various objectives. First, our aim is to propose linguistic tools for less frequently spoken languages such as Bulgarian and Hungarian, to explore the underlying IR problems with closely related languages such as Portuguese and Brazilian, and to explore new alternatives when translating a query from one source language (English in this study) to other target languages (more precisely the French, Portuguese, Bulgarian and Hungarian languages). The domain-specific GIRT corpus presents other interesting features, namely questions related to digital libraries with a collection comprising a large number of bibliographic records.

In addition to these particular objectives, various interesting problems must be analyzed and resolved. Not all languages are written with the same alphabet; Bulgarian, for example, uses the Cyrillic alphabet. The presence of diacritics in other languages also raises certain questions that directly affect the effectiveness of IR systems. Can we simply ignore them? Do they have a real impact on mean average precision? Does the distinction between uppercase and lowercase letters really influence information retrieval systems or does this distinction need only be preserved when high search precision is required?

In our work we have assumed that the semantic content of documents (or requests) is mainly linked to nouns and adjectives, and thus an effective search system can be based on the use of an appropriate set of weighted keywords extracted from corresponding documents (or requests). Based on this assumption, we designed a set of stopword lists and light stemming procedures for certain European and Asian languages. These linguistic tools were designed to automatically remove the inflectional suffixes attached to nouns and adjectives and linked to gender (masculine, feminine, neuter), to number (singular or plural), and to case (nominative, dative, ablative, etc.). Needless to say, we were also interested in other linguistic phenomena, such as compound constructions (does an effective IR system really need to decompound them, and is this linguistic phenomenon really important for the retrieval of languages other than German?).

The rest of this paper is organized as follows. Section 2 describes the main characteristics of the CLEF 2005 test-collection and Section 3 outlines the main aspects of our stopword lists and light stemming procedures. Section 4 analyses the principal features of different indexing and search strategies, and evaluates their use with the four corpora. The data fusion approaches adopted in our experiments are explained in Section 5, and Section 6 depicts our official results. Our bilingual experiments are presented and evaluated in Section 7 while Section 8 describes our experiments involving the domain-specific GIRT corpus.

2 Overview of the Test-Collections

The corpora used in our experiments include newspaper and news agency articles, namely Le Monde (1994-1995, French), SDA (1994-1995, French), Público (1994-1995, Portuguese), Folha (1994-1995, Brazilian), Magyar Hirlap (2002, Hungarian), Sega (2002, Bulgarian) and Standart (2002, Bulgarian). As shown in Table 1, the Portuguese corpus (212.9 indexing terms / document) has a larger mean article size than the French collection (178). This mean value is relatively similar for the Bulgarian (133.7) and Hungarian (142.1) languages. It is interesting to note that even though the Hungarian collection is the smallest (105 MB), it contains the largest number of distinct indexing terms (657,132), computed after stemming.

                                     French         Portuguese     Bulgarian
Size (in MB)                         487 MB         564 MB         213 MB
# of documents                       177,452        210,734        69,195
# of distinct terms                  455,366        582,117        414,253
Number of distinct indexing terms / document
  Mean                               127.8          153.5
  Standard deviation                 106.57         114.95
  Median                             92             129
  Maximum                            2,645          2,655
  Minimum                            1              1
Number of indexing terms / document
  Mean                               178            212.9
  Standard deviation                 159.87         186.4
  Median                             126            171
  Maximum                            6,720          7,554
  Minimum                            1              1
Number of queries                    50             50
Number rel. items                    2,537          2,904
Mean rel./ request                   50.74          58.08
Standard deviation                   45.349         50.415
Median                               35.5           44
Maximum                              185 (Q#253)    239 (Q#286)
Minimum                              1 (Q#255)      2 (Q#258)

During the indexing process in our automatic runs, we retained only the following logical sections from the original documents: <TITLE>, <TEXT>, <LEAD>, <LEAD1>, <TX>, <LD>, <TI> and <ST>. Under this restriction we found that 1,854 documents in the Bulgarian collection had no indexable content (for example, they may correspond to articles containing only a picture with the tags <PICTURE>, <IMGTEXT> and <IMGAUTHOR>). From the topic descriptions we automatically removed certain phrases such as “Relevant document report …”, “Finde Dokumente, die über …”, “Keressünk olyan cikkeket, amelyek …” or “Trouver des documents qui …”, etc. As shown in the Appendix, the available topics cover various subjects (e.g., “Anti-Smoking Legislation”, “Football Refereeing Disputes”, or “Lottery Winnings”), including both regional (“Swiss Referendums”) and international coverage (“Anti-abortion Movements”).

3 Stopword Lists and Stemming Procedures

In order to define general stopword lists, we first created a list of the top 200 most frequent words found in the various languages, from which some words were removed (e.g., police, minister, president, Magyar). To this list of very frequent words, we added articles, pronouns, prepositions, conjunctions and very frequently occurring verb forms (e.g., to be, is, has, etc.). Based on this scheme, we created a new list for the Bulgarian and Hungarian languages (these lists are available at www.unine.ch/info/clef/). Our final stopword list contained 463 words for the French language, 761 for Hungarian, 418 for Bulgarian and 400 for Portuguese-Brazilian. We added 8 Brazilian words to our Portuguese stopword list; these eight words are usually variants with or without accents, such as “vezes” in Portuguese and “vêzes” in Brazilian.
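The list-building procedure described above can be illustrated with a minimal sketch. The function names, the tokenization rule and the toy documents are illustrative assumptions, not the tooling actually used to build the published lists.

```python
import re
from collections import Counter


def frequent_word_candidates(texts, top_n=200):
    """Return the top_n most frequent lowercase word forms found in texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    # Sort by decreasing frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def build_stopword_list(candidates, content_words, function_words):
    """Drop content-bearing words from the candidates, then add function words."""
    kept = [word for word in candidates if word not in content_words]
    return sorted(set(kept) | set(function_words))


# Toy data (assumed): two tiny "documents" and hand-picked word sets.
docs = ["the minister said the police is here", "the president has the report"]
candidates = frequent_word_candidates(docs, top_n=3)
stopwords = build_stopword_list(
    candidates,
    content_words={"minister", "police", "president"},
    function_words={"is", "has"},
)
```

In practice the removal of content-bearing words (such as "police" or "minister") is a manual decision, which is why the sketch takes them as an explicit argument.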

Once high-frequency words were removed, our indexing procedure generally applied a stemming algorithm in an attempt to conflate word variants into the same stem or root. In developing such a procedure, we first wanted to remove only inflectional suffixes such as singular and plural word forms, and also feminine and masculine forms so that they would conflate to the same root.

Bulgarian involved additional morphological difficulties, given that in this language the definite article is usually represented by a suffix. For example, “mope” (sea) becomes “mopeto” (the sea) while “mopeta” (seas) becomes “mopetata” (the seas). The general noun pattern is as follows: <stem> <plural> <article>. Contrary to other Slavic languages (such as Russian), Bulgarian does not indicate grammatical cases by adding a suffix.
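As a rough illustration of the <stem> <plural> <article> pattern, the following sketch strips an article suffix and then a plural suffix. The suffix inventories are hypothetical and taken only from the transliterated examples above; the minimum stem length is an assumed guard, and the actual stemmer distributed at www.unine.ch/info/clef/ may work differently.

```python
# Hypothetical suffix inventories, derived only from the examples in the text.
ARTICLE_SUFFIXES = ("to", "ta")
PLURAL_SUFFIXES = ("ta",)
MIN_STEM_LENGTH = 3  # assumed guard against over-stemming short words


def strip_one(word, suffixes):
    """Remove at most one suffix from word, trying the longest suffix first."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def light_stem(word):
    """Strip the definite article first, then the plural marker."""
    without_article = strip_one(word, ARTICLE_SUFFIXES)
    return strip_one(without_article, PLURAL_SUFFIXES)


stems = {w: light_stem(w) for w in ["mope", "mopeto", "mopeta", "mopetata"]}
```

With these toy rules all four forms conflate to the same stem. The sketch cannot tell a plural ending from an identical article ending, which a real stemmer would need to handle with more care.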

The Hungarian language shares certain similarities with the Finnish language (although both languages do not belong strictly to the same family, they can be viewed as cousins). Like Finnish, Hungarian has a large number of grammatical cases (usually 18) and each case has its own unambiguous form. For example, the noun “house” (“ház”) may appear as “házat” (accusative case, as in “(I see) the house”), “házakat” (accusative plural case, as in “(I see) the houses”), “házamat” (“… my house”) or “házamait” (“… my houses”). In this language, the general construction used for nouns is as follows: <stem> <plural> <possessive marker> <case>. For example, “házamat” decomposes as <ház> a <m> a <t>, in which the letter “a” is introduced to facilitate pronunciation (“házmt” would be difficult to pronounce). From the IR point of view, certain linguistic aspects of Hungarian can be viewed as good news. For example, no gender distinction is attached to nouns (as in English) and adjectives are invariable, as in “… a szép házat” (“a beautiful house”) or “… a szép házamat” (“my beautiful house”). Our suggested stemming procedures for these languages can be found at www.unine.ch/info/clef/.

Diacritic characters are usually not present in English collections (with certain exceptions, such as “résumé” or “cliché”). For the Hungarian and Portuguese languages, these characters were replaced by their corresponding non-accented letters. Removing accents may however generate some semantic ambiguity (e.g., between “kor” (“age”) and “kór” (“illness”), or “ver” (“hurt”) and “vér” (“blood”) in the Hungarian language).
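Replacing accented characters by their unaccented counterparts can be sketched with Unicode decomposition. This is one common way to do it, shown here as an assumption; the text does not say how the replacement was implemented.

```python
import unicodedata


def remove_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The ambiguity mentioned above: both members of each pair collapse to one form.
pairs = [("kor", "kór"), ("ver", "vér"), ("vezes", "vêzes")]
collapsed = [(remove_diacritics(a), remove_diacritics(b)) for a, b in pairs]
```

After this step “kór” and “kor” are indexed under the same term, which is exactly the source of the semantic ambiguity noted in the text.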

Finally, most European languages manifest other morphological characteristics, with compound word constructions being only one example (e.g., handgun, worldwide). Recently, Braschler & Ripplinger (2004) showed that decompounding German words could significantly improve retrieval performance, and in some experiments with Hungarian where we used our decompounding algorithm (Savoy 2004b), both compound words and their component parts were left in the documents and queries.

4 Indexing and Searching Strategies

In order to obtain a broader view of the relative merit of various retrieval models, we first adopted a binary indexing scheme in which each document (or request) was represented by a set of keywords, without any weight. To measure the similarity between documents and requests, we computed the inner product (retrieval model denoted “doc=bnn, query=bnn” or “bnn-bnn”). In order to weight the presence of each indexing term in a document surrogate (or in a query), we took the term occurrence frequency into account (denoted tfij for indexing term tj in document Di, and the corresponding retrieval model was denoted: “doc=nnn, query=nnn”) or we might also account for their inverse document frequency (denoted idfj). Moreover, we might normalize each indexing weight using different weighting schemes, as is described in the Appendix.
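The following sketch illustrates the “bnn”, “nnn” and “ltn” document weights together with the inner-product similarity. The “ltn” formula follows Table A.1; the definition idf_j = ln(n / df_j) is an assumption, since the text does not define idf explicitly, and the function names and toy values are illustrative.

```python
import math
from collections import Counter


def weight_vector(terms, scheme, n_docs=None, df=None):
    """Build a term -> weight mapping for one document or query."""
    tf = Counter(terms)
    if scheme == "bnn":  # binary indexing: presence only
        return {t: 1.0 for t in tf}
    if scheme == "nnn":  # raw term frequency
        return {t: float(f) for t, f in tf.items()}
    if scheme == "ltn":  # (ln(tf) + 1) * idf, with idf assumed to be ln(n / df)
        return {t: (math.log(f) + 1.0) * math.log(n_docs / df[t]) for t, f in tf.items()}
    raise ValueError(f"unknown scheme: {scheme}")


def inner_product(doc_vec, query_vec):
    """Similarity between a document and a request as an inner product."""
    return sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)


doc = ["football", "referee", "football"]
query = ["football", "dispute"]
score_bnn = inner_product(weight_vector(doc, "bnn"), weight_vector(query, "bnn"))
score_nnn = inner_product(weight_vector(doc, "nnn"), weight_vector(query, "nnn"))
ltn_doc = weight_vector(doc, "ltn", n_docs=100, df={"football": 10, "referee": 5})
```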

In addition to these models based on the vector-space paradigm, we also considered probabilistic models such as the Okapi model (Robertson et al. 2000). As a second probabilistic approach, we implemented the Prosit approach, one member of a family of models suggested by Amati & van Rijsbergen (2002) and based on combining two information measures, formulated as follows:

$$\mathrm{Prob}^1_{ij} = \frac{tfn_{ij}}{tfn_{ij} + 1} \quad \text{with} \quad tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + \frac{C \cdot mean\ dl}{l_i}\right]$$

$$\mathrm{Prob}^2_{ij} = \frac{1}{1 + \lambda_j} \cdot \left[\frac{\lambda_j}{1 + \lambda_j}\right]^{tfn_{ij}} \quad \text{with} \quad \lambda_j = \frac{tc_j}{n}$$

where $w_{ij}$ indicates the indexing weight attached to term $t_j$ in document $D_i$, $l_i$ the number of indexing terms included in the representation of $D_i$, $tc_j$ the number of occurrences of term $t_j$ in the collection and $n$ the number of documents in the corpus. In our experiments, the constants b, k1, avdl, pivot, slope, C and mean dl were fixed according to the values listed in Table 2 (the German, English and Russian languages are used in the GIRT experiments).
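A minimal numerical sketch of the two Prosit measures follows. The way the two measures are combined into the final weight, w = (1 - Prob1) · (-log2 Prob2), is an assumption based on the usual presentation of this model family and is not stated in the text above; the parameter values are toy numbers, not those of Table 2.

```python
import math


def prosit_components(tf, doc_len, tc, n_docs, mean_dl, c):
    """Return (tfn, Prob1, Prob2) for one term in one document."""
    tfn = tf * math.log2(1.0 + (c * mean_dl) / doc_len)
    prob1 = tfn / (tfn + 1.0)
    lam = tc / n_docs  # lambda_j = tc_j / n
    prob2 = (1.0 / (1.0 + lam)) * (lam / (1.0 + lam)) ** tfn
    return tfn, prob1, prob2


def prosit_weight(tf, doc_len, tc, n_docs, mean_dl, c):
    """Assumed combination of the two measures: (1 - Prob1) * (-log2 Prob2)."""
    _, prob1, prob2 = prosit_components(tf, doc_len, tc, n_docs, mean_dl, c)
    return (1.0 - prob1) * -math.log2(prob2)


# Toy values (assumed): a document of average length, term seen twice.
tfn, prob1, prob2 = prosit_components(tf=2, doc_len=100, tc=50, n_docs=100, mean_dl=100, c=1.0)
weight = prosit_weight(tf=2, doc_len=100, tc=50, n_docs=100, mean_dl=100, c=1.0)
```

With these toy values the normalized frequency equals the raw frequency, because the document length equals the mean length and C = 1.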

To measure the retrieval performance, we adopted non-interpolated mean average precision (MAP), computed on the basis of 1,000 retrieved items per request by the new TREC-EVAL program. To statistically determine whether or not a given search strategy would be better than another, we applied the bootstrap methodology (Savoy 1997). Thus, in the tables included in this paper we underlined statistically significant differences based on a two-sided non-parametric bootstrap test applied to the MAP difference, with a significance level fixed at 5%. We indexed the different collections using words as indexing units.

The evaluations of our two probabilistic models and nine vector-space schemes are listed in Table 3 for the French and Portuguese corpora, and in Table 4 for the Bulgarian and Hungarian collections. In these tables, the best performance under given conditions (with the same indexing scheme and the same collection) is listed in bold type. This best-performing approach is also used as the baseline for our statistical testing. The underlined results therefore indicate that the difference in mean average precision can be viewed as statistically significant when compared to the best system value. As depicted in Table 3, the Okapi model was found to be the best IR model for the French and Portuguese collections. For these two corpora however, the difference in MAP between the various IR models is usually statistically significant. As shown in Table 4 (and in Table A.4 in the Appendix) similar conclusions can be drawn for the Bulgarian and Hungarian collections. In this case the best performing system was the Prosit model for Bulgarian, and the Okapi probabilistic approach for Hungarian. Moreover, five IR models were shown to have statistically similar performance levels (Okapi, Prosit, “doc=Lnu, query=ltc”, “doc=dtu, query=dtn”, “doc=atn, query=ntc”).
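For readers unfamiliar with the measure, non-interpolated average precision and its mean over queries can be sketched as follows. This is a simplified illustration, not the TREC-EVAL implementation; the function names and the toy rankings are assumptions.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Non-interpolated average precision over the first `cutoff` retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-query values; `runs` and `qrels` map query id -> ids."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)


runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["x"]}
qrels = {"q1": {"d1", "d3", "d9"}, "q2": {"x"}}
ap_q1 = average_precision(runs["q1"], qrels["q1"])
map_value = mean_average_precision(runs, qrels)
```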

Moreover, the data in these tables shows that when the number of search terms increases (from T, TD to TDN), retrieval effectiveness usually increases also (except for the “doc=bnn, query=bnn” or “doc=nnn, query=nnn” IR models). From an analysis of the five best retrieval schemes shown in Tables 3 and 4 (namely, Prosit, Okapi, “doc=Lnu, query=ltc”, “doc=dtu, query=dtn” and “doc=atn, query=ntc”), the improvement is around 33.4% when comparing title-only (or T) with TDN queries for the Portuguese collection, 31.3% for the French corpus, 21% for Hungarian (see Table A.4 in the Appendix), and 6.4% for the Bulgarian collection. With the Hungarian collection, we automatically decompounded long words (composed of more than 8 characters) using our own algorithm (Savoy 2004b). In this experiment, both the compound words and their components were left in documents and queries (under the label “TD-decomp” in Table 4). Using the TD queries and the Okapi model, we achieved a MAP of 0.3391, reflecting a degradation of -3.1% when compared to an indexing approach that did not use decompounding. Based on the five best retrieval schemes, the mean degradation is around -1.6%. Using a lighter stemmer (fewer rules) for the Hungarian language (retrieval performance depicted under the label “TD-light” in Table 4), the mean difference in MAP over the five best retrieval schemes is around 2%, in favor of the more complex stemming approach.

It was observed that pseudo-relevance feedback (PRF or blind-query expansion) seemed to be a useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach (Buckley et al. 1996) with α = 0.75, β = 0.75, whereby the system was allowed to add to the original query m terms extracted from the k best-ranked documents. To evaluate this proposition, we used the Okapi and the Prosit probabilistic models and enlarged the query by 10 to 50 terms retrieved from the 3 to 10 best-ranked articles.
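The expansion step can be sketched as follows, with both Rocchio coefficients set to 0.75 as in the text. The term-selection rule (the m highest-weighted new terms in the centroid of the k best-ranked documents) and the data structures are assumptions made for illustration.

```python
from collections import Counter


def rocchio_expand(query, top_docs, m, alpha=0.75, beta=0.75):
    """Expand `query` (term -> weight) with m terms from the best-ranked documents.

    `top_docs` is a list of term -> weight mappings for the k best-ranked documents.
    """
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)
    # Assumed selection rule: keep the m strongest terms absent from the query.
    new_terms = sorted(
        (t for t in centroid if t not in query),
        key=lambda t: (-centroid[t], t),
    )[:m]
    return {
        term: alpha * query.get(term, 0.0) + beta * centroid.get(term, 0.0)
        for term in list(query) + new_terms
    }


query = {"football": 1.0}
top_docs = [
    {"football": 2.0, "referee": 3.0, "match": 1.0},
    {"referee": 1.0, "goal": 2.0},
]
expanded = rocchio_expand(query, top_docs, m=1)
```

Here k = 2 and m = 1, so only “referee” is added; the experiments described above used k between 3 and 10 and m between 10 and 50.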

Table 5 depicts our best results using the pseudo-relevance feedback technique for the Okapi model and demonstrates that the optimal parameter setting seemed to be collection-dependent. Moreover, performance improvement also seemed to be collection-dependent (or language-dependent), with the French corpus showing an increase of +9.2% (from a mean average precision of 0.3754 to 0.4099), +5.2% for the Portuguese collection (from 0.3477 to 0.3668), +1.3% for the Hungarian collection (from 0.3501 to 0.3545), and +0.8% for the Bulgarian corpus (from 0.2704 to 0.2726). Table 6 shows that similar conclusions can be drawn using the Prosit model. In this case however, the blind query expansion yielded a greater improvement for all collections (e.g., for the French corpus, an increase of +14.3%, from a mean average precision of 0.3696 to 0.4225). In both Tables 5 and 6, the baseline used for our statistical testing was the MAP calculated before the query was automatically expanded. In this case, it is interesting to note that our statistical testing cannot always detect a significant difference in MAP before and after blind query expansion, especially for the Bulgarian and Hungarian collections.
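The percentages quoted above are relative changes with respect to the MAP obtained before expansion. A small sketch of that arithmetic, using MAP values taken from the text (the function name is an illustrative assumption):

```python
def relative_change(baseline_map, new_map):
    """Relative change in percent of new_map with respect to baseline_map."""
    return 100.0 * (new_map - baseline_map) / baseline_map


french_okapi = round(relative_change(0.3754, 0.4099), 1)
french_prosit = round(relative_change(0.3696, 0.4225), 1)
hungarian_okapi = round(relative_change(0.3501, 0.3545), 1)
bulgarian_okapi = round(relative_change(0.2704, 0.2726), 1)
```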

5 Data Fusion

It is assumed that combining different search models should improve retrieval effectiveness, due to the fact that different document representations might retrieve different pertinent items and thus increase the overall recall (Vogt & Cottrell 1999). On the other hand, when combining different search schemes, we might suppose that these various IR strategies are more likely to rank the same relevant items higher on the list than they would for non-relevant documents (viewed as outliers). Thus, combining them could improve retrieval effectiveness by ranking pertinent documents higher and ranking non-relevant items lower. Based on our previous studies (Savoy 2004b, 2005a), however, this expected positive effect does not always materialize.

In the current study we combined only the two probabilistic models because they usually depict the best or one of the best retrieval performances (Savoy 2004b, 2005a). To achieve this we evaluated various fusion operators (see Table 7 for a list of their precise descriptions). For example, the Sum RSV operator indicates that the combined document score (or the final retrieval status value) is simply the sum of the retrieval status value (RSVk) of the corresponding document Dk computed by each single indexing scheme (Fox & Shaw 1994). Table 7 also illustrates how both the Norm Max and Norm RSV operators apply a normalization procedure when combining document scores. When combining the retrieval status value (RSVk) for various indexing schemes, and in order to favor some more effective retrieval schemes, we could multiply the document score by a constant ai (usually equal to 1) reflecting the differences in retrieval performance.

Sum RSV     SUM (ai . RSVk)
Norm Max    SUM (ai . (RSVk / Maxi))
Norm RSV    SUM [ai . ((RSVk - Mini) / (Maxi - Mini))]
Z-Score     ai . [((RSVk - Meani) / Stdevi) + di]   with di = [(Meani - Mini) / Stdevi]

In addition to using these data fusion operators, we also considered the round-robin approach, wherein we took one document in turn from all individual lists and removed any duplicates, retaining the most highly ranked instance. Finally we suggested merging the retrieved documents according to the Z-Score, computed for each result list. Within this scheme, for the ith result list, we needed to compute the average RSVk value (denoted Meani) and the standard deviation (denoted Stdevi). Based on these we could then normalize the retrieval status value for each document Dk provided by the ith result list by computing the deviation of RSVk with respect to the mean (Meani). In Table 7, Mini (Maxi) denotes the minimal (maximal) RSV value in the ith result list. Of course, we might also weight the relative contribution of each retrieval scheme by assigning a different ai value to each retrieval model.
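The fusion operators and the round-robin merge can be sketched as below. Result lists are represented as document -> RSV mappings; the use of the population standard deviation, the handling of a zero deviation and the tie-breaking rule are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def sum_rsv(rsv):
    """Sum RSV: scores are used as they are."""
    return dict(rsv)


def norm_max(rsv):
    top = max(rsv.values())
    return {doc: score / top for doc, score in rsv.items()}


def norm_rsv(rsv):
    low, high = min(rsv.values()), max(rsv.values())
    return {doc: (score - low) / (high - low) for doc, score in rsv.items()}


def z_score(rsv):
    """((RSV - Mean) / Stdev) + d, with d = (Mean - Min) / Stdev."""
    mu, sd, low = mean(rsv.values()), pstdev(rsv.values()), min(rsv.values())
    if sd == 0:  # assumed fallback when all scores are equal
        return {doc: 0.0 for doc in rsv}
    delta = (mu - low) / sd
    return {doc: (score - mu) / sd + delta for doc, score in rsv.items()}


def fuse(result_lists, normalize=sum_rsv, weights=None):
    """Weighted sum of normalized scores; returns (doc, score) by decreasing score."""
    weights = weights or [1.0] * len(result_lists)
    combined = {}
    for a_i, rsv in zip(weights, result_lists):
        for doc, score in normalize(rsv).items():
            combined[doc] = combined.get(doc, 0.0) + a_i * score
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


def round_robin(ranked_lists):
    """Take one document in turn from each list, keeping the first occurrence."""
    merged, seen = [], set()
    for rank in range(max((len(lst) for lst in ranked_lists), default=0)):
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged


okapi = {"d1": 3.0, "d2": 1.0}
prosit = {"d2": 10.0, "d3": 0.0, "d1": 5.0}
fused = fuse([okapi, prosit], normalize=z_score)
merged = round_robin([["a", "b", "c"], ["b", "d"]])
```

Passing different `weights` corresponds to assigning a different ai value to each retrieval model.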

[Data fusion results table; only the row labels were recovered: Okapi & PRF doc/term, Prosit & PRF doc/term, Round-robin, Sum RSV, Norm Max, Norm RSV, Z-Score, Z-ScoreW]

6 Official Results

7 Bilingual Information Retrieval

For the bilingual track, we chose English as the language for submitting queries to be automatically translated into four different languages, using nine different machine translation (MT) systems and four bilingual dictionaries (“Babylon”, “Ectaco”, “Medios”, and “Kerekes”). The following freely available translation tools were used in our experiments:

SYSTRAN            www.systranlinks.com/
GOOGLE             www.google.com/language_tools
FREETRANSLATION    www.freetranslation.com/web.htm
INTERTRAN          www.tranexp.com/
WORLDLINGO         www.worldlingo.com/
BABELFISH          babelFish.altavista.com/
PROMT              webtranslation.paralink.com/
ALPHAWORKS         www.alphaWorks.ibm.com/
APPLIEDLANGUAGE    www.appliedLanguage.com/
BABYLON            www.babylon.com
ECTACO             www.ectaco.co.uk/free-online-dictionaries/
MEDIOS             consulting.medios.fi/dictionary/ (only for Hungarian language)
KEREKES            www.cab.u-szeged.hu/cgi-bin/szotar (only for Hungarian language)

When using the different bilingual dictionaries to translate an English request word-by-word, usually more than one translation is provided, in an unspecified order. We decided to pick only the first translation available (labeled “Babylon 1” or “Ectaco 1”), the first two terms (e.g., “Babylon 2” or “Medios 2”) or the first three available translations (labeled “Babylon 3”).
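The word-by-word strategy of keeping the first one, two or three translations can be sketched as follows. The toy English-French dictionary and the choice to keep untranslated words unchanged are assumptions made for illustration; they do not reproduce any of the dictionaries listed above.

```python
def translate_query(words, dictionary, n=1):
    """Translate word by word, keeping the first n translations of each word."""
    translated = []
    for word in words:
        # Assumption: a word missing from the dictionary is kept as it is.
        candidates = dictionary.get(word.lower(), [word])
        translated.extend(candidates[:n])
    return translated


# Toy dictionary (assumed), with translations in the order a tool might return them.
toy_dictionary = {
    "football": ["football", "foot"],
    "referee": ["arbitre", "juge"],
    "dispute": ["dispute", "querelle", "conflit"],
}
first_only = translate_query(["football", "referee"], toy_dictionary, n=1)
first_two = translate_query(["football", "referee"], toy_dictionary, n=2)
```

The settings n = 1, 2 and 3 correspond to run labels such as “Babylon 1”, “Babylon 2” and “Babylon 3”.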

Moreover, the query terms could be preprocessed in order to obtain their part-of-speech (PoS) information (using www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). Using this information, we could find the corresponding lemma and use it instead of the surface word before searching in the bilingual dictionaries. When this lemmatizing procedure was applied, we added the term “+ PoS” to the corresponding run label. Table 10 contains an example of this query preprocessing, showing how the plural form was removed (e.g., “disputes” into “dispute”) and how various verb forms were transformed into their lexical forms (e.g., “made” into “make” or “refereeing” into “referee”).

<num> C263 </num>
<title> Football Refereeing Disputes </title>
<desc> Find documents in which decisions made by a referee during a football match are criticised. </desc>
<narr> Relevant documents report on football (soccer) matches in which the referee made some disputable or disputed decision. </narr>

<num> C263 </num>
<title> Football referee dispute </title>
<desc> find document in which decision make by a referee during a football match be criticize. </desc>
<narr> relevant document report on football (soccer) match in which the referee make some disputable or disputed decision. </narr>

French, 50 queries: 0.3754, 0.3149 (83.9%), 0.3259 (86.8%), 0.2814 (75.0%), 0.1839 (49.0%), 0.3095 (82.5%), 0.3149 (83.9%), 0.3066 (81.7%), 0.2991 (79.7%), 0.3149 (83.9%)

From this data, we can see that for the French collection the best translation is obtained by Google and for the Portuguese corpus by Promt. The FreeTranslation and Promt MT systems usually obtain satisfactory retrieval performances for these two languages (around 79.3% of the MAP obtained by the corresponding monolingual search for the Promt system, and 73.6% for FreeTranslation). Other good translation systems found were the BabelFish, Systran and AppliedLanguage which worked well for French. For Bulgarian and Hungarian languages, we found only a few translation tools, and unfortunately their overall performance levels were not very good. As depicted in Table 12, we also found that lemmatizing the English queries (for both the Bulgarian or Hungarian languages at least) would improve mean average precision.

Finally, Table 14 lists the parameter settings used for 12 official runs in the bilingual task. Each experiment uses queries written in English to retrieve documents in the other target languages. Before combining the result lists we automatically expanded the translated queries using a pseudo-relevance feedback method (Rocchio’s approach in the present case).

8 Monolingual Domain-Specific Retrieval: GIRT

In the domain-specific retrieval task (called GIRT), the three available corpora are composed of bibliographic records extracted from various sources in the social sciences domain; see Kluck (2004) for a more complete description of these corpora. A few statistics on these collections are given in Table 15.

                                     German         English
Size (in MB)                         326 MB         199 MB
# of documents                       151,319        151,319
# of distinct terms                  698,638        151,181
Number of distinct indexing terms / document
  Mean                               70.83          107.9
  Standard deviation                 32.4           94.59
  Median                             68             77
  Maximum                            386            1,422
  Minimum                            2              2
Number of indexing terms / document
  Mean                               89.61          142.1
  Standard deviation                 44.5           139.84
  Median                             84             95
  Maximum                            629            4,984
  Minimum                            4              2
Number of queries                    25             25
Number rel. items                    2,682          2,105
Mean rel./ request                   107.28         84.2
Standard deviation                   91.654         69.109
Median                               75             54
Maximum                              318 (Q#150)    242 (Q#150)
Minimum                              8 (Q#129)      6 (Q#129)

In total these collections contain 397,218 documents or about 590 MB, and for the most part are written in German. A typical record in this collection is composed of a title, an abstract, and a set of manually assigned keywords (see Table 16 for English examples and Table 17 for their corresponding German records). Additional information such as authors' names, publication date, or the language in which the bibliographic notice is written may of course be less important from an IR perspective but is made available. As depicted in the Appendix, the topics in this domain-specific collection cover a variety of themes (e.g., “Electoral Behaviour”, “New Art”, “Soccer and Society”, or “Churches and Money”).

<DOC>
<DOCNO> GIRT-EN19901932
<TITLE-EN> The Socio-Economic Transformation of a Region : the Bergische Land from 1930 to 1960
<AUTHOR> Henne, Franz J.
<AUTHOR> Geyer, Michael
<PUBLICATION-YEAR> 1990
<LANGUAGE-CODE> EN
<CONTROLLED-TERM-EN> Rhenish Prussia
<CONTROLLED-TERM-EN> historical development
<CONTROLLED-TERM-EN> regional development
<CONTROLLED-TERM-EN> socioeconomic factors
<METHOD-TERM-EN> historical
<METHOD-TERM-EN> document analysis
<CLASSIFICATION-TEXT-EN> Social History
<DOC>
<DOCNO> GIRT-EN19902732
<TITLE-EN> Ethnic Politicians in Congress: German-American Case Studies on the Interaction of Ethnicity, Nationality and Democratic Government 1865-1930
<AUTHOR> Adams, Willi Paul
<PUBLICATION-YEAR> 1990
<LANGUAGE-CODE> EN
<CONTROLLED-TERM-EN> ethnic group
<CONTROLLED-TERM-EN> North America
…

<DOC>
<DOCNO> GIRT-DE19909343
<TITLE-DE> Die sozioökonomische Transformation einer Region : Das Bergische Land von 1930 bis 1960
<AUTHOR> Henne, Franz J.
<AUTHOR> Geyer, Michael
<PUBLICATION-YEAR> 1990
<LANGUAGE-CODE> DE
<CONTROLLED-TERM-DE> Rheinland
<CONTROLLED-TERM-DE> historische Entwicklung
<CONTROLLED-TERM-DE> regionale Entwicklung
<CONTROLLED-TERM-DE> sozioökonomische Faktoren
<METHOD-TERM-DE> historisch
<METHOD-TERM-DE> Aktenanalyse
<CLASSIFICATION-TEXT-DE> Sozialgeschichte
<ABSTRACT-DE> Die Arbeit hat das Ziel, anhand einer regionalen Studie die Entstehung des "modernen" fordistischen Wirtschaftssystems und des sozialen Systems im Zeitraum zwischen 1930 und 1960 zu beleuchten; dabei geht es auch um das Studium des "Sozial-imaginären", der Veränderung von Bewußtsein und Selbst-Verständnis von Arbeitern durch das Erlebnis und die Erfahrung der Depression, des Nationalsozialismus und der Nachkriegszeit, welches sich in den 1950er Jahren gemeinsam mit der wirtschaftlichen Veränderung zu einem neuen "System" zusammenfügt.
<DOC>
<DOCNO> GIRT-DE19909106
<TITLE-DE> Politiker einer ethnischen Gruppe im Kongreß: Deutsch-amerikanische Fallstudien zur Interaktion von Ethnizität, Nationalität und demokratischer Regierung, 1865-1930
<AUTHOR> Adams, Willi Paul
<PUBLICATION-YEAR> 1990
<LANGUAGE-CODE> DE
<CONTROLLED-TERM-DE> ethnische Gruppe
<CONTROLLED-TERM-DE> Nordamerika
…

[GIRT evaluation table, TD queries; only the row labels (Model \ # of queries) were recovered: Prosit; doc=Okapi, query=npn; doc=Lnu, query=ltc; doc=dtu, query=dtn; doc=atn, query=ntc; doc=ltn, query=ntc; doc=ntc, query=ntc; doc=ltc, query=ltc; doc=lnc, query=ltc; doc=bnn, query=bnn; doc=nnn, query=nnn]

Based on the GIRT corpus we are therefore able to evaluate the impact of manually assigned descriptors as compared to an indexing scheme based only on the information contained in the title and abstract sections of the corresponding article. To tackle this question we evaluated the GIRT collection using all sections (denoted “all” in Table 18), or only using titles and abstracts from bibliographic records (under the label “TI & AB”). In related research using the Amaryllis French corpus, we found that the “TI & AB” indexing scheme presents a loss of around 45% in mean average precision (Savoy 2005b) when compared to the “all” approach. In our experiments, the decrease in mean average precision is around -14.4% for the German corpus and -36.5% for the English GIRT collection.

Our 12 official runs in the monolingual GIRT task are described in Table 19. For each language, we submitted the first run using a data fusion operator (“Z-ScoreW” in this case). For all runs, we automatically expanded the queries using a blind relevance feedback method (Rocchio in our experiments), hoping to improve retrieval effectiveness.

[Official GIRT runs table; only part was recovered. Run names: UniNEgde1, UniNEgde2, UniNEgde3, UniNEgen1, UniNEgen2, UniNEgen3, UniNEgru1, UniNEgru2, UniNEgru3. Column headers: Language, Query.]

9 Conclusion

In this sixth CLEF evaluation campaign, we proposed a general stopword list and a light stemming procedure (removing only inflections attached to nouns and adjectives) for the Bulgarian and Hungarian languages (see Table 4 and Table A.4). In order to enhance retrieval performance, we suggested using a data fusion approach based on the Z-Score in order to combine the two probabilistic IR models (see Table 8). The results of this evaluation campaign seem to indicate that for the French and Portuguese languages such an approach proved to be effective (Table 8). The use of this search strategy did however require the building of two inverted files and doubling the search time required. For both the Bulgarian and Hungarian languages, more experiments are needed to confirm our first evaluations (especially in the design of a light stemming procedure for the Hungarian language, see Table 4). For all languages however, the probabilistic models (either Okapi or Prosit) usually result in better retrieval performances than do other vector-processing approaches (see Tables 3, 4, and 18 for the GIRT corpora), while the data fusion approach did not always improve mean average precision. The automatic decompounding of Hungarian words and its impact in IR remains an open question and our preliminary experiments did not provide a clear and precise answer (our decompounding scheme slightly decreased retrieval performance, as shown in Table 4).

As in previous evaluation campaigns we were able to confirm that pseudo-relevance feedback based on Rocchio’s model usually did improve mean average precision for the French and Portuguese languages, even though this improvement is not always statistically significant. For the other languages (Bulgarian and Hungarian), this blind query expansion did not improve mean average precision from a statistical point of view (Tables 5 and 6).

In the bilingual task, the freely available translation tools performed at a reasonable level for both the French and Portuguese languages (based on the three best translation tools, the MAP compared to the monolingual search is around 85% for the French language and 72.6% for the Portuguese). For less frequently used languages such as Bulgarian and Hungarian, the freely available translation tools (either the bilingual dictionary or the MT system) did not perform well. The mean average precision decreased by more than 50% (for Hungarian) to 80% (for Bulgarian), when compared to a monolingual search.

In the GIRT task (Table 18), we were able to measure the effect of manually assigned keywords on retrieval effectiveness: the presence of this information improved MAP by around 36.5% for the English corpus and 14.4% for the German collection.

Acknowledgments

The authors would also like to thank the CLEF-2005 task organizers for their efforts in developing various European language test-collections, and C. Buckley from SabIR for giving us the opportunity to use the SMART system. The first author is not able to thank the computing services at UniNE, because they consistently made no effort to be cooperative during this project. This research was supported in part by the Swiss National Science Foundation under Grant #21-66 742.01.

bnn     $w_{ij} = 1$
nnn     $w_{ij} = tf_{ij}$
ltn     $w_{ij} = (\ln(tf_{ij}) + 1) \cdot idf_j$
atn     $w_{ij} = idf_j \cdot [0.5 + 0.5 \cdot tf_{ij} / \max tf_{i.}]$
dtn     $w_{ij} = [\ln(\ln(tf_{ij}) + 1) + 1] \cdot idf_j$
npn     $w_{ij} = tf_{ij} \cdot \ln[(n - df_j) / df_j]$
Okapi   $w_{ij} = \frac{(k_1 + 1) \cdot tf_{ij}}{K + tf_{ij}}$
Lnu     $w_{ij} = \frac{(1 + \ln(tf_{ij})) / (\ln(\mathrm{mean}\ tf) + 1)}{(1 - slope) \cdot pivot + slope \cdot nt_i}$
lnc     $w_{ij} = \frac{\ln(tf_{ij}) + 1}{\sqrt{\sum_{k=1}^{t} (\ln(tf_{ik}) + 1)^2}}$
ntc     $w_{ij} = \frac{tf_{ij} \cdot idf_j}{\sqrt{\sum_{k=1}^{t} (tf_{ik} \cdot idf_k)^2}}$
ltc     $w_{ij} = \frac{(\ln(tf_{ij}) + 1) \cdot idf_j}{\sqrt{\sum_{k=1}^{t} ((\ln(tf_{ik}) + 1) \cdot idf_k)^2}}$
dtu     $w_{ij} = \frac{(\ln(\ln(tf_{ij}) + 1) + 1) \cdot idf_j}{(1 - slope) \cdot pivot + slope \cdot nt_i}$

Table A.1: Weighting schemes

To assign an indexing weight $w_{ij}$ that reflects the importance of each single term $t_j$ in a document $D_i$, we might use the various approaches shown in Table A.1, where $n$ indicates the number of documents in the collection, $t$ the number of indexing terms, $df_j$ the number of documents in which the term $t_j$ appears, $nt_i$ the length of $D_i$ (its number of indexing terms), and $avdl$, $b$, $k_1$, $pivot$ and $slope$ are constants. For the Okapi weighting scheme, $K$ represents the ratio between the length of $D_i$, measured by $l_i$ (the sum of $tf_{ij}$), and the collection mean, denoted by $avdl$.
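To make the weighting schemes of Table A.1 more concrete, the following minimal sketch implements four of them (ltn, Okapi, ltc and dtu). It assumes that $idf_j = \ln(n / df_j)$, which the text above does not spell out, and it takes the Okapi value $K$ as a ready-made argument instead of deriving it from the document length. The function names and argument layout are chosen for illustration only.

```python
import math


def idf(n, df):
    # Assumption: idf_j = ln(n / df_j); not defined explicitly in the text.
    return math.log(n / df)


def ltn(tf, n, df):
    """ltn: (ln(tf) + 1) * idf."""
    return (math.log(tf) + 1.0) * idf(n, df)


def okapi(tf, k1, K):
    """Okapi: ((k1 + 1) * tf) / (K + tf); K is supplied by the caller."""
    return ((k1 + 1.0) * tf) / (K + tf)


def ltc(doc_tf, n, df):
    """ltc: ltn weights of one document divided by their Euclidean norm.

    doc_tf: {term: term frequency in the document}
    df:     {term: number of documents containing the term}
    """
    raw = {term: ltn(tf, n, df[term]) for term, tf in doc_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw
    return {term: w / norm for term, w in raw.items()}


def dtu(tf, n, df, nt, pivot, slope):
    """dtu: (ln(ln(tf) + 1) + 1) * idf with pivoted length normalization.

    nt is the number of indexing terms of the document.
    """
    numerator = (math.log(math.log(tf) + 1.0) + 1.0) * idf(n, df)
    return numerator / ((1.0 - slope) * pivot + slope * nt)
```

The cosine normalization in ltc requires the weights of all terms of a document, whereas the other three schemes can be computed term by term.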

C276  EU Agricultural Subsidies
C277  Euthanasia by Medics
C278  Transport for Disabled
C279  Swiss Referendums
C280  Crime in New York
C281  Radovan Karadzic
C282  Prison Abuse
C283  James Bond Films
C284  Space Shuttle Missions
C285  Anti-abortion Movements
C286  Football Injuries
C287  Hostage / Terrorist Situations
C288  US Car Imports
C289  Falkland Islands
C290  Oil Price Fluctuation
C291  EU Illegal Immigrants
C292  Rebuilding German Cities
C293  China-Taiwan Relations
C294  Hurricane Force
C295  Money Laundering
C296  Public Performances of Liszt
C297  Expulsion of Diplomats
C298  Nuclear Power Stations
C299  UN Peacekeeping Risks
C300  Lottery Winnings

Health Economics
Oil and Politics
Street Children
Advertising and Ethics
Giving up Smoking
Radio and Internet
Poverty and Wealth
Diabetes Mellitus
Soccer and Society
Russian Germans and their Language
Anti-Semitism in the Soviet Union
Television Behaviour

Amati, G. & van Rijsbergen, C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM-TOIS, 20(4), 357-389.

Braschler, M. & Ripplinger, B. (2004). How effective is stemming and decompounding for German text retrieval? IR Journal, 7(3-4), 291-316.

Buckley, C., Singhal, A., Mitra, M. & Salton, G. (1996). New retrieval approaches using SMART. In Proceedings of TREC-4, (pp. 25-48). Gaithersburg: NIST Publication #500-236.

Fox, E.A. & Shaw, J.A. (1994). Combination of multiple searches. In Proceedings TREC-2, (pp. 243-249). Gaithersburg: NIST Publication #500-215.

Kluck, M. (2004). The GIRT data in the evaluation of CLIR systems - from 1997 until 2003. In C. Peters, J. Gonzalo, M. Braschler, M. Kluck (Eds.), Comparative Evaluation of Multilingual Information Access Systems. LNCS #3237. Springer-Verlag, Berlin, 2004, 376-390.

Robertson, S.E., Walker, S. & Beaulieu, M. (2000). Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 95-108.

Savoy, J. (1997). Statistical inference in retrieval effectiveness evaluation. Information Processing & Management, 33(4), 495-512.

Savoy, J. (2004a). Combining multiple strategies for effective monolingual and cross-lingual retrieval. IR Journal, 7(1-2), 121-148.

Savoy, J. (2004b). Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval. In C. Peters, J. Gonzalo, M. Braschler, M. Kluck (Eds.), Comparative Evaluation of Multilingual Information Access Systems. LNCS #3237. Springer-Verlag, Berlin, 2004, 322-336.

Savoy, J. (2005a). Data Fusion for effective European monolingual information retrieval. In C. Peters, P.D. Clough, G.J.F. Jones, J. Gonzalo, M. Kluck & B. Magnini (Eds.), Multilingual Information Access for Text, Speech and Images. LNCS #3491. Springer-Verlag, Berlin, 2005, 233-244.

Savoy, J. (2005b). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management, 41(4), 873-890.

Vogt, C.C. & Cottrell, G.W. (1999). Fusion via a linear combination of scores. IR Journal, 1(3), 151-173.

Mean average precision, Hungarian, TD queries (50 queries): 0.3420, 0.3501, 0.3301, 0.3401, 0.3215, 0.2853, 0.2208, 0.2484, 0.2395, 0.1424, 0.0875