
Report on CLEF-2003 Monolingual Tracks: Fusion of Probabilistic Models for Effective Monolingual Retrieval

Jacques Savoy

Institut interfacultaire d'informatique, Université de Neuchâtel, Switzerland

2003

For our third participation in the CLEF evaluation campaign, our first objective was to propose more effective and general stopword lists for the Swedish, Finnish and Russian languages along with an improved, more efficient and simpler stemming procedure for these three languages. Our second goal was to suggest a combined search approach based on a data fusion strategy that would work with various European languages. Included in this combined approach is a decompounding strategy for the German, Dutch, Swedish and Finnish languages.

1. Overview of the Test-Collections

Tables 1a and 1b also compare the number of relevant documents per request, with the mean always being greater than the median (e.g., for the English collection, the average number of relevant documents per query is 18.63, with a corresponding median of 7). These findings indicate that each collection contains numerous queries for which only a rather small number of relevant items are found. For each collection, 60 queries were created. However, relevant documents could not be found for every request in every language. The following queries do not have any relevant items: for the English collection, Queries #149, #161, #166, #186, #191 and #195; for the French corpus, Queries #146, #160, #161, #166, #169, #172, #191 and #194; for the German collection, Queries #144, #146, #170 and #191; for the Spanish collection, Queries #169, #188 and #195; for the Italian collection, Queries #144, #146, #158, #160, #169, #170, #172, #175 and #191; for the Dutch collection, Queries #160, #166, #191 and #194; for the Swedish collection, Queries #146, #160, #167, #191, #194, #197 and #198; and for the Finnish corpus, Queries #141, #144, #145, #146, #160, #167, #169, #175, #182, #186, #188, #189, #191, #194 and #195. The Russian corpus appears for the first time in a CLEF evaluation campaign, and for it we have only 28 requests.

During the indexing process of our automatic runs, we retained only the following logical sections from the original documents: <TITLE>, <HEADLINE>, <TEXT>, <LEAD>, <LEAD1>, <TX>, <LD>, <TI> and <ST>. From the topic descriptions we automatically removed certain phrases such as "Relevant document report …", "Find documents …", "Trouver des documents qui parlent …", "Sono valide le discussioni e le decisioni …", "Relevante Dokumente berichten …" or "Los documentos relevantes proporcionan información …".

Table 1a (English and French columns)

                                    English         French
Size (in MB)                        579 MB          331 MB
# of documents                      169,477         129,806
# of distinct terms                 426,757         355,691
Number of distinct indexing terms / document
  Mean                              156.9           118.5
  Standard deviation                118.77          95.72
  Median                            129             89
  Maximum                           1,881           1,621
  Minimum                           2               3
Number of queries                   54
Number rel. items                   1,006
  Mean rel. / request               18.63
  Standard deviation                28.61
  Median                            7
  Maximum                           139 (#Q:157)
  Minimum                           1 (#Q:141)

2. Stopword Lists and Stemming Procedures

In order to define general stopword lists, we first considered the 200 most frequent words found in each language, together with articles, pronouns, prepositions, conjunctions and very frequently occurring verb forms (e.g., to be, is, has, etc.). Compared to last year's stopword lists [Savoy 2002], we modified only those for the Swedish and Finnish languages, and we created a new one for the Russian language (these lists are available at www.unine.ch/info/clef/). For English we used the list provided by the SMART system (571 words), while for the other European languages, our stopword lists contained 430 words for Italian, 463 for French, 603 for German, 351 for Spanish, 1,315 for Dutch, 747 for Finnish, 386 for Swedish and 420 for Russian.
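The first step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `stopword_candidates`, its parameters and the toy token list are hypothetical, and the published lists also reflect manual choices that this sketch does not reproduce.

```python
from collections import Counter


def stopword_candidates(tokens, top_n=200, extra=()):
    """Return the top_n most frequent word forms plus hand-picked function words.

    tokens: iterable of word forms taken from a collection (assumed tokenized).
    extra:  additional articles, pronouns, etc. chosen by hand (hypothetical input).
    """
    counts = Counter(token.lower() for token in tokens)
    frequent = [word for word, _ in counts.most_common(top_n)]
    return set(frequent) | {word.lower() for word in extra}


# Toy usage: with top_n=2 the two most frequent forms are kept.
tokens = "the cat and the dog and the bird".split()
candidates = stopword_candidates(tokens, top_n=2, extra=["is"])
```

In practice the candidate set would still be inspected by hand before being used as a stopword list.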

Once it removes high-frequency words, an indexing procedure generally applies a stemming algorithm in an attempt to conflate word variants into the same stem or root. In developing this procedure for various European languages, we first wanted to remove only inflectional suffixes such as singular and plural word forms, and also feminine and masculine forms, such that they conflate to the same root. Our suggested stemmers also try to reduce various word declensions into the same stem, such as those used in the German, Finnish and Russian languages.

More sophisticated schemes have already been proposed for the removal of derivational suffixes (e.g., "-ize", "-ably", "-ship" in the English language), such as the stemmer developed by Lovins [1968] (based on a list of over 260 suffixes) or that of Porter [1980] (which looks for about 60 suffixes). For the French language only, our stemming approach tried to remove some derivational suffixes (e.g., "communicateur" -> "communiquer", "faiblesse" -> "faible"). For the Dutch language we used Kraaij & Pohlmann's stemmer [Kraaij 1996]. Our various stemming procedures can be found at www.unine.ch/info/clef/. Currently, it is not clear whether a stemming procedure such as ours, which removes only inflectional suffixes from nouns and adjectives, achieves better retrieval effectiveness than a stemming approach that also accounts for verbs or that removes both inflectional and derivational suffixes.

Finally, diacritic characters are usually not present in English collections (with some exceptions, such as "résumé"). For the Italian, Dutch, Finnish, Swedish, German, Spanish and Russian languages, these characters are replaced by their corresponding non-accented letters. For this latter language, we also convert and normalize the Cyrillic Unicode characters into the Latin alphabet (Perl script available at www.unine.ch/clef/).
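The replacement of accented letters can be illustrated with a short sketch. It is not the script mentioned above: it relies on Python's standard `unicodedata` module, the function name `strip_diacritics` is hypothetical, and it handles only letters that have a canonical decomposition (letters such as "ø" or "ß" are left unchanged, and the Cyrillic-to-Latin conversion is not covered).

```python
import unicodedata


def strip_diacritics(text):
    """Replace accented letters by their base letter (e.g., "é" -> "e")."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks keeps only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


examples = [strip_diacritics(w) for w in ("résumé", "präsident", "työviikko")]
```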

3. Decompounding Words

Most European languages manifest other morphological characteristics, with compound word constructions being just one example (e.g., handgun, worldwide). In German, for example, compound words are widely used and they may cause more difficulties than do those in English. For example, an insurance company would be "Versicherungsgesellschaft" ("Versicherung" + "S" + "Gesellschaft"). However, the morphological marker ("S") is not always present (e.g., "Atomtests" built as "Atom" + "Tests"), and sometimes the letter "S" belongs to the decompounded word (e.g., "Wintersports" for "Winter" + "Sports"). In Finnish, we also encounter similar constructions such as "rakkauskirje" ("rakkaus" + "kirje" for love & letter) or "työviikko" ("työ" + "viikko" for work & week). Recently, Braschler [2003] showed that decompounding German words may significantly improve retrieval performance.

Our proposed decompounding approach shares some similarity with Chen's algorithm [2002]. Before using it, we create a word list composed of all words appearing in the given collection (without stemming). Associated with each word, we also store the number of its occurrences in the collection (some examples are given in Table 2).

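Table 2: Examples of entries in the German word list (each word is stored with its number of occurrences): computer, computers, sicherheit, sicher, heit, bank, bund, bundes, bundesbank, präsident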

In order to present an overview of our decompounding approach, we will take as an example the German word "Computersicherheit", composed of "Computer" + "Sicherheit" (security). This compound word does not appear in our German word list as depicted in Table 2, so our algorithm starts the decompounding process by attempting to split the word after its k = 4 last letters (giving the two strings "computersicher" and "heit"). During the entire procedure, we only consider words having a length greater than a given threshold (fixed at 3 for all languages in our experiments). If both components appear in the word list, then we have a candidate for decompounding; otherwise the k limit is increased by one. Since, in our case, the string "computersicher" does not appear in the German word list, this splitting is rejected. When k = 9, our algorithm will find the word "computers" in the word list, but will fail to find the word "icherheit". With k = 10, our algorithm will find both the words "computer" and "sicherheit" in the German word list (see Table 2) and this solution becomes the top-level decompounding suggestion. Recursively, the system now tries to decompound the two parts, namely the words "computer" and "sicherheit". During this recursive process, the system is allowed to ignore some short sequences of letters at the end of a word (such as "-s" or "-es" in German, or "-s" for the Swedish language) because such morphological markers may indicate the genitive form (such as "'s" in the noun phrase "John's book").

After this generative part, the system returns a tree of the possible ways in which the compound construction can be broken down, and with each component we find the number of its occurrences in the corpus. In our example, the answer will be (computer 2452, sicherheit 6583 (sicher 4522, heit 4)). Thus, from this result, we know that the word "Sicherheit" appears 6583 times in the corpus, and we may consider decompounding this term into the words "sicher" and "heit". From this we can add to (or replace) the compound word in the document (or in the request) with all decompounding candidates ("computer" + "sicherheit", and "computer" + "sicher" + "heit" in our case) or with only the minimum number of components ("computer" + "sicherheit" in our case).

However, when faced with multiple candidates, our algorithm will try to select the single "best" one. To achieve this, our system will consider the total number of occurrences for the component words and if this value is greater than the number of occurrences for the compound construction, the decompounded candidate will be selected. In our example, the system will not decompound the word "Sicherheit" because the number of occurrences of the words "sicher" (4522) and "heit" (4) will not produce a total (4526) greater than the number of occurrences of the word "sicherheit" (6583).

If we consider the German word "Bundesbankpräsident" (president of the (German) federal bank), the generative part of our algorithm would return (bundesbank 1453 (bund 7032, bank 9657), präsident 24041) and the final decompounding approach would return (bund 7032, bank 9657, präsident 24041). In this case, the number of occurrences of "bundesbank" (1453) is smaller than the sum of the occurrences of the words "bund" and "bank". However, our approach does not always generate the appropriate components of a compounded term. For example, based on the compound construction "wintersports", the system answers with (winter 1643, port 1091) instead of (winter 1643, sport 1483). This problem is due to the fact that the first part of our approach ignores backtracking and will stop when it encounters the first splitting of the compound into two parts.
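The procedure described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. The frequencies are those quoted in the text, except for "computers" and "bundes", whose counts are made-up placeholders. The names `WORD_FREQ`, `lookup`, `split_once` and `decompound` are hypothetical, and two details are assumptions: a trailing "-es" or "-s" is ignored whenever the shortened form is in the word list, and the sketch directly applies the frequency-based selection instead of returning the full tree.

```python
# Collection frequencies; 100 is a hypothetical placeholder for unknown counts.
WORD_FREQ = {
    "computer": 2452, "computers": 100,
    "sicherheit": 6583, "sicher": 4522, "heit": 4,
    "bund": 7032, "bundes": 100,
    "bank": 9657, "bundesbank": 1453, "präsident": 24041,
    "winter": 1643, "port": 1091, "sport": 1483,
}
MIN_LEN = 4             # components must be longer than 3 letters
LINKERS = ("es", "s")   # endings that may be ignored (German)


def lookup(part):
    """Return the word-list entry for `part`, ignoring a trailing linker if possible."""
    for suffix in LINKERS:
        stem = part[: -len(suffix)]
        if part.endswith(suffix) and len(stem) >= MIN_LEN and stem in WORD_FREQ:
            return stem
    if len(part) >= MIN_LEN and part in WORD_FREQ:
        return part
    return None


def split_once(word):
    """Grow the tail from k = 4 letters and stop at the first valid split (no backtracking)."""
    for k in range(MIN_LEN, len(word) - MIN_LEN + 1):
        head, tail = lookup(word[:-k]), lookup(word[-k:])
        if head and tail:
            return head, tail
    return None


def decompound(word):
    """Split recursively; keep a split only if its parts are more frequent than the whole."""
    parts = split_once(word)
    if parts is None:
        return [word]
    total = sum(WORD_FREQ[p] for p in parts)
    if word in WORD_FREQ and total <= WORD_FREQ[word]:
        return [word]   # e.g., "sicherheit" (6583) vs. "sicher" + "heit" (4526)
    return [piece for p in parts for piece in decompound(p)]
```

With this toy word list, `decompound("computersicherheit")` keeps "sicherheit" whole, `decompound("bundesbankpräsident")` yields the three components discussed above, and `decompound("wintersports")` reproduces the "winter" + "port" error, because the first valid split is accepted without backtracking.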

4. Indexing and Searching Strategy

In order to obtain a broader view of the relative merit of various retrieval models, we first adopted a binary indexing scheme within which each document (or request) is represented by a set of keywords, without any weight. To measure the similarity between documents and requests, we computed the inner product (retrieval model denoted "doc=bnn, query=bnn" or "bnn-bnn"). In order to weight the presence of each indexing term in a document surrogate (or in a query), we could account for the term occurrence frequency (retrieval model notation: "doc=nnn, query=nnn" or "nnn-nnn") or we might also account for its frequency in the collection (or, more precisely, the inverse document frequency, denoted by idf_j). Moreover, a cosine normalization could prove beneficial, so that each indexing weight varies within the range of 0 to 1 (retrieval model notation: "ntc-ntc"; Table 3 gives the exact weighting formulation).

Other variants might also be created. For example, the tf component may be computed as 0.5 + 0.5 · [tf / max tf in a document] (retrieval model denoted "doc=atn"). We might also consider that a term's presence in a shorter document provides stronger evidence than it does in a longer document, leading to more complex IR models, for example the IR models denoted by "doc=Lnu" [Buckley 1996] or "doc=dtu" [Singhal 1999].
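To make the notation more concrete, the sketch below computes "ntc" and "atn" weights for one document and scores it against a query with the inner product. It assumes idf_j = ln(n / df_j), which is a common definition but is not stated explicitly in the text; the function names and the toy statistics are hypothetical.

```python
import math


def idf(df, n):
    """Inverse document frequency, assumed here to be ln(n / df)."""
    return math.log(n / df)


def ntc_weights(tf, df, n):
    """tf * idf followed by cosine normalization (weights fall between 0 and 1)."""
    raw = {t: f * idf(df[t], n) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def atn_weights(tf, df, n):
    """idf * (0.5 + 0.5 * tf / max tf in the document), without normalization."""
    max_tf = max(tf.values())
    return {t: idf(df[t], n) * (0.5 + 0.5 * f / max_tf) for t, f in tf.items()}


def inner_product(doc, query):
    """Similarity between a document and a query, both given as term -> weight."""
    return sum(w * query[t] for t, w in doc.items() if t in query)


# Toy statistics: term frequencies in one document, document frequencies, n documents.
tf = {"bank": 2, "bund": 1}
df = {"bank": 10, "bund": 100}
doc = ntc_weights(tf, df, n=1000)
score = inner_product(doc, ntc_weights({"bank": 1}, df, n=1000))
```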

Besides the previous models based on the vector-space approach, we also considered probabilistic models. In this vein, we used the Okapi probabilistic model [Robertson 2000], within which

$$K = k_1 \cdot \left[(1 - b) + b \cdot \frac{l_i}{avdl}\right]$$

accounts for the ratio between the length of $D_i$, measured by $l_i$ (the sum of the $tf_{ij}$), and the collection mean, denoted by $avdl$. In Table 3, the value of $nt_i$ indicates the number of distinct indexing terms included in the representation of $D_i$.
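The length normalization performed by K can be illustrated with a short sketch. The term weight (k1 + 1) · tf / (K + tf) used below is the usual Okapi form and is assumed here; the function names are hypothetical, and the values k1 = 1.2 and b = 0.75 are illustrative defaults rather than the settings of Table 4.

```python
def okapi_K(doc_len, avdl, k1, b):
    """K = k1 * [(1 - b) + b * (l_i / avdl)]: grows with the document length."""
    return k1 * ((1 - b) + b * (doc_len / avdl))


def okapi_weight(tf, doc_len, avdl, k1=1.2, b=0.75):
    """Assumed Okapi term weight: (k1 + 1) * tf / (K + tf)."""
    K = okapi_K(doc_len, avdl, k1, b)
    return ((k1 + 1) * tf) / (K + tf)


# A term occurring once weighs less in a document twice as long as the average.
average_doc = okapi_weight(tf=1, doc_len=100, avdl=100)
long_doc = okapi_weight(tf=1, doc_len=200, avdl=100)
```

When the document has exactly the average length, K reduces to k1, so the parameter b only matters for documents that are shorter or longer than the mean.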

As a second probabilistic approach, we implemented the Prosit (PRObabilistic Sift of Information Terms) approach [Amati 2002a, 2002b], which is based on the following indexing formula:

$$w_{ij} = Inf^{1}_{ij} \cdot Inf^{2}_{ij} = (1 - Prob^{1}_{ij}) \cdot Inf^{2}_{ij}$$

$$Prob^{1}_{ij} = \frac{tfn_{ij}}{tfn_{ij} + 1}, \qquad tfn_{ij} = tf_{ij} \cdot \log_2\left[1 + \frac{C \cdot mean\ dl}{l_i}\right]$$

$$Inf^{2}_{ij} = -\log_2\left[\frac{1}{1 + \lambda_j}\right] - tfn_{ij} \cdot \log_2\left[\frac{\lambda_j}{1 + \lambda_j}\right], \qquad \lambda_j = \frac{tc_j}{n}$$

in which $tc_j$ indicates the number of occurrences of term $t_j$ in the collection and $n$ the number of documents in the corpus. In our experiments, the constants b, k1, avdl, pivot, slope, C and mean dl are fixed according to the values listed in Table 4.
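The Prosit indexing formula can be transcribed directly into a small function. This is a sketch only: the function name and argument names are hypothetical, and the values used in the example (C = 1, a document of average length, a term occurring once per document on average) are chosen to make the arithmetic easy to follow, not taken from Table 4.

```python
import math


def prosit_weight(tf, doc_len, tc, n, C, mean_dl):
    """w_ij = (1 - Prob1) * Inf2, following the formula given in the text.

    tf: occurrences of the term in the document; doc_len: document length l_i;
    tc: occurrences of the term in the collection; n: number of documents.
    """
    tfn = tf * math.log2(1 + (C * mean_dl) / doc_len)   # normalized term frequency
    prob1 = tfn / (tfn + 1)
    lam = tc / n                                        # mean occurrences per document
    inf2 = -math.log2(1 / (1 + lam)) - tfn * math.log2(lam / (1 + lam))
    return (1 - prob1) * inf2


# With these values tfn = 1, Prob1 = 0.5, Inf2 = 2, hence a weight of 1.
w = prosit_weight(tf=1, doc_len=100, tc=1000, n=1000, C=1, mean_dl=100)
```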

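Table 3: Weighting schemes

bnn     $w_{ij} = 1$
nnn     $w_{ij} = tf_{ij}$
ltn     $w_{ij} = (\ln(tf_{ij}) + 1) \cdot idf_j$
atn     $w_{ij} = idf_j \cdot [0.5 + 0.5 \cdot tf_{ij} / \max tf_{i\cdot}]$
dtn     $w_{ij} = (\ln(\ln(tf_{ij}) + 1) + 1) \cdot idf_j$
npn     $w_{ij} = tf_{ij} \cdot \ln[(n - df_j) / df_j]$
Okapi   $w_{ij} = \frac{(k_1 + 1) \cdot tf_{ij}}{K + tf_{ij}}$
Lnu     $w_{ij} = \frac{(1 + \ln(tf_{ij})) / (\ln(mean\ tf) + 1)}{(1 - slope) \cdot pivot + slope \cdot nt_i}$
lnc     $w_{ij} = \frac{\ln(tf_{ij}) + 1}{\sqrt{\sum_{k=1}^{t} (\ln(tf_{ik}) + 1)^2}}$
ntc     $w_{ij} = \frac{tf_{ij} \cdot idf_j}{\sqrt{\sum_{k=1}^{t} (tf_{ik} \cdot idf_k)^2}}$
ltc     $w_{ij} = \frac{(\ln(tf_{ij}) + 1) \cdot idf_j}{\sqrt{\sum_{k=1}^{t} ((\ln(tf_{ik}) + 1) \cdot idf_k)^2}}$
dtu     $w_{ij} = \frac{(\ln(\ln(tf_{ij}) + 1) + 1) \cdot idf_j}{(1 - slope) \cdot pivot + slope \cdot nt_i}$

Table 4: Parameter settings (b, k1, avdl, pivot, slope, C, mean dl) for each language and indexing unit: English (word); French (word); Spanish (word); German (word, 5-gram); Italian (word); Dutch (word, 5-gram); Finnish (word, 5-gram); Swedish (word, 4-gram); Russian (word, 5-gram, 4-gram)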

To evaluate our approaches, we used the SMART system as a test bed running on an Intel Pentium III/600 (memory: 1 GB, swap: 2 GB, disk: 6 x 35 GB). To measure the retrieval performance, we adopted the non-interpolated mean average precision (computed on the basis of 1,000 retrieved items per request by the TREC_EVAL program). We indexed the English, French, Spanish and Italian collections using words as indexing units. The evaluation of our two probabilistic models and nine vector-space schemes is given in Table 5a.
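For readers unfamiliar with the measure, non-interpolated mean average precision can be sketched as below. This is an illustration rather than the evaluation program itself; the function names and the toy ranking are hypothetical, and queries without any relevant item are assumed to be left out of the mean, in line with the query counts reported for each collection.

```python
def average_precision(ranked_docs, relevant, cutoff=1000):
    """Mean of the precision values observed at the rank of each relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that are never retrieved contribute a precision of zero.
    return precision_sum / len(relevant)


def mean_average_precision(results, qrels, cutoff=1000):
    """Average over the queries that have at least one relevant item."""
    scores = [average_precision(results.get(query, []), relevant, cutoff)
              for query, relevant in qrels.items() if relevant]
    return sum(scores) / len(scores) if scores else 0.0


# Relevant items found at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.8333...
results = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9"]}
qrels = {"q1": {"d1", "d3"}, "q2": set()}
score = mean_average_precision(results, qrels)
```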

In order to represent German, Dutch, Swedish, Finnish and Russian documents and queries, we considered the n-gram, decompounded and word-based indexing schemes. The resulting mean average precision for these various indexing approaches is shown in Table 5b (German and Dutch corpora), in Table 5c (Swedish and Finnish languages) and in Table 5d (Russian collection).

It has been observed that pseudo-relevance feedback (blind-query expansion) seems to be a useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [Buckley 1996] with α = 0.75, β = 0.75, whereby the system was allowed to add to the original query m terms extracted from the k best-ranked documents. To evaluate this proposition, we used the Okapi and the Prosit probabilistic models and enlarged the query by 10 to 175 terms provided by the 3 or 10 best-retrieved articles.
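A simplified view of this blind-query expansion is sketched below. It is an assumption-laden illustration, not the SMART implementation: the function name is hypothetical, the feedback documents are averaged into a centroid, m is taken to count only terms that are new to the query, and the two constants of Rocchio's formula (0.75 each in the text) appear as `alpha` and `beta`.

```python
from collections import defaultdict


def rocchio_expand(query, top_docs, m, alpha=0.75, beta=0.75):
    """Reweight the query and add the m best new terms from the top-ranked documents.

    query:    dict term -> weight of the original query.
    top_docs: list of dicts term -> weight, one per best-ranked document.
    """
    centroid = defaultdict(float)
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)
    # Candidate terms are those not already in the query, best centroid weight first.
    new_terms = sorted((t for t in centroid if t not in query),
                       key=lambda t: centroid[t], reverse=True)[:m]
    expanded = {t: alpha * w + beta * centroid.get(t, 0.0) for t, w in query.items()}
    for term in new_terms:
        expanded[term] = beta * centroid[term]
    return expanded


docs = [{"bank": 0.5, "bund": 0.4, "geld": 0.1}, {"bank": 0.5, "bund": 0.2}]
expanded = rocchio_expand({"bank": 1.0}, docs, m=1)   # adds "bund" only
```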

[Table: Mean average precision, TD queries; French (52 queries), Spanish (57 queries)]

The results shown in Table 6 (our best results) indicate that the optimal parameter setting seems to be collection-dependent. Moreover, performance improvement also seems to be collection-dependent (or language-dependent), with no improvement for the English corpus yet an increase of 8.55% for the Spanish corpus (from a mean average precision of 51.71 to 56.13), 9.85% for the French corpus (from 48.41 to 53.18), 12.91% for the Italian language (from 41.05 to 46.35) and 13.26% for the German collection (from 41.25 to 46.72, combined model, Table 6b).


For the English, French, Spanish, Italian and Russian languages, we assumed that the n-gram indexing and word-based document representation approaches are distinct and independent sources of evidence regarding the content of documents. For the German, Dutch, Swedish and Finnish languages, we added the decompounding indexing approach to our document (and query) representation scheme.

In order to combine these two or three indexing schemes respectively, we evaluated various fusion operators, as suggested by Fox and Shaw [Fox 1994]; Table 7 gives their precise description. For example, the combSUM operator indicates that the combined document score (or the final retrieval status value) is simply the sum of the retrieval status values (RSVk) of the corresponding document Dk computed by each single indexing scheme. CombNBZ specifies that we multiply the sum of the document scores by the number of retrieval schemes that are able to retrieve the corresponding document. In Table 7, we can see that both combRSV% and combRSVnorm apply a normalization procedure when combining document scores. When combining the retrieval status values (RSVk) for various indexing schemes, we may multiply the document score by a constant ai (usually equal to 1) in order to favor the ith more efficient retrieval scheme. In addition to using these data fusion operators, we also considered the round-robin approach, whereby we take one document in turn from each individual list and remove duplicates, keeping the most highly ranked instance.
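The fusion operators and the round-robin merge can be sketched as follows. This is an illustrative reading of the operators, with hypothetical function names; it assumes that each run is a dictionary of positive scores and that a document counts as "nonzero" for a run when that run retrieved it.

```python
def _max_norm(run):
    top = max(run.values())
    return {doc: score / top for doc, score in run.items()}


def _min_max(run):
    lo, hi = min(run.values()), max(run.values())
    return {doc: (score - lo) / (hi - lo) if hi > lo else 1.0
            for doc, score in run.items()}


COMBINERS = {
    "combMAX": max,
    "combMIN": min,
    "combSUM": sum,
    "combANZ": lambda scores: sum(scores) / len(scores),
    "combNBZ": lambda scores: sum(scores) * len(scores),
    "combRSV%": sum,       # sum of scores divided by the best score of each run
    "combRSVnorm": sum,    # sum of min-max normalized scores
}


def fuse(runs, operator="combSUM", weights=None):
    """Combine several runs (dicts doc -> RSV) into one dict doc -> fused score."""
    if operator not in COMBINERS:
        raise ValueError(f"unknown operator: {operator}")
    weights = weights or [1.0] * len(runs)   # the constants a_i, usually 1
    if operator == "combRSV%":
        runs = [_max_norm(run) for run in runs]
    elif operator == "combRSVnorm":
        runs = [_min_max(run) for run in runs]
    fused = {}
    for doc in set().union(*runs):
        scores = [a * run[doc] for a, run in zip(weights, runs) if doc in run]
        fused[doc] = COMBINERS[operator](scores)
    return fused


def round_robin(rankings):
    """Take one document in turn from each ranked list, skipping duplicates."""
    merged, seen = [], set()
    for rank in range(max(map(len, rankings), default=0)):
        for ranking in rankings:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
    return merged


runs = [{"d1": 4.0, "d2": 2.0}, {"d2": 3.0, "d3": 1.0}]
ranking = sorted(fuse(runs, "combSUM").items(), key=lambda kv: kv[1], reverse=True)
```

In this toy example "d2" is retrieved by both schemes, so combSUM and combNBZ move it to the top of the list, whereas combANZ keeps "d1" first because it averages over the retrieving schemes only.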

Table 7: Data fusion combination operators

combMAX        MAX (ai · RSVk)
combMIN        MIN (ai · RSVk)
combSUM        SUM (ai · RSVk)
combANZ        SUM (ai · RSVk) / # of nonzero (RSVk)
combNBZ        SUM (ai · RSVk) * # of nonzero (RSVk)
combRSV%       SUM (ai · (RSVk / MAXRSV))
combRSVnorm    SUM [ai · ((RSVk - MINRSV) / (MAXRSV - MINRSV))]

Table 8a: Mean average precision using different combination operators (ai = 1, with blind-query expansion)
[Rows, query TD: Okapi expand doc/term; Prosit expand doc/term; combMAX; combMIN; combSUM; combANZ; combNBZ; combRSV%; combRSVnorm; round-robin]

[Table: Description of our runs. Columns: run name, language, query (TD or TDN), index (word, decomp., 4-gram or 5-gram), model (Okapi or Prosit), query expansion (number of best docs / number of terms) and combination operator (round-robin, RSV%, RSVnorm or sumRSV). Runs: UniNEfr and UniNEfr2 (French); UniNEsp and UniNEsp2 (Spanish); UniNEde and UniNEde2 (German); UniNEit and UniNEit2 (Italian); UniNEnl and UniNEnl2 (Dutch); UniNEsv and UniNEsv2 (Swedish); UniNEfi and UniNEfi2 (Finnish); UniNEru, UniNEru1, UniNEru2 and UniNEru3 (Russian)]

[Table: Mean average precision using different combination operators. Rows: Prosit word doc/term; Prosit decomp doc/term; Prosit n-gram doc/term; combMAX; combMIN; combSUM; combANZ; combNBZ; combRSV%; combRSVnorm; round-robin]

Conclusion

In this fourth CLEF evaluation campaign, we proposed a general stopword list and stemming procedure for eight European languages (excluding English). Currently, it is not clear whether a stemming procedure such as the one suggested, which removes only inflectional suffixes from nouns and adjectives, could produce better retrieval effectiveness than a stemming approach that takes both inflectional and derivational suffixes into account. We also suggested a simple decompounding approach for the German, Dutch, Swedish and Finnish languages. In order to achieve better retrieval performance, we used a data fusion approach, one requiring that document (and query) representation be based on two or three indexing schemes.

Acknowledgments

The author would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART system. This research was supported by the Swiss National Science Foundation under grant #21-66 742.01.

References

Amati, G., Carpineto, C. & Romano, G. (2002a). Italian monolingual information retrieval with PROSIT. In Proceedings of CLEF-2002, (pp. 145-151). Roma.

Amati, G. & van Rijsbergen, C.J. (2002b). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM TOIS, 20(4), 357-389.

Braschler, M. & Ripplinger, B. (2003). Stemming and decompounding for German text retrieval. In Proceedings 25th European Conference in IR (pp. 177-192). Berlin: Springer.

Buckley, C., Singhal, A., Mitra, M. & Salton, G. (1996). New retrieval approaches using SMART. In Proceedings of TREC'4, (pp. 25-48). Gaithersburg: NIST Publication #500-236.

Chen, A. (2002). Cross-language retrieval experiments at CLEF-2002. In Proceedings of CLEF-2002, (pp. 5-20). Roma.

Fox, E.A. & Shaw, J.A. (1994). Combination of multiple searches. In Proceedings TREC-2, (pp. 243-249). Gaithersburg: NIST Publication #500-215.

Kraaij, W. & Pohlmann, R. (1996). Viewing stemming as recall enhancement. In Proceedings of the ACM-SIGIR'96, (pp. 40-48). New York: The ACM Press.

Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1), 22-31.

Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.

Robertson, S.E., Walker, S. & Beaulieu, M. (2000). Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 95-108.

Savoy, J. (2002). Report on CLEF-2002 experiments: Combining multiple sources of evidence. In Proceedings of CLEF-2002, (pp. 31-46). Roma.

Singhal, A., Choi, J., Hindle, D., Lewis, D.D. & Pereira, F. (1999). AT&T at TREC-7. In Proceedings TREC-7, (pp. 239-251). Gaithersburg: NIST Publication #500-242.