Statistical vs. Rule-Based Stemming for Monolingual French Retrieval

Prasenjit Majumder (1), Mandar Mitra (1), Kalyankumar Datta (2)
(1) CVPR Unit, Indian Statistical Institute, Kolkata
(2) Dept. of EE, Jadavpur University, Kolkata
{prasenjit_t,mandar}@isical.ac.in, kalyandatta@debesh.wb.nic.in

Abstract

This paper describes our approach to the 2006 Adhoc Monolingual Information Retrieval run for French. The goal of our experiment was to compare the performance of a proposed statistical stemmer with that of a rule-based stemmer, specifically the French version of Porter's stemmer. The statistical stemming approach is based on lexicon clustering, using a novel string distance measure. We submitted three official runs, besides a baseline run that uses no stemming. The results show that stemming significantly improves retrieval performance (as expected) by about 9-10%, and that the performance of the statistical stemmer is comparable with that of the rule-based stemmer.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Performance, Experimentation

Keywords

statistical stemming, string distance, clustering, Porter's algorithm, monolingual information retrieval

1 Introduction

We have recently been experimenting with languages that have not been studied much from the IR perspective. These languages are typically resource-poor, in the sense that few language resources or tools are available for them. As a specific example, no comprehensive stemming algorithms are available for these languages. The stemmers that are available for more widely studied languages (e.g. English) usually make use of an extensive set of linguistic rules. Rule-based stemmers for most resource-poor languages are either unavailable or lack comprehensive coverage. In earlier work, therefore, we have looked at the problem of stemming for such resource-poor languages, and proposed a stemming approach based on purely unsupervised clustering techniques. Since the proposed approach does not assume any language-specific information, we expect the approach to work for multiple languages. The motivation behind our experiments at CLEF 2006 was to test this hypothesis. Thus, we focused on monolingual retrieval for French (a language which we know nothing about), and tried our statistical stemming approach on French data.

We give a brief overview of the proposed statistical stemming algorithm in the next section. We outline our experimental setup in Section 3, and discuss the results of the runs that we submitted.

2 Statistical Stemmer

2.1 String Distance Measures

Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. In the context of stemming, an appropriate distance measure is one that assigns a low distance value to a pair of strings when they are morphologically similar, and a high distance value to morphologically unrelated words. The languages that we have been experimenting with are primarily suffixing in nature, i.e. words are usually inflected by the addition of suffixes, and possible modifications to the tail-end of the word. Thus, for these languages, two strings are likely to be morphologically related if they share a long matching prefix.

Based on this intuition, we define a string distance measure D which rewards long matching prefixes and penalizes an early mismatch. Given two strings X = x_0 x_1 ... x_n and Y = y_0 y_1 ... y_{n'}, we first define a Boolean function p_i (for penalty) as follows:

    p_i = 0 if x_i = y_i, and 1 otherwise, for 0 ≤ i ≤ min(n, n').

Thus, p_i is 1 if there is a mismatch in the i-th position of X and Y. If X and Y are of unequal length, we pad the shorter string with null characters to make the string lengths equal. Let the length of the (padded) strings be n + 1, and let m denote the position of the first mismatch between X and Y (i.e. x_0 = y_0, x_1 = y_1, ..., x_{m-1} = y_{m-1}, but x_m ≠ y_m). We now define D as follows:

    D(X, Y) = ((n − m + 1) / m) × Σ_{i=m}^{n} 1/2^{i−m}   if m > 0, and ∞ otherwise.   (1)

Note that D does not consider any match once the first mismatch occurs. The actual distance is obtained by multiplying the total penalty by a factor which is intended to reward a long matching prefix, and penalize significant mismatches. For example, for the pair ⟨astronomer, astronomically⟩, m = 8 and n = 13. Thus, D = (6/8) × (1/2^0 + 1/2^1 + ... + 1/2^{13−8}) = 1.4766.

2.2 Lexicon Clustering

Using the distance function defined above, we can cluster all the words in a document collection into groups. Each group, consisting of "similar" strings, is expected to represent an equivalence class consisting of morphological variants of a single root word. The words within a cluster can be stemmed to the 'central' word in that cluster. Since the number of natural clusters is unknown a priori, partitive clustering algorithms like k-means are not suitable for our task. Also, the clusters are likely to be non-convex in nature. Graph-theoretic clustering algorithms appear to be the natural choice in this situation because of their ability to detect natural and non-convex clusters in the data.

Three variants of graph-theoretic clustering are popular in the literature, namely single-linkage, average-linkage, and complete-linkage [2]. Each of these algorithms is hierarchical (agglomerative or divisive) in nature. In the agglomerative form, the cluster tree (often referred to as a dendrogram) has individual data points as its leaves. The nearest (or most similar) pair(s) of points are merged to form groups, which in turn are successively merged to form progressively larger groups of points. Clustering stops when the similarity between the pair of closest groups falls below a pre-determined threshold. Alternatively, a threshold can be set on the distance value; when the distance between the pair of nearest groups exceeds the threshold, clustering stops. The three algorithms mentioned above differ in the way similarity between groups is defined. We choose the complete-linkage algorithm for our experiments.
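To make Sections 2.1 and 2.2 concrete, the two Python sketches below illustrate (i) the distance D of Equation (1) and (ii) a naive complete-linkage clustering of a lexicon. They are our illustrations under stated assumptions, not the code used for the official runs; in particular, the function names, the value returned for identical strings, the stopping threshold, and the choice of the cluster medoid as the 'central' word are all our own choices.

def distance(x, y):
    """Distance D of Equation (1) -- an illustrative sketch, not the authors' code."""
    # Pad the shorter string with a null character so both strings have length n + 1.
    length = max(len(x), len(y))              # length = n + 1
    x, y = x.ljust(length, "\0"), y.ljust(length, "\0")
    n = length - 1

    # m = position of the first mismatch between X and Y.
    m = next((i for i in range(length) if x[i] != y[i]), length)
    if m == length:
        return 0.0                            # identical strings (our assumption: D = 0)
    if m == 0:
        return float("inf")                   # mismatch at the very first character
    # D(X, Y) = ((n - m + 1) / m) * sum_{i=m}^{n} 1 / 2^(i - m)
    tail_penalty = sum(1.0 / 2 ** (i - m) for i in range(m, n + 1))
    return (n - m + 1) / m * tail_penalty

# Example from the text: <astronomer, astronomically> gives m = 8, n = 13,
# and D = (6/8) * (1 + 1/2 + ... + 1/2^5) = 1.4766 (to four decimal places).
print(round(distance("astronomer", "astronomically"), 4))    # 1.4766

Reusing distance() above, the clustering stage can be sketched as follows. The sketch recomputes all pairwise cluster distances at every merge step, which is fine for a toy lexicon but would need a more efficient formulation for a full document collection.

import itertools

def complete_linkage_clusters(words, dist, threshold):
    """Naive agglomerative complete-linkage clustering of a lexicon (illustrative sketch).

    dist: pairwise distance function (e.g. distance() above).
    threshold: stop merging once the closest pair of clusters is farther apart than this.
    """
    clusters = [[w] for w in words]                      # start from singleton clusters
    while len(clusters) > 1:
        best = None                                      # (distance, index_a, index_b)
        for a, b in itertools.combinations(range(len(clusters)), 2):
            # Complete linkage: inter-cluster distance = maximum pairwise distance.
            d = max(dist(x, y) for x in clusters[a] for y in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        if best[0] > threshold:                          # nearest clusters are too far apart
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]          # merge the two closest clusters
        del clusters[b]                                  # b > a, so index a stays valid
    return clusters

def central_word(cluster, dist):
    """Representative stem for a cluster: the medoid (an illustrative choice; the text
    only says variants are conflated to the 'central' word of their cluster)."""
    return min(cluster, key=lambda w: sum(dist(w, v) for v in cluster))

# Hypothetical usage on a toy lexicon; the threshold 1.5 is illustrative, not from the paper.
lexicon = ["astronomer", "astronomy", "astronomically", "different", "differently"]
for cluster in complete_linkage_clusters(lexicon, distance, threshold=1.5):
    print(central_word(cluster, distance), "<-", cluster)
# Groups the astronom* words in one cluster and the different* words in another.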
3 Results

We used the Smart [3] system for all our experiments. We submitted four official runs, including one baseline. For the baseline run (Cbaseline), queries and documents were indexed after eliminating stopwords (using the stopword list provided on the CLEF website). The