Czech Monolingual Information Retrieval Using Off-The-Shelf Components - the University of West Bohemia at CLEF 2007 Ad-Hoc track

Pavel Ircing and Luděk Müller
University of West Bohemia
{ircing, muller}@kky.zcu.cz

Abstract

The paper provides a brief description of the system assembled for the CLEF 2007 Ad-Hoc track by the University of West Bohemia. We have performed only monolingual experiments (Czech documents, Czech queries) using two incarnations of the tf.idf model, one with raw term frequency and the other with BM25 term frequency weighting, as implemented in the Lemur toolkit. The effect of blind relevance feedback was also explored. A Czech morphological analyser and tagger were used for lemmatization and stop-word removal. The results achieved seem to be quite reasonable, with MAP ranging from 0.11 to 0.30.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Experimentation

Keywords

Monolingual Ad-Hoc Information Retrieval

1 Introduction

Although our group is mainly interested in the CL-SR track of the CLEF campaign, we could not resist participating in Ad-Hoc once our native language was introduced to the track. Our runs were generated essentially just by putting together off-the-shelf components available either for Czech NLP or for general IR. Such a seemingly unambitious approach has, however, proven to be quite successful in past CLEF campaigns. We have performed monolingual Czech experiments only.

2 System description

2.1 Linguistic preprocessing

Stemming (or lemmatization) is considered to be vital for good IR performance. This assumption was experimentally confirmed by our group also for Czech-language IR in last year's CLEF CL-SR track [3]. Thus we have used the same method of linguistic preprocessing, that is, the serial combination of a Czech morphological analyser and tagger [2], which provides both the lemma and the stem for each input word form, together with a detailed morphological tag. This tag (namely its first position) is used for stop-word removal: we removed from indexing all words that were tagged as prepositions, conjunctions, particles and interjections.

2.2 Retrieval

All our retrieval experiments were performed using the Lemur toolkit [1], which offers a variety of retrieval models. We have decided to stick to the tf.idf model, where both documents and queries are represented as weighted term vectors $\vec{d}_i = (w_{i,1}, w_{i,2}, \cdots, w_{i,n})$ and $\vec{q}_k = (w_{k,1}, w_{k,2}, \cdots, w_{k,n})$, respectively ($n$ denotes the total number of distinct terms in the collection). The inner product of such weighted term vectors then determines the similarity between individual documents and queries. There are many different formulas for computing the weights $w_{i,j}$; we have tested two of them, varying in the tf component:

Raw term frequency

$$ w_{i,j} = tf_{i,j} \cdot \log \frac{d}{df_j} \tag{1} $$

where $tf_{i,j}$ denotes the number of occurrences of the term $t_j$ in the document $d_i$ (term frequency), $d$ is the total number of documents in the collection and finally $df_j$ denotes the number of documents that contain $t_j$.

BM25 term frequency

$$ w_{i,j} = \frac{k_1 \cdot tf_{i,j}}{tf_{i,j} + k_1 \left(1 - b + b \frac{l_d}{l_C}\right)} \cdot \log \frac{d}{df_j} \tag{2} $$

where $tf_{i,j}$, $d$ and $df_j$ have the same meaning as in (1), $l_d$ denotes the length of the document, $l_C$ the average length of a document in the collection, and finally $k_1$ and $b$ are parameters to be set.
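For illustration, the following Python sketch shows how the two term weights and the inner-product similarity could be computed. It is a minimal re-implementation of equations (1) and (2) for clarity, not Lemur's actual code; the function names, the sparse-dictionary vector representation and the use of the natural logarithm are our own assumptions.

```python
import math

def raw_tfidf_weight(tf, df, num_docs):
    """Raw term frequency weight, eq. (1): w = tf * log(d / df)."""
    return tf * math.log(num_docs / df)

def bm25_tf_weight(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term frequency weight, eq. (2).

    With b = 0 (as Lemur enforces for queries) the length-normalisation
    term reduces to 1 and avg_doc_len becomes irrelevant.
    """
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return (k1 * tf / norm) * math.log(num_docs / df)

def similarity(doc_vec, query_vec):
    """Inner product of two sparse weighted term vectors (term -> weight dicts)."""
    # Iterate over the smaller vector and look its terms up in the larger one.
    if len(doc_vec) < len(query_vec):
        doc_vec, query_vec = query_vec, doc_vec
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
```

Note that the base of the logarithm does not affect the ranking: changing it rescales all document and query weights by the same constant factor, which scales every similarity score uniformly.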
The tf components for queries are defined analogously, except for the average length of a query, which obviously cannot be determined, as the system is not aware of the full query set and processes one query at a time. The Lemur documentation is, however, not clear about the exact way of handling the $l_C$ value for queries. The values of $k_1$ and $b$ were set according to the suggestions made by [5] and [4], that is, $k_1 = 1.2$ and $b = 0.75$ for computing document weights and $k_1 = 1$ and $b = 0$¹ for query weights.

We have also tested the influence of blind relevance feedback. The simplified version of Rocchio's relevance feedback implemented in Lemur [5] was used for this purpose. Rocchio's original algorithm is defined by the formula

$$ \vec{q}_{new} = \vec{q}_{old} + \alpha \cdot \vec{d}_R - \beta \cdot \vec{d}_{\bar{R}} $$

where $R$ and $\bar{R}$ denote the sets of relevant and non-relevant documents, respectively, and $\vec{d}_R$ and $\vec{d}_{\bar{R}}$ denote the corresponding centroid vectors of those sets. In other words, the basic idea behind this algorithm is to move the query vector closer to the relevant documents and away from the non-relevant ones. In the case of blind feedback, the top $M$ documents from the first-pass run are simply considered to be relevant. The Lemur modification of this algorithm sets $\beta = 0$ and keeps only the $K$ top-weighted terms in $\vec{d}_R$.

¹ This is actually not a choice, as the value of b is hard-set to 0 for queries in Lemur.

3 Experimental Evaluation

There were 50 topics defined for the Ad-Hoc track, in a variety of languages. As we have already mentioned, we have used only the Czech topics for searching the Czech documents. The document set consists of electronic versions of articles from two nationwide newspapers (Mladá Fronta Dnes, Lidové Noviny); following the track organisers' instructions, we have indexed only the