=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-adhoc-IrcingEt2007
|storemode=property
|title=Czech Monolingual Information Retrieval Using Off-The-Shelf Components - the University of West Bohemia at CLEF 2007 Ad-Hoc track
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-adhoc-IrcingEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/IrcingM07a
}}
==Czech Monolingual Information Retrieval Using Off-The-Shelf Components - the University of West Bohemia at CLEF 2007 Ad-Hoc track==
Pavel Ircing and Luděk Müller
University of West Bohemia
{ircing, muller}@kky.zcu.cz

Abstract

The paper provides a brief description of the system assembled for the CLEF 2007 Ad-Hoc track by the University of West Bohemia. We have performed only monolingual experiments (Czech documents - Czech queries) using two incarnations of the tf.idf model — one with raw term frequency and the other with the BM25 term frequency weighting — as implemented in the Lemur toolkit. The effect of blind relevance feedback was also explored. A Czech morphological analyser and tagger were used for lemmatization and stop word removal. The results achieved seem to be quite reasonable, with MAP ranging from 0.11 to 0.30.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Experimentation

Keywords

Monolingual Ad-Hoc Information Retrieval

1 Introduction

Although our group is mainly interested in the CL-SR track of the CLEF campaign, we could not resist participating in Ad-Hoc once our native language was introduced to the track. Our runs were generated essentially just by putting together off-the-shelf components available either for Czech NLP or for general IR. Such a seemingly unambitious approach has, however, proven to be quite successful in past CLEF campaigns. We have performed monolingual Czech experiments only.

2 System description

2.1 Linguistic preprocessing

Stemming (or lemmatization) is considered to be vital for good IR performance. This assumption was experimentally confirmed by our group for Czech-language IR in last year's CLEF CL-SR track [3]. We have therefore used the same method of linguistic preprocessing, that is, the serial combination of a Czech morphological analyser and tagger [2], which provides both the lemma and the stem for each input word form, together with a detailed morphological tag. This tag (namely its first position) is used for stop-word removal — we removed from indexing all the words that were tagged as prepositions, conjunctions, particles and interjections.

2.2 Retrieval

All our retrieval experiments were performed using the Lemur toolkit [1], which offers a variety of retrieval models. We have decided to stick to the tf.idf model, where both documents and queries are represented as weighted term vectors $\vec{d}_i = (w_{i,1}, w_{i,2}, \cdots, w_{i,n})$ and $\vec{q}_k = (w_{k,1}, w_{k,2}, \cdots, w_{k,n})$, respectively ($n$ denotes the total number of distinct terms in the collection). The inner product of such weighted term vectors then determines the similarity between individual documents and queries. There are many different formulas for computing the weights $w_{i,j}$; we have tested two of them, varying in the tf component:

Raw term frequency

$w_{i,j} = tf_{i,j} \cdot \log \frac{d}{df_j}$    (1)

where $tf_{i,j}$ denotes the number of occurrences of the term $t_j$ in the document $d_i$ (term frequency), $d$ is the total number of documents in the collection and finally $df_j$ denotes the number of documents that contain $t_j$.

BM25 term frequency

$w_{i,j} = \frac{k_1 \cdot tf_{i,j}}{tf_{i,j} + k_1 (1 - b + b \frac{l_d}{l_C})} \cdot \log \frac{d}{df_j}$    (2)

where $tf_{i,j}$, $d$ and $df_j$ have the same meaning as in (1), $l_d$ denotes the length of the document, $l_C$ the average length of a document in the collection and finally $k_1$ and $b$ are the parameters to be set.
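To make the two weighting schemes concrete, the following is a minimal sketch in Python of how the document-side weights of equations (1) and (2) and the inner-product similarity could be computed. The function and variable names are ours, not Lemur's; the sketch only illustrates the formulas above, not the toolkit's actual implementation.

```python
import math

def raw_tf_weight(tf, df, num_docs):
    """Equation (1): raw term frequency multiplied by the idf term."""
    return tf * math.log(num_docs / df)

def bm25_tf_weight(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Equation (2): BM25 term-frequency component multiplied by the idf term."""
    normalized_tf = (k1 * tf) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return normalized_tf * math.log(num_docs / df)

def inner_product(doc_vec, query_vec):
    """Similarity of a document and a query, both represented as sparse
    {term: weight} dictionaries (the weighted term vectors of Section 2.2)."""
    return sum(weight * query_vec[term]
               for term, weight in doc_vec.items() if term in query_vec)
```

Document vectors would be built by applying one of these weight functions to every indexed term of each document; the query-side weights are discussed next.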
The tf components for queries are defined analogously, except for the average length of a query, which obviously cannot be determined, as the system is not aware of the full query set and processes one query at a time. The Lemur documentation is, however, not clear about the exact way of handling the $l_C$ value for queries. The values of $k_1$ and $b$ were set according to the suggestions made by [5] and [4], that is, $k_1 = 1.2$ and $b = 0.75$ for computing document weights and $k_1 = 1$ and $b = 0$ for query weights.¹

We have also tested the influence of blind relevance feedback. The simplified version of Rocchio's relevance feedback implemented in Lemur [5] was used for this purpose. The original Rocchio's algorithm is defined by the formula

$\vec{q}_{new} = \vec{q}_{old} + \alpha \cdot \vec{d}_R - \beta \cdot \vec{d}_{\bar{R}}$

where $R$ and $\bar{R}$ denote the sets of relevant and non-relevant documents, respectively, and $\vec{d}_R$ and $\vec{d}_{\bar{R}}$ denote the corresponding centroid vectors of those sets. In other words, the basic idea behind this algorithm is to move the query vector closer to the relevant documents and away from the non-relevant ones. In the case of blind feedback, the top M documents from the first-pass run are simply considered to be relevant. The Lemur modification of this algorithm sets $\beta = 0$ and keeps only the K top-weighted terms in $\vec{d}_R$ (a short illustrative sketch of this step is given at the end of Section 3).

¹ This is actually not a choice, as the value of b is hard-set to 0 for queries in Lemur.

3 Experimental Evaluation

There were 50 topics defined for the Ad-Hoc track, in a variety of languages. As we have already mentioned, we have used only the Czech topics for searching the Czech documents. The document set consists of electronic versions of articles from two nationwide newspapers (Mladá Fronta Dnes, Lidové Noviny); following the track organisers' instructions, we have indexed only the title and text fields, in both the original (non-lemmatized) and the lemmatized version.

The results are summarized in Table 1. The upper section shows the MAP for queries constructed by concatenating the tokens (either words or lemmas) from the title and description fields of the topics (TD); the lower section shows the results for queries made from all three topic fields, i.e. title, description and narrative (TDN). Results both with (BRF) and without (no FB) application of the blind relevance feedback are shown.

                        Raw TF                BM25 TF
                    no FB      BRF        no FB      BRF
  TD    words      0.1405    0.1101      0.2053    0.2500
        lemmas     0.1765    0.1247      0.2569    0.3025
  TDN   words      0.1491    0.1162      0.2219    0.2480
        lemmas     0.1869    0.1415      0.2277    0.2405

Table 1: MAP of the individual runs - bold runs were submitted for official scoring.

The table reveals several findings. The first of them is that the length normalization contained in the BM25 formula seems to have an immense effect on the performance — this is probably not very surprising to an experienced IR researcher, but it did surprise us, as we were dealing with documents of approximately uniform length in last year's CL-SR track (again see [3] for details). What is truly puzzling is the negligible effect of lemmatization for the runs using the BM25 term frequency component and TDN queries, especially when those runs are compared with the other "quadrants" of the table.
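To make the blind feedback step of Section 2.2 concrete (it is the difference between the BRF and no FB columns of Table 1), here is a minimal sketch of the simplified Rocchio variant with β = 0. The function and parameter names are ours and only illustrate the procedure described above, not Lemur's actual code; documents and queries are assumed to be sparse {term: weight} dictionaries as in the earlier sketch.

```python
def blind_feedback(query_vec, ranked_docs, alpha=1.0, top_m=10, top_k=20):
    """Simplified (beta = 0) Rocchio blind relevance feedback.

    ranked_docs: the first-pass result list of {term: weight} document
    vectors, ordered by decreasing retrieval score; the top_m documents
    are simply assumed to be relevant.
    """
    feedback_docs = ranked_docs[:top_m]
    if not feedback_docs:
        return dict(query_vec)

    # Centroid of the (pseudo-)relevant documents.
    centroid = {}
    for doc_vec in feedback_docs:
        for term, weight in doc_vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(feedback_docs)

    # Keep only the top_k highest-weighted terms of the centroid.
    kept = dict(sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

    # Move the query towards the centroid (beta = 0, so non-relevant
    # documents are ignored).
    new_query = dict(query_vec)
    for term, weight in kept.items():
        new_query[term] = new_query.get(term, 0.0) + alpha * weight
    return new_query
```

The MAP figures in Table 1 are the mean of the per-topic average precision. A minimal sketch of how the metric is computed (again our own illustration, not the official scoring software) is:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of a single ranked result list."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(per_topic_runs):
    """per_topic_runs: list of (ranked_doc_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in per_topic_runs) / len(per_topic_runs)
```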
4 Conclusion

Our participation in the Ad-Hoc track was motivated mainly by two factors — we wanted to enrich the diversity of the pool of results, and we wanted to see how our quite strong experience with Czech language processing and rather poor experience with designing IR systems would hold up in competition. While we have hopefully succeeded in the former, we still have no idea how we have done in the latter, as the organisers did not publish any cross-site comparison. We therefore look forward to seeing such a ranking in the track overview paper.

Acknowledgments

This work was supported by the Grant Agency of the Czech Academy of Sciences, project No. 1ET101470416, and the Ministry of Education of the Czech Republic, project No. LC536.

References

[1] Carnegie Mellon University and the University of Massachusetts. The Lemur Toolkit for Language Modeling and Information Retrieval. (http://www.lemurproject.org/), 2006.

[2] Jan Hajič. Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Prague, 2004.

[3] Pavel Ircing and Luděk Müller. Benefit of Proper Language Processing for Czech Speech Retrieval in the CL-SR Task at CLEF 2006. In C. Peters, P. Clough, F. Gey, J. Karlgren, B. Magnini, D. Oard, M. de Rijke, and M. Stempfhuber, editors, Evaluation of Multilingual and Multi-modal Information Retrieval - 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Lecture Notes in Computer Science, Alicante, Spain, 2007.

[4] Stephen Robertson and Steve Walker. Okapi/Keenbow at TREC-8. In The Eighth Text REtrieval Conference (TREC-8), 1999.

[5] Chengxiang Zhai. Notes on the Lemur TFIDF model. Note with Lemur 1.9 documentation, School of CS, CMU, 2001.