Attempts to Search Czech Spontaneous Spoken Interviews - the University of West Bohemia at CLEF 2007 CL-SR track

Pavel Ircing and Luděk Müller
University of West Bohemia
{ircing, muller}@kky.zcu.cz

Abstract

The paper presents an overview of the system built and the experiments performed for the CLEF 2007 CL-SR track by the University of West Bohemia. We concentrated on monolingual experiments using the Czech collection only. The approach that was successfully employed by our team in last year's campaign (a simple tf.idf model with blind relevance feedback, accompanied by solid linguistic preprocessing) was used again, but the set of performed experiments was broadened.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Experimentation

Keywords

Speech Retrieval

1 Introduction

The Czech subtask of the CL-SR track, first introduced in the CLEF 2006 campaign, is enormously challenging. Let us repeat once again that the goal is to identify appropriate replay points (that is, the moments where the discussion about the queried topic starts) in a continuous stream of text generated by automatic transcription of spontaneous speech. It is therefore neither a standard document retrieval task (as there are no true documents defined) nor fully-fledged speech retrieval (since the participants have neither the speech data nor the lattices, so they cannot explore alternative hypotheses and must rely on the one-best transcription).

However, in order to lower the barrier of entry for teams proficient in classic document retrieval (or, for that matter, even total IR beginners), last year's organisers prepared a so-called Quickstart collection with artificially defined "documents" that were created by sliding a 3-minute window over the stream of transcriptions with a 2-minute step (i.e., consecutive documents have a one-minute overlap).¹ Last year's Quickstart collection was further equipped with both manually and automatically generated keywords (see [5] for details), but they were shown to be of no benefit for IR performance [3] (the former because of the timing problems, the latter because of problems with their assignment that remain to be identified) and have thus been dropped from this year's data. The scripts for generating such a Quickstart collection with variable window and overlap times were also included in the data release.

¹ It turned out later that the actual timing was different due to some faulty assumptions made during the Quickstart collection design, but since the principle of the document creation remains the same, we will still use the "intended" time figures instead of the actual ones, just for the sake of readability.

2 System description

Our current system largely builds upon the one that was successful in last year's campaign [3], with only minor modifications and a larger set of tested settings.

2.1 Linguistic preprocessing

Stemming (or lemmatization) is considered to be vital for good IR performance even in the case of weakly inflected languages such as English; it is thus probably even more crucial for Czech as a representative of the richly inflectional language family. This assumption was experimentally proven by our group in last year's CLEF CL-SR track [3]. We have therefore used the same method of linguistic preprocessing, that is, the serial combination of a Czech morphological analyser and tagger [2], which provides both the lemma and the stem for each input word form, together with a detailed morphological tag. This tag (namely its first position) is used for stop-word removal: we removed from indexing all the words that were tagged as prepositions, conjunctions, particles and interjections.

2.2 Retrieval

All our retrieval experiments were performed using the Lemur toolkit [1], which offers a variety of retrieval models. We have decided to stick to the tf.idf model, where both documents and queries are represented as weighted term vectors \vec{d}_i = (w_{i,1}, w_{i,2}, \dots, w_{i,n}) and \vec{q}_k = (w_{k,1}, w_{k,2}, \dots, w_{k,n}), respectively (n denotes the total number of distinct terms in the collection). The inner product of such weighted term vectors then determines the similarity between individual documents and queries. There are many different formulas for the computation of the weights w_{i,j}; we have tested two of them, varying in the tf component:

Raw term frequency

    w_{i,j} = tf_{i,j} \cdot \log \frac{d}{df_j}    (1)

where tf_{i,j} denotes the number of occurrences of the term t_j in the document d_i (term frequency), d is the total number of documents in the collection and finally df_j denotes the number of documents that contain t_j.

BM25 term frequency

    w_{i,j} = \frac{k_1 \cdot tf_{i,j}}{tf_{i,j} + k_1 \left(1 - b + b \frac{l_d}{l_C}\right)} \cdot \log \frac{d}{df_j}    (2)

where tf_{i,j}, d and df_j have the same meaning as in (1), l_d denotes the length of the document, l_C the average length of a document in the collection and finally k_1 and b are the parameters to be set. The tf components for queries are defined analogously, except for the average length of a query, which obviously cannot be determined as the system is not aware of the full query set and processes one query at a time; the Lemur documentation is, however, not clear about the exact way of handling the l_C value for queries. The values of k_1 and b were set according to the suggestions made in [7] and [6], that is, k_1 = 1.2 and b = 0.75 for computing document weights and k_1 = 1 and b = 0 for query weights.

We have also tested the influence of blind relevance feedback. The simplified version of Rocchio's relevance feedback implemented in Lemur [7] was used for this purpose. The original Rocchio algorithm is defined by the formula

    \vec{q}_{new} = \vec{q}_{old} + \alpha \cdot \vec{d}_R - \beta \cdot \vec{d}_{\bar{R}}

where R and \bar{R} denote the set of relevant and non-relevant documents, respectively, and \vec{d}_R and \vec{d}_{\bar{R}} denote the corresponding centroid vectors of those sets. In other words, the basic idea behind this algorithm is to move the query vector closer to the relevant documents and away from the non-relevant ones. In the case of blind feedback, the top M documents from the first-pass run are simply considered to be relevant. The Lemur modification of this algorithm sets \beta = 0 and keeps only the K top-weighted terms in \vec{d}_R.
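To make the two weighting schemes concrete, the following minimal Python sketch mirrors equations (1) and (2). It is only an illustration: the collection statistics and the example figures are made up, and the code does not reproduce the actual Lemur implementation.

```python
import math

def raw_tfidf_weight(tf, d, df):
    """Equation (1): raw term frequency times idf, w = tf * log(d / df)."""
    return tf * math.log(d / df)

def bm25_weight(tf, d, df, l_d, l_C, k1=1.2, b=0.75):
    """Equation (2): BM25-normalised term frequency times idf."""
    normalized_tf = (k1 * tf) / (tf + k1 * (1.0 - b + b * l_d / l_C))
    return normalized_tf * math.log(d / df)

# Hypothetical figures: a term occurring 3 times in a 200-token document,
# a collection of 10,000 documents with average length 180 tokens,
# and 150 documents containing the term.
print(raw_tfidf_weight(3, 10000, 150))
print(bm25_weight(3, 10000, 150, l_d=200, l_C=180))              # document weight
print(bm25_weight(3, 10000, 150, l_d=200, l_C=180, k1=1, b=0))   # query-style weight (k_1 = 1, b = 0)
```

With b = 0 the length-normalisation term disappears entirely, which is consistent with the remark above that the average query length l_C is not available when queries are processed one at a time.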
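The blind feedback step can be sketched schematically as follows. The function below is a plain re-implementation of the simplified Rocchio update described above (\beta = 0, the top M documents treated as relevant, only the K heaviest centroid terms kept), with term vectors represented as Python dictionaries; it is not Lemur's code, and the parameter defaults and the toy vectors are placeholders.

```python
from collections import defaultdict

def blind_relevance_feedback(query_vec, ranked_doc_vecs, M=10, K=20, alpha=1.0):
    """Simplified Rocchio update with beta = 0:
    q_new = q_old + alpha * centroid(top-M documents),
    keeping only the K top-weighted terms of the centroid."""
    top_docs = ranked_doc_vecs[:M]          # pseudo-relevant set from the first-pass run
    if not top_docs:
        return dict(query_vec)

    # centroid vector of the pseudo-relevant documents
    centroid = defaultdict(float)
    for doc_vec in top_docs:
        for term, weight in doc_vec.items():
            centroid[term] += weight / len(top_docs)

    # keep only the K top-weighted terms of the centroid
    kept = sorted(centroid.items(), key=lambda item: item[1], reverse=True)[:K]

    # move the query vector towards the pseudo-relevant documents
    new_query = dict(query_vec)
    for term, weight in kept:
        new_query[term] = new_query.get(term, 0.0) + alpha * weight
    return new_query

# Toy vectors for illustration only.
q = {"holocaust": 2.1, "survivors": 1.7}
docs = [{"holocaust": 1.2, "camp": 0.9, "survivors": 0.4},
        {"camp": 1.1, "liberation": 0.8}]
print(blind_relevance_feedback(q, docs, M=2, K=3))
```

The values of M, K and \alpha are configurable in such a setup; the defaults above are placeholders rather than the settings used in our experiments.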
3 Experimental Evaluation

We have created three different indices from the collection, using the original data, their lemmatized version and their stemmed version. There were 29 training topics and 42 evaluation topics defined by the organisers. We first ran the set of experiments for the training topics (see Table 1), comparing:

• Results obtained for the queries constructed by concatenating the tokens (either words, lemmas or stems) from the