<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Waterloo Experiments for the CLEF05 SDR Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charles L. A. Clarke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This year is the first year that the Information Retrieval Group at the University of Waterloo participated in CLEF. For the Cross-Language Spoken Document Retrieval track we submitted five official runs: three English automatic runs (title-only, title+desc, and title+desc+narr), a Czech automatic run (title-only), and a French automatic run (title-only). All official runs used a combination of several query formulation and expansion techniques, including phonetic n-grams and pseudo-relevance feedback expansion over a topic-specific external corpus crawled from the Web. In addition, a large number of unofficial runs were generated, including German and Spanish runs. This brief report provides an overview of our experiments, which are summarized in Figure 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Spoken Data Retrieval</kwd>
        <kwd>Okapi</kwd>
        <kwd>Query Expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>All our runs were generated by the Wumpus retrieval system using Okapi BM25 as the basic
retrieval method.</p>
      <p>The Wumpus implementation of Okapi BM25 is a variant of the formula given by Robertson
et al. [3]. Given a term set Q, a document d is assigned the score:</p>
      <p>score(Q, d) = Σ_{t ∈ Q} q_t · log(D/D_t) · d_t(k1 + 1)/(K + d_t)    (1)</p>
      <p>where
D = number of documents in the corpus,
D_t = number of documents containing t,
q_t = frequency that t occurs in the topic,
d_t = frequency that t occurs in d,
K = k1((1 − b) + b · l_d/l_avg),
l_d = length of d, and
l_avg = average document length.
All CLEF 2005 runs used parameter settings of k1 = 1.2 and b = 0.75.</p>
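      <p>A minimal Python sketch of this scoring function (a reimplementation for illustration only, not the actual Wumpus code; the function name and toy statistics are ours):</p>
      <preformat>
```python
import math

def bm25_score(query_terms, doc_terms, df, num_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 in the variant sketched above.
    query_terms: dict term to frequency in the topic (q_t)
    doc_terms:   dict term to frequency in the document (d_t)
    df:          dict term to number of documents containing it (D_t)
    """
    doc_len = sum(doc_terms.values())  # assumes doc_terms covers the whole document
    K = k1 * ((1 - b) + b * doc_len / avg_len)
    score = 0.0
    for term, q_t in query_terms.items():
        d_t = doc_terms.get(term, 0)
        if d_t == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log(num_docs / df[term])
        score += q_t * idf * d_t * (k1 + 1) / (K + d_t)
    return score
```
      </preformat>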
      <p>Many of our runs incorporated pseudo-relevance feedback, following the process described in
Yeung et al. [1]. For feedback purposes, we augmented the CLEF 2005 SDR corpus with a 2.5GB
corpus of Web data, generated by a topic-focused crawl, seeded from 17 sites dedicated to the
Holocaust. Each query was first executed against this augmented corpus. Terms were extracted
from the top results and added to the initial query, which was then executed against the SDR
corpus.</p>
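      <p>The feedback step can be sketched as follows (a simplified stand-in with a hypothetical interface, not the MultiText implementation; term selection there is more sophisticated than raw frequency):</p>
      <preformat>
```python
from collections import Counter

def expand_query(query, ranked_docs, num_feedback_docs=10, num_terms=20):
    """Pseudo-relevance feedback sketch: count terms in the top-ranked
    documents retrieved from the augmented corpus, then append the most
    frequent new terms to the original query."""
    counts = Counter()
    for doc in ranked_docs[:num_feedback_docs]:
        counts.update(doc)            # doc is a list of term strings
    stop = set(query)                 # do not re-add original query terms
    expansion = [t for t, _ in counts.most_common() if t not in stop]
    return query + expansion[:num_terms]
```
      </preformat>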
      <p>As an alternative to stemming, many runs were based on phoneme 4-grams. For these runs,
NIST's text-to-phone tool was applied to translate the words in the corpus into phoneme
sequences, which were then split into 4-grams and indexed. Queries were pre-processed in a similar
fashion before execution.</p>
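      <p>The n-gram split is straightforward; a sketch (the phoneme symbols in the example are illustrative, and the real inventory comes from the text-to-phone tool):</p>
      <preformat>
```python
def phoneme_ngrams(phonemes, n=4):
    """Split one word's phoneme sequence into overlapping n-grams; each
    joined n-gram becomes an index term."""
    count = len(phonemes) - n + 1
    if count > 0:
        return ["".join(phonemes[i:i + n]) for i in range(count)]
    return ["".join(phonemes)]  # short sequences yield a single term
```
      </preformat>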
      <p>Several runs, including our official English-language submissions, were generated by fusing
word and n-gram runs. For these runs, fusion was performed using the standard CombMNZ
algorithm [2].</p>
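      <p>A minimal sketch of CombMNZ over two runs, assuming min-max score normalization (a common choice; Fox and Shaw [2] leave the normalization to the implementer):</p>
      <preformat>
```python
def comb_mnz(run_a, run_b):
    """CombMNZ fusion: each document's fused score is the sum of its
    normalized scores times the number of runs that retrieved it."""
    fused = {}
    for run in (run_a, run_b):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0       # guard against a constant-score run
        for doc, score in run.items():
            norm = (score - lo) / span
            total, hits = fused.get(doc, (0.0, 0))
            fused[doc] = (total + norm, hits + 1)
    return {doc: total * hits for doc, (total, hits) in fused.items()}
```
      </preformat>
      <p>Documents retrieved by both runs are thus doubly rewarded: once through the summed scores and once through the multiplier.</p>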
      <p>Our non-English runs used translated queries supplied by the University of Ottawa group. The
reader should consult their CLEF 2005 paper for further information.</p>
    </sec>
    <sec id="sec-2">
      <title>Discussion</title>
      <p>On the training data, the fusion of feedback and phonetic n-gram runs produced a substantial
performance improvement over the baseline Okapi runs. Unfortunately, this improvement was
not seen on the test data, where feedback produced only a modest improvement and the phonetic
n-grams generally harmed performance.</p>
      <p>Next year, we hope to expand our participation in CLEF, including the evaluation of additional
speech-specific techniques in the context of the SDR track.</p>
      <table-wrap id="fig1">
        <label>Figure 1</label>
        <caption>
          <p>Summary of our CLEF 2005 SDR runs: mean average precision (map), bpref, and run configuration.</p>
        </caption>
        <table>
          <thead>
            <tr><th>run</th><th>map</th><th>bpref</th><th>configuration</th></tr>
          </thead>
          <tbody>
            <tr><td>uw5XET</td><td>0.090</td><td>0.113</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETD</td><td>0.099</td><td>0.128</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETDN</td><td>0.116</td><td>0.147</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETfb</td><td>0.100</td><td>0.127</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETDfb</td><td>0.110</td><td>0.140</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETDNfb</td><td>0.116</td><td>0.142</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETph</td><td>0.087</td><td>0.114</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XETDph</td><td>0.097</td><td>0.120</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XETfs</td><td>0.098</td><td>0.127</td><td>fusion of uw5XETfb and uw5XETph</td></tr>
            <tr><td>uw5XETDfs</td><td>0.112</td><td>0.139</td><td>fusion of uw5XETDfb and uw5XETDph</td></tr>
            <tr><td>uw5XETDNfs</td><td>0.114</td><td>0.141</td><td>fusion of uw5XETDNfb and uw5XETph</td></tr>
            <tr><td>uw5XCT</td><td>0.039</td><td>0.061</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XCTD</td><td>0.054</td><td>0.091</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XCTph</td><td>0.047</td><td>0.093</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XCTDph</td><td>0.055</td><td>0.095</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XFT</td><td>0.094</td><td>0.121</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XFTD</td><td>0.108</td><td>0.137</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XFTph</td><td>0.085</td><td>0.116</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XFTDph</td><td>0.101</td><td>0.122</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XGT</td><td>0.079</td><td>0.112</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XGTD</td><td>0.077</td><td>0.112</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XGTph</td><td>0.064</td><td>0.105</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XGTDph</td><td>0.072</td><td>0.108</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XST</td><td>0.087</td><td>0.109</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XSTD</td><td>0.092</td><td>0.121</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XSTph</td><td>0.086</td><td>0.122</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XSTDph</td><td>0.095</td><td>0.117</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XMT</td><td>0.224</td><td>0.224</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTD</td><td>0.235</td><td>0.243</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTDN</td><td>0.251</td><td>0.260</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTfb</td><td>0.226</td><td>0.244</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
            <tr><td>uw5XMTDfb</td><td>0.258</td><td>0.264</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
            <tr><td>uw5XMTDNfb</td><td>0.255</td><td>0.270</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>David L. Yeung, Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, and Egidio L. Terra. Task-Specific Query Expansion (MultiText Experiments for TREC 2003). In Twelfth Text REtrieval Conference. National Institute of Standards and Technology, 2003.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>E. A. Fox and J. A. Shaw. Combination of multiple searches. In Second Text REtrieval Conference. National Institute of Standards and Technology, 1994.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. In Seventh Text REtrieval Conference. National Institute of Standards and Technology, 1998.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>