<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Waterloo Experiments for the CLEF05 SDR Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charles L. A. Clarke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This year is the first year that the Information Retrieval Group at the University of Waterloo participated in CLEF. For the Cross-Language Spoken Document Retrieval track we submitted five official runs: three English automatic runs (title-only, title+desc, and title+desc+narr), a Czech automatic run (title-only), and a French automatic run (title-only). All official runs used a combination of several query formulation and expansion techniques, including phonetic n-grams and pseudo-relevance feedback expansion over a topic-specific external corpus crawled from the Web. In addition, a large number of unofficial runs were generated, including German and Spanish runs. This brief report provides an overview of our experiments, which are summarized in Figure 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Spoken Data Retrieval</kwd>
        <kwd>Okapi</kwd>
        <kwd>Query Expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>All our runs were generated by the Wumpus retrieval system using Okapi BM25 as the basic
retrieval method.</p>
      <p>The Wumpus implementation of Okapi BM25 is a variant of the formula given by Robertson
et al. [3]. Given a term set Q, a document d is assigned the score:</p>
      <p>score(Q, d) = Σ_{t ∈ Q} q_t · log(D/D_t) · d_t(k1 + 1)/(K + d_t)    (1)</p>
      <p>where
D = number of documents in the corpus,
D_t = number of documents containing t,
q_t = frequency that t occurs in the topic,
d_t = frequency that t occurs in d,
K = k1((1 − b) + b · l_d/l_avg),
l_d = length of d, and
l_avg = average document length.
All CLEF 2005 runs used parameter settings of k1 = 1.2 and b = 0.75.</p>
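      <p>A minimal Python sketch of this scoring function (a reimplementation for illustration only, not the actual Wumpus code; the function name and toy statistics are ours):</p>
      <preformat>
```python
import math

def bm25_score(query_terms, doc_terms, df, num_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 in the variant sketched above.
    query_terms: dict term to frequency in the topic (q_t)
    doc_terms:   dict term to frequency in the document (d_t)
    df:          dict term to number of documents containing it (D_t)
    """
    doc_len = sum(doc_terms.values())  # assumes doc_terms covers the whole document
    K = k1 * ((1 - b) + b * doc_len / avg_len)
    score = 0.0
    for term, q_t in query_terms.items():
        d_t = doc_terms.get(term, 0)
        if d_t == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log(num_docs / df[term])
        score += q_t * idf * d_t * (k1 + 1) / (K + d_t)
    return score
```
      </preformat>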
      <p>Many of our runs incorporated pseudo-relevance feedback, following the process described in
Yeung et al. [1]. For feedback purposes, we augmented the CLEF 2005 SDR corpus with a 2.5GB
corpus of Web data, generated by a topic-focused crawl, seeded from 17 sites dedicated to the
Holocaust. Each query was first executed against this augmented corpus. Terms were extracted
from the top results and added to the initial query, which was then executed against the SDR
corpus.</p>
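      <p>The feedback step can be sketched as follows (a simplified stand-in with a hypothetical interface, not the MultiText implementation; term selection there is more sophisticated than raw frequency):</p>
      <preformat>
```python
from collections import Counter

def expand_query(query, ranked_docs, num_feedback_docs=10, num_terms=20):
    """Pseudo-relevance feedback sketch: count terms in the top-ranked
    documents retrieved from the augmented corpus, then append the most
    frequent new terms to the original query."""
    counts = Counter()
    for doc in ranked_docs[:num_feedback_docs]:
        counts.update(doc)            # doc is a list of term strings
    stop = set(query)                 # do not re-add original query terms
    expansion = [t for t, _ in counts.most_common() if t not in stop]
    return query + expansion[:num_terms]
```
      </preformat>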
      <p>As an alternative to stemming, many runs were based on phoneme 4-grams. For these runs,
NIST's text-to-phone tool was applied to translate the words in the corpus into phoneme
sequences, which were then split into 4-grams and indexed. Queries were pre-processed in a similar
fashion before execution.</p>
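      <p>The n-gram split is straightforward; a sketch (the phoneme symbols in the example are illustrative, and the real inventory comes from the text-to-phone tool):</p>
      <preformat>
```python
def phoneme_ngrams(phonemes, n=4):
    """Split one word's phoneme sequence into overlapping n-grams; each
    joined n-gram becomes an index term."""
    count = len(phonemes) - n + 1
    if count > 0:
        return ["".join(phonemes[i:i + n]) for i in range(count)]
    return ["".join(phonemes)]  # short sequences yield a single term
```
      </preformat>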
      <p>Several runs, including our official English-language submissions, were generated by fusing
word and n-gram runs. For these runs, fusion was performed using the standard CombMNZ
algorithm [2].</p>
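      <p>A minimal sketch of CombMNZ over two runs, assuming min-max score normalization (a common choice; Fox and Shaw [2] leave the normalization to the implementer):</p>
      <preformat>
```python
def comb_mnz(run_a, run_b):
    """CombMNZ fusion: each document's fused score is the sum of its
    normalized scores times the number of runs that retrieved it."""
    fused = {}
    for run in (run_a, run_b):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0       # guard against a constant-score run
        for doc, score in run.items():
            norm = (score - lo) / span
            total, hits = fused.get(doc, (0.0, 0))
            fused[doc] = (total + norm, hits + 1)
    return {doc: total * hits for doc, (total, hits) in fused.items()}
```
      </preformat>
      <p>Documents retrieved by both runs are thus doubly rewarded: once through the summed scores and once through the multiplier.</p>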
      <p>Our non-English runs used translated queries supplied by the University of Ottawa group. The
reader should consult their CLEF 2005 paper for further information.</p>
    </sec>
    <sec id="sec-2">
      <title>Discussion</title>
      <p>On the training data, the fusion of feedback and phonetic n-gram runs produced a substantial
performance improvement over the baseline Okapi runs. Unfortunately, this improvement was
not seen on the test data, where feedback produced only a modest improvement and the phonetic
n-grams generally harmed performance.</p>
      <p>Next year, we hope to expand our participation in CLEF, including the evaluation of additional
speech-specific techniques in the context of the SDR track.</p>
      <table-wrap id="fig1">
        <label>Figure 1</label>
        <caption>
          <p>Summary of our CLEF 2005 SDR runs: mean average precision (map), bpref, and run configuration.</p>
        </caption>
        <table>
          <thead>
            <tr><th>run</th><th>map</th><th>bpref</th><th>configuration</th></tr>
          </thead>
          <tbody>
            <tr><td>uw5XET</td><td>0.090</td><td>0.113</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETD</td><td>0.099</td><td>0.128</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETDN</td><td>0.116</td><td>0.147</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XETfb</td><td>0.100</td><td>0.127</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETDfb</td><td>0.110</td><td>0.140</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETDNfb</td><td>0.116</td><td>0.142</td><td>stemming, feedback</td></tr>
            <tr><td>uw5XETph</td><td>0.087</td><td>0.114</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XETDph</td><td>0.097</td><td>0.120</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XETfs</td><td>0.098</td><td>0.127</td><td>fusion of uw5XETfb and uw5XETph</td></tr>
            <tr><td>uw5XETDfs</td><td>0.112</td><td>0.139</td><td>fusion of uw5XETDfb and uw5XETDph</td></tr>
            <tr><td>uw5XETDNfs</td><td>0.114</td><td>0.141</td><td>fusion of uw5XETDNfb and uw5XETph</td></tr>
            <tr><td>uw5XCT</td><td>0.039</td><td>0.061</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XCTD</td><td>0.054</td><td>0.091</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XCTph</td><td>0.047</td><td>0.093</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XCTDph</td><td>0.055</td><td>0.095</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XFT</td><td>0.094</td><td>0.121</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XFTD</td><td>0.108</td><td>0.137</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XFTph</td><td>0.085</td><td>0.116</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XFTDph</td><td>0.101</td><td>0.122</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XGT</td><td>0.079</td><td>0.112</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XGTD</td><td>0.077</td><td>0.112</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XGTph</td><td>0.064</td><td>0.105</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XGTDph</td><td>0.072</td><td>0.108</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XST</td><td>0.087</td><td>0.109</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XSTD</td><td>0.092</td><td>0.121</td><td>stemming, no feedback</td></tr>
            <tr><td>uw5XSTph</td><td>0.086</td><td>0.122</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XSTDph</td><td>0.095</td><td>0.117</td><td>phonetic 4-grams, no feedback</td></tr>
            <tr><td>uw5XMT</td><td>0.224</td><td>0.224</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTD</td><td>0.235</td><td>0.243</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTDN</td><td>0.251</td><td>0.260</td><td>MANUAL FIELDS, stemming, no feedback</td></tr>
            <tr><td>uw5XMTfb</td><td>0.226</td><td>0.244</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
            <tr><td>uw5XMTDfb</td><td>0.258</td><td>0.264</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
            <tr><td>uw5XMTDNfb</td><td>0.255</td><td>0.270</td><td>MANUAL FIELDS, stemming, feedback</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>David L. Yeung, Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, and Egidio L. Terra. Task-Specific Query Expansion (MultiText Experiments for TREC 2003). In Twelfth Text REtrieval Conference. National Institute of Standards and Technology, 2003.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>E. A. Fox and J. A. Shaw. Combination of multiple searches. In Second Text REtrieval Conference. National Institute of Standards and Technology, 1994.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. In Seventh Text REtrieval Conference. National Institute of Standards and Technology, 1998.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>