<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Waterloo Experiments for the CLEF05 SDR Track</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Charles</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
							<email>claclark@plg.uwaterloo.ca</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Waterloo Experiments for the CLEF05 SDR Track</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">895AE6E3E2F8DA3BCA3AAF14E9AF4458</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval Measurement</term>
					<term>Performance</term>
					<term>Experimentation Spoken Data Retrieval</term>
					<term>Okapi</term>
					<term>Query Expansion</term>
				</keywords>
			</textClass>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This year is the first year that the Information Retrieval Group at the University of Waterloo participated in CLEF. For the Cross-Language Spoken Document Retrieval track we submitted five official runs -three English automatic runs (title-only, title+desc, and title+desc+narr), a Czech automatic run (title-only) and a French automatic run (title-only). All official runs used a combination of several query formulation and expansion techniques, including phonetic n-grams and pseudo-relevance feedback expansion over a topic-specific external corpus crawled from the Web. In addition, a large number of un-official runs were generated, including German and Spanish runs. This brief report provides an overview of our experiments, which are summarized in figure <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Retrieval Methods</head><p>All our runs were generated by the Wumpus retrieval system 1 using Okapi BM25 as the basic retrieval method.</p><p>The Wumpus implementation of Okapi BM25 is a variant of the formula given by Robertson et al. <ref type="bibr" target="#b2">[3]</ref>. Given a term set Q, a document d is assigned the score:</p><formula xml:id="formula_0">t∈Q q t • log (D/D t ) (k 1 + 1)d t K + d t<label>(1)</label></formula><p>where D = number of documents in the corpus D t = number of documents containing t q t = frequency that t occurs in the topic</p><formula xml:id="formula_1">d t = frequency that t occurs in d K = k 1 ((1 − b) + b • l d /l avg ) l d = length of d l avg = average document length</formula><p>All CLEF 2005 runs used parameter settings of k 1 = 1.2 and b = 0.75. Many of our runs incorporated pseudo-relevance feedback, following the process described in Yeung et al. <ref type="bibr" target="#b0">[1]</ref>. For feedback purposes, we augmented the CLEF 2005 SDR corpus with a 2.5GB corpus of Web data, generated by a topic-focused crawl, seeded from 17 sites dedicated to the holocaust. Each query was first executed against this augmented corpus. Terms were extracted from the top results and added to the initial query, which was then executed against the SDR Corpus.</p><p>As an alternative to stemming, many runs were based on phoneme 4-grams. For these runs, NIST's text-to-phone tool<ref type="foot" target="#foot_0">2</ref> was applied to translate the words in the corpus into phoneme sequences, which were then split into 4-grams and indexed. Queries were pre-processed in a similar fashion before execution.</p><p>Several runs, including our official English-language submissions, were generated by fusing word and n-gram runs. For these runs, fusion was performed using the standard CombMNZ algorithm <ref type="bibr" target="#b1">[2]</ref>.</p><p>Our non-English runs used translated queries supplied by the University of Ottawa group. The reader should consult their CLEF 2005 paper for further information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Discussion</head><p>On the training data, the fusion of feedback and phonetic n-gram runs produced a substantial performance improvement over the baseline Okapi runs. Unfortunately, this the improvement was not seen on the test data, where feedback produced only a modest improvement and the phonetic n-grams generally harmed performance.</p><p>Next year, we hope to expand our participation in CLEF, including the evaluation of additional speech-specific techniques in the context of the SDR track. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Summary of runs and results. The name of submitted runs appear in boldface.</figDesc><table><row><cell cols="2">Lang run</cell><cell>map</cell><cell cols="3">bpref Fields Description</cell></row><row><cell>E</cell><cell>uw5XET</cell><cell cols="2">0.090 0.113</cell><cell>T</cell><cell>stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XETD</cell><cell cols="2">0.099 0.128</cell><cell>TD</cell><cell>stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XETDN</cell><cell cols="4">0.116 0.147 TDN stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XETfb</cell><cell cols="2">0.100 0.127</cell><cell>T</cell><cell>stemming, feedback</cell></row><row><cell>E</cell><cell>uw5XETDfb</cell><cell cols="2">0.110 0.140</cell><cell>TD</cell><cell>stemming, feedback</cell></row><row><cell>E</cell><cell>uw5XETDNfb</cell><cell cols="4">0.116 0.142 TDN stemming, feedback</cell></row><row><cell>E</cell><cell>uw5XETph</cell><cell cols="2">0.087 0.114</cell><cell>T</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>E</cell><cell>uw5XETDph</cell><cell cols="2">0.097 0.120</cell><cell>TD</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>E</cell><cell>uw5XETfs</cell><cell cols="2">0.098 0.127</cell><cell>T</cell><cell>fusion of uw5XETfb and uw5XETph</cell></row><row><cell>E</cell><cell>uw5XETDfs</cell><cell cols="2">0.112 0.139</cell><cell>TD</cell><cell>fusion of uw5XETDfb and uw5XETDph</cell></row><row><cell>E</cell><cell cols="5">uw5XETDNfs 0.114 0.141 TDN fusion of uw5XETDNfb and uw5XETph</cell></row><row><cell>C</cell><cell>uw5XCT</cell><cell cols="2">0.039 0.061</cell><cell>T</cell><cell>stemming, no feedback</cell></row><row><cell>C</cell><cell>uw5XCTD</cell><cell cols="2">0.054 0.091</cell><cell>TD</cell><cell>stemming, no feedback</cell></row><row><cell>C</cell><cell>uw5XCTph</cell><cell cols="2">0.047 0.093</cell><cell>T</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>C</cell><cell>uw5XCTDph</cell><cell cols="2">0.055 0.095</cell><cell>TD</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>F</cell><cell>uw5XFT</cell><cell cols="2">0.094 0.121</cell><cell>T</cell><cell>stemming, no feedback</cell></row><row><cell>F</cell><cell>uw5XFTD</cell><cell cols="2">0.108 0.137</cell><cell>TD</cell><cell>stemming, no feedback</cell></row><row><cell>F</cell><cell>uw5XFTph</cell><cell cols="2">0.085 0.116</cell><cell>T</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>F</cell><cell>uw5XFTDph</cell><cell cols="2">0.101 0.122</cell><cell>TD</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>G</cell><cell>uw5XGT</cell><cell cols="2">0.079 0.112</cell><cell>T</cell><cell>stemming, no feedback</cell></row><row><cell>G</cell><cell>uw5XGTD</cell><cell cols="2">0.077 0.112</cell><cell>TD</cell><cell>stemming, no feedback</cell></row><row><cell>G</cell><cell>uw5XGTph</cell><cell cols="2">0.064 0.105</cell><cell>T</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>G</cell><cell>uw5XGTDph</cell><cell cols="2">0.072 0.108</cell><cell>TD</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>S</cell><cell>uw5XST</cell><cell cols="2">0.087 0.109</cell><cell>T</cell><cell>stemming, no feedback</cell></row><row><cell>S</cell><cell>uw5XSTD</cell><cell cols="2">0.092 0.121</cell><cell>TD</cell><cell>stemming, no feedback</cell></row><row><cell>S</cell><cell>uw5XSTph</cell><cell cols="2">0.086 0.122</cell><cell>T</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>S</cell><cell>uw5XSTDph</cell><cell cols="2">0.095 0.117</cell><cell>TD</cell><cell>phonetic 4-grams, no feedback</cell></row><row><cell>E</cell><cell>uw5XMT</cell><cell cols="2">0.224 0.224</cell><cell>T</cell><cell>MANUAL FIELDS, stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XMTD</cell><cell cols="2">0.235 0.243</cell><cell>TD</cell><cell>MANUAL FIELDS, stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XMTDN</cell><cell cols="4">0.251 0.260 TDN MANUAL FIELDS, stemming, no feedback</cell></row><row><cell>E</cell><cell>uw5XMTfb</cell><cell cols="2">0.226 0.244</cell><cell>T</cell><cell>MANUAL FIELDS, stemming, feedback</cell></row><row><cell>E</cell><cell>uw5XMTDfb</cell><cell cols="2">0.258 0.264</cell><cell>TD</cell><cell>MANUAL FIELDS, stemming, feedback</cell></row><row><cell>E</cell><cell>uw5XMTDNfb</cell><cell cols="4">0.255 0.270 TDN MANUAL FIELDS, stemming, feedback</cell></row><row><cell cols="2">Figure 1:</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">www.nist.gov/speech/tools/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Task-Specific Query Expansion (MultiText Experiments for TREC</title>
		<author>
			<persName><forename type="first">L</forename><surname>David</surname></persName>
		</author>
		<author>
			<persName><surname>Yeung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Charles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gordon</forename><forename type="middle">V</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><forename type="middle">R</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Egidio</forename><forename type="middle">L</forename><surname>Lynam</surname></persName>
		</author>
		<author>
			<persName><surname>Terra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Twelfth Text REtrieval Conference</title>
				<imprint>
			<publisher>National Institute of Standards and Technology</publisher>
			<date type="published" when="2003">2003. 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Combination of multiple searches</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Shaw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Second Text REtrieval Conference</title>
				<imprint>
			<publisher>National Institute of Standards and Technology</publisher>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beaulieu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Seventh Text REtrieval Conference</title>
				<imprint>
			<publisher>National Institute of Standards and Technology</publisher>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
