<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The University of Amsterdam at the CLEF Cross Language Speech Retrieval Track 2007</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Bouke</forename><surname>Huurnink</surname></persName>
							<email>bhuurnin@science.uva.nl</email>
							<affiliation key="aff0">
								<orgName type="department">ISLA</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">The University of Amsterdam at the CLEF Cross Language Speech Retrieval Track 2007</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">253933F7A0BDECCB6AF88810E32A4DCC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T07:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech Retrieval</term>
					<term>Cross-Language Information Retrieval</term>
					<term>Text Transformations</term>
					<term>Field Combination</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we present the contents of the University of Amsterdam submission in the CLEF Cross Language Speech Retrieval 2007 English task. We describe the effects of using character n-grams and field combinations on both monolingual English retrieval, and crosslingual Dutch to English retrieval.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Even in a well-funded archive, it is often infeasible to manually annotate all documents in the collection. The digitisation of multimedia collections opens the door to automatic techniques for discovering interesting documents, provided that we can leverage automatically generated annotations to their best advantage. The University of Amsterdam participated in the CLEF CL-SR 2007 English task in the hope of applying lessons learned there to the retrieval of documents from large Dutch audio-visual archives, in particular the Netherlands Institute for Sound and Vision<ref type="foot" target="#foot_0">1</ref> which stores the nation's public television broadcasts. These archives contain a lot of spoken material, some of which has been manually annotated by a team of archivists. A significant portion, however, has not been annotated at all. Therefore we investigated strategies both for search using only automatically generated text, as well as combining this text with manually generated annotations.</p><p>Our focus was on simple techniques that can easily be transferred to other domains. In our experiments we explored the use of character n-grams to improve the retrieval of documents using automatically generated text. We also explored the combination of manually generated with automatically generated text. In both cases we contrasted monolingual retrieval of English documents using English queries with cross-lingual retrieval of English results using Dutch queries.</p><p>The remainder of this paper is structured as follows. We first describe the setup of the retrieval system and experiments in Section 2. This is followed by the runs and results in Section 3. Finally we present our conclusions in Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Experimental Setup</head><p>We work with the CLEF CL-SR experimental English spoken document collection, which consists of a series of English language interviews that have been manually split into short segments. Each segment has been associated with manually and automatically assigned metadata, including manual summaries, manually assigned keywords, automatic speech transcriptions, and a number of fields containing automatically assigned keywords. For experiments using only automatically assigned information we use the ASRTEXT2006B (speech transcription) and AUTOKEYWORD2004A2 (automatic keyword) fields. Other automatic transcripts and keywords were available, but we chose to use only one of each, which may have had a negative impact on our results. For experiments including manual annotations we also added the MANUALKEYWORD (manual keyword) and SUMMARY (manual summary) fields.</p><p>The CLEF CL-SR benchmark provided 63 training topics with a ground truth, as well as 33 test topics. The original topic descriptions are in English, and have the traditional TREC title-description-narrative structure. Also available were manually created Dutch topic translations, donated by the University of Twente. We used these Dutch topics for the cross-lingual runs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Retrieval Infrastructure</head><p>All documents were indexed and retrieved using the Indri engine from the Lemur retrieval toolkit<ref type="foot" target="#foot_1">2</ref>. This engine allows for fielded search in a language modeling framework. As is standard in English text retrieval, commonly occurring stop words were removed. Terms were stemmed to their morphological roots using the Porter <ref type="bibr" target="#b2">[3]</ref> stemming algorithm. Retrieval parameters were optimised for automatic monolingual retrieval on the ASRTEXT2006B field, using the training topics to find the best combination.</p><p>As for the topics, the title and description fields of each topic were combined to make a text query. The Dutch topics were automatically translated to English using online resources, in order to be able to retrieve the English documents. As different translation systems perform better for different topics <ref type="bibr" target="#b3">[4]</ref>, we used two different online tools to translate the topics from Dutch to English. We used the SYSTRAN<ref type="foot" target="#foot_2">3</ref> and FreeTranslation.com<ref type="foot" target="#foot_3">4</ref> systems, and combined the results to form a large 'bag-of-words' cross-lingual query. Some of the differences between translations can be seen in the example given in Table <ref type="table" target="#tab_0">1</ref>. For instance, the word 'acts' is translated into 'deeds' by FreeTranslation.com and 'prowesses' by SYSTRAN.</p></div>
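<div xmlns="http://www.tei-c.org/ns/1.0"><p>The combination of two machine translations into one bag-of-words query can be illustrated with a minimal Python sketch. The function name, the tokenisation rule, and the choice to keep duplicate words are assumptions made for illustration; the paper does not specify these details.</p><p><code lang="python"><![CDATA[
```python
import re


def bag_of_words_query(*translations):
    """Merge the tokens of several translations into one flat query.

    Assumption: duplicates are kept, so a word produced by both
    translation systems occurs twice in the bag.
    """
    tokens = []
    for text in translations:
        # Assumption: lowercase and keep alphabetic runs only.
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    return tokens


# Short fragments taken from the two translations in Table 1.
free_translation = "Tell of heroic deeds"
systran = "Tales of prowesses"
query = bag_of_words_query(free_translation, systran)
print(query)
```
]]></code></p><p>In such a query both 'deeds' and 'prowesses' are present, so a document matching either rendering of the original word 'acts' can still be retrieved.</p></div>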
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Character n-Gram Experiments</head><p>Character n-gram tokenisation has been shown to boost retrieval in certain situations <ref type="bibr" target="#b1">[2]</ref>, such as retrieval from English newspapers <ref type="bibr" target="#b0">[1]</ref>. We were interested to see whether this would also prove useful for the specific situation of (cross-lingual) retrieval of automatically generated text. To test this, we followed the tokenisation strategy in <ref type="bibr" target="#b1">[2]</ref>, and created overlapping, cross-word character n-grams of the text before it was indexed. An example is shown in Table <ref type="table" target="#tab_1">2</ref>. In designing the experiment, we used only the (weighted) ASRTEXT2006B and AUTOKEYWORD2004A2 fields.</p><p>We evaluated MAP for retrieval at different n-gram sizes on the training topics prior to submission, and found that 4-grams provided the best performance. Likewise, we evaluated different weightings for the ASRTEXT2006B and AUTOKEYWORD2004A2 fields. Here we found the best setting to be ASRTEXT2006B = 0.75 and AUTOKEYWORD2004A2 = 0.25. These, then, are the settings that we used in our officially submitted runs.</p></div>
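<div xmlns="http://www.tei-c.org/ns/1.0"><p>The overlapping, cross-word tokenisation can be sketched in a few lines of Python. This is an illustrative reconstruction based on the example in Table 2, not the code used for the submitted runs; the function name and the use of '*' as the word-boundary marker (taken from Table 2) are assumptions.</p><p><code lang="python"><![CDATA[
```python
def char_ngrams(text, n=4, boundary="*"):
    """Return overlapping character n-grams that span word boundaries."""
    # Replace every run of whitespace by a single boundary marker so
    # that n-grams can straddle two adjacent words.
    joined = boundary.join(text.split())
    # Slide a window of width n over the joined string, one step at a time.
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]


grams = char_ngrams("heroic survivals story")
print(" ".join(grams))
```
]]></code></p><p>For the three-word example this yields 19 overlapping 4-grams, from 'hero' to 'tory', including boundary-spanning ones such as 'ic*s'.</p></div>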
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Field Combination Experiments</head><p>We evaluated field combination, as we may later wish to apply this technique to retrieving the annotated portion of multimedia documents in an audio-visual archive. Combination was done using the Indri query language, giving different fields different weights. The fields that we used were MANUALKEYWORD, SUMMARY, ASRTEXT2006B, and AUTOKEYWORD2004A2.</p><p>As with the n-gram experiments, we determined the optimal combination setting on the set of 63 training topics that were provided, using MAP as our evaluation measure. We found that the best weighting for monolingual retrieval was MANUALKEYWORD = 0.375, SUMMARY = 0.375, ASRTEXT2006B = 0.125, and AUTOKEYWORD2004A2 = 0.125. For the cross-lingual task, the automatic keywords gave no contribution to retrieval performance and the best weighting was MANUALKEYWORD = 0.375, SUMMARY = 0.375 and ASRTEXT2006B = 0.25.</p></div>
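<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way such a weighted field combination might be written down is sketched below in Python, which assembles a query string from per-field weights. The exact query form used for the submitted runs is not given in this paper; the use of the Indri #weight and #combine operators with the term.(field) notation, the helper name, and the way field names are spelled in the index are all assumptions for illustration.</p><p><code lang="python"><![CDATA[
```python
def weighted_field_query(terms, field_weights):
    """Build a hypothetical Indri-style query that weights each field."""
    parts = []
    for field, weight in field_weights.items():
        if weight == 0:
            # A field with zero weight contributes nothing, so skip it.
            continue
        # Assumption: term.(field) scores the term within that field.
        field_terms = " ".join(f"{term}.({field})" for term in terms)
        parts.append(f"{weight} #combine( {field_terms} )")
    return "#weight( " + " ".join(parts) + " )"


# Best monolingual weighting reported in Section 2.3.
MONOLINGUAL_WEIGHTS = {
    "MANUALKEYWORD": 0.375,
    "SUMMARY": 0.375,
    "ASRTEXT2006B": 0.125,
    "AUTOKEYWORD2004A2": 0.125,
}

query = weighted_field_query(["heroic", "survival"], MONOLINGUAL_WEIGHTS)
print(query)
```
]]></code></p><p>For the cross-lingual setting the same helper could be called with ASRTEXT2006B = 0.25 and AUTOKEYWORD2004A2 = 0, in which case the automatic keyword field is left out of the query altogether.</p></div>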
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Runs and Results</head><p>Table <ref type="table" target="#tab_2">3</ref> shows the results of the official runs submitted to CLEF CL-SR. Also shown are two runs that were generated post-hoc to allow fair comparison of the n-gram techniques to a baseline. The post-hoc runs were both generated using stopped and stemmed text from both the ASRTEXT2006B and AUTOKEYWORD2004A2 fields, weighted as described in Section 2.2.</p><p>Examining the n-gram runs, we found that monolingual retrieval of the automatic fields using character 4-grams decreased MAP by 9.6% relative to the corresponding post-hoc baseline. Cross-lingual retrieval, on the other hand, benefited from the use of 4-grams with an increase in MAP of 4%.</p><p>The combination runs, which included both manual and automatic information, performed much better than runs containing only automatically derived text. This is not surprising: it has been demonstrated in previous CLEF CL-SR tracks that manual annotation allows much better retrieval than automatic information alone. The weightings derived in the training phase indicate that automatically generated keywords are helpful for monolingual retrieval, but do not help for this specific case of cross-lingual retrieval. Automatically recognised speech, however, was useful for both monolingual and cross-lingual retrieval.</p></div>
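<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relative MAP changes quoted for the n-gram runs follow from the MAP values in Table 3, taking the two unsubmitted word-based runs as baselines. The short Python sketch below reproduces the arithmetic; the function name is hypothetical.</p><p><code lang="python"><![CDATA[
```python
def relative_change(baseline, value):
    """Percentage change of value with respect to baseline."""
    return (value - baseline) / baseline * 100.0


# MAP values from Table 3: unsubmitted word-based run vs. 4-gram run.
mono = relative_change(0.0491, 0.0444)   # monolingual
cross = relative_change(0.0385, 0.0400)  # cross-lingual
print(f"monolingual: {mono:+.1f}%  cross-lingual: {cross:+.1f}%")
```
]]></code></p><p>Rounded to one decimal place this gives a decrease of 9.6% for the monolingual run and an increase of 3.9% for the cross-lingual run, the latter being the figure reported as 4% in the text.</p></div>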
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions</head><p>This paper has described the setup and performance of the University of Amsterdam's entry in the CLEF CL-SR 2007 English retrieval task. We investigated the effect of using n-grams to retrieve automatically generated text, finding that they decreased monolingual performance but improved cross-lingual performance. Furthermore, we examined the effects of combining manual and automatically generated text, and saw that both can be useful. We hope that the lessons learned here will aid us in practice, and help us enhance search through Dutch audio-visual archives.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Sample Topic Translations (Topic 15602)</figDesc><table><row><cell>Original English Topic</cell><cell>FreeTranslation.com</cell><cell>SYSTRAN</cell></row><row><cell>Heroic survival stories. Stories of heroic acts or activities that led to the survival of one or more individuals are desired.</cell><cell>Heroic survivals story. Tell of heroic deeds or heroic actions that led till the (save) [survive] of an or several individuals have been wished.</cell><cell>Herosche overlevingsverhalen. Tales of prowesses or herosche action which led to (save) [survive] of one or more individuals have been needed.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2:</head><label>2</label><figDesc>n-Gram Tokenisation Example</figDesc><table><row><cell>Original Text</cell><cell>4-Grams</cell></row><row><cell>heroic survivals story</cell><cell>hero eroi roic oic* ic*s c*su *sur surv urvi rviv viva ival vals als* ls*s s*st *sto stor tory</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3:</head><label>3</label><figDesc>Results of CLEF CL-SR runs</figDesc><table><row><cell>Run ID</cell><cell>Type</cell><cell>Fields</cell><cell>MAP</cell></row><row><cell>UvA 1 base</cell><cell>monolingual baseline</cell><cell>ASRTEXT2006B</cell><cell>0.0430</cell></row><row><cell>UvA 2 en4g</cell><cell>monolingual 4-grams</cell><cell>ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.0444</cell></row><row><cell>UvA 3 nl4g</cell><cell>cross-lingual 4-grams</cell><cell>ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.0400</cell></row><row><cell>UvA 4 enopt</cell><cell>monolingual combination</cell><cell>MANUALKEYWORD, SUMMARY, ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.2088</cell></row><row><cell>UvA 5 nlopt</cell><cell>cross-lingual combination</cell><cell>MANUALKEYWORD, SUMMARY, ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.1408</cell></row><row><cell>unsubmitted run</cell><cell>monolingual</cell><cell>ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.0491</cell></row><row><cell>unsubmitted run</cell><cell>cross-lingual</cell><cell>ASRTEXT2006B, AUTOKEYWORD2004A2</cell><cell>0.0385</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.beeldengeluid.nl/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.lemurproject.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.systran.co.uk/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://www.freetranslation.com/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was supported by the Netherlands Organisation for Scientific Research (NWO) MUNCH project under project number 640.002.501.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Monolingual document retrieval for European languages</title>
		<author>
			<persName><forename type="first">V</forename><surname>Hollink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Monz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>de Rijke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Retr</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="33" to="52" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Character n-gram tokenization for European language text retrieval</title>
		<author>
			<persName><forename type="first">P</forename><surname>McNamee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mayfield</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Retr</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="73" to="97" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An algorithm for suffix stripping</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Readings in information retrieval</title>
				<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="313" to="316" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Report on CLEF-2003 multilingual tracks</title>
		<author>
			<persName><forename type="first">Jacques</forename><surname>Savoy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Carol</forename><surname>Peters</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Julio</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Martin</forename><surname>Braschler</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Michael</forename><surname>Kluck</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">3237</biblScope>
			<biblScope unit="page" from="64" to="73" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
