<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ELiRF at MediaEval 2013: Spoken Web Search Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jon</forename><forename type="middle">A</forename><surname>Gómez</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Lluís-F</forename><surname>Hurtado</surname></persName>
							<email>lhurtado@dsic.upv.es</email>
						</author>
						<author>
							<persName><forename type="first">Marcos</forename><surname>Calvo</surname></persName>
							<email>mcalvo@dsic.upv.es</email>
						</author>
						<author>
							<persName><forename type="first">Emilio</forename><surname>Sanchis</surname></persName>
							<email>esanchis@dsic.upv.es</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Departament de Sistemes Informàtics i Computació</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<addrLine>Camí de Vera s/n</addrLine>
									<postCode>46020</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ELiRF at MediaEval 2013: Spoken Web Search Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9E00724E0FF162663C28182E4F6FFDC4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T17:58+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) submitted to the MediaEval 2013 Spoken Web Search task. All of them are zero-resource systems based on a Subsequence Dynamic Time Warping algorithm.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In this paper, we present the systems that we submitted to the MediaEval 2013 Spoken Web Search task <ref type="bibr" target="#b1">[2]</ref>. This task falls within the framework of Query-by-Example Spoken Term Detection (QbE-STD), where a set of documents and a set of queries are provided and the goal is to find all the occurrences of each query within each document in the collection. In this particular case, a variety of languages and acoustic conditions are represented, but no information about them is provided to the participants.</p><p>All the systems we submitted to the MediaEval 2013 evaluation are based on a Subsequence Dynamic Time Warping (S-DTW) algorithm <ref type="bibr" target="#b0">[1]</ref>, but they use different distances, sets of allowed movements, and feature vectors. In addition, all our systems are zero-resource systems, that is, they use no external information, only the data provided by the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">DESCRIPTION OF THE SYSTEMS</head><p>For this task, we submitted four different systems, all of them based on the S-DTW algorithm. S-DTW is a Dynamic Programming (DP) algorithm whose aim is to find multiple local alignments of two input sequences of objects using a set of allowed movements, while allowing one of the sequences to start at any position of the other. Equation <ref type="formula">1</ref> shows the generic formulation of the S-DTW algorithm.</p><formula xml:id="formula_0">M(i,j) = \begin{cases} +\infty &amp; i &lt; 0 \\ +\infty &amp; j &lt; 0 \\ 0 &amp; j = 0 \\ \min\limits_{\forall (x,y) \in S} M(i-x, j-y) + D(A(i), B(j)) &amp; j \geq 1 \end{cases} \tag{1}</formula><p>where M is the DP matrix; S is the set of allowed movements, represented as pairs (x, y) of horizontal and vertical increments; A(i) and B(j) are the objects at positions i and j of their respective sequences; and D is a function that computes some distance or dissimilarity between two objects.</p><p>In this task, the sequences of objects to be aligned are the sequences of feature vectors obtained from the audio files corresponding to the documents and the queries. This approach allows us to find the best alignment of the query in each document taking every frame of the document as a starting point. For this work, we used the cosine distance in all our systems, since it provided the best results on the development set.</p><p>Each of the computed alignments should be considered a candidate detection. Hence, this strategy produces far too many candidates. It is therefore necessary to define a criterion for selecting the definitive detections among these candidates.</p><p>Another step common to all our systems is that, as part of the preprocessing, we removed the leading and trailing silences of the queries by using a Voice Activity Detection strategy based on a Schmitt trigger. This improved the performance of our systems.</p><p>Thus, our systems differ in three aspects: how the feature vectors are obtained, how the definitive detections are selected among the candidates, and which movements are allowed in the Dynamic Programming algorithm.</p><p>Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain</p></div>
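<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recurrence of Equation 1 can be illustrated with a minimal Python sketch. It is not the software used for the submitted runs. The function names sdtw and cosine_distance, the list-of-lists frame representation, and the convention that frames are indexed from 1 while index 0 is a virtual start position are assumptions made for illustration. The normalization by path length used in System 1 is not included.</p>
<p><code lang="python"><![CDATA[
```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def sdtw(doc, query, moves=((1, 1), (1, 2), (2, 1)), dist=cosine_distance):
    """Fill the DP matrix of Equation 1 and return its last query row.

    i indexes document frames (1..len(doc)) and j indexes query frames
    (1..len(query)). Column j = 0 is the free starting position (assumption).
    The result holds M(i, len(query)) for i = 0..len(doc).
    """
    n, m = len(doc), len(query)
    inf = math.inf
    # M[i][0] = 0 lets the query start at any position of the document.
    M = [[0.0] + [inf] * m for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = inf
            for x, y in moves:
                pi, pj = i - x, j - y
                # Negative indices have cost +inf, so they are skipped.
                if pi >= 0 and pj >= 0:
                    best = min(best, M[pi][pj])
            if best < inf:
                M[i][j] = best + dist(doc[i - 1], query[j - 1])
    return [M[i][m] for i in range(n + 1)]
```
]]></code></p>
<p>Each finite value in the returned row corresponds to one candidate detection ending at that document frame, which is why a filtering step is needed afterwards.</p></div>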
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">System 1</head><p>In this system, the acoustic signal is parametrized using the energy, the first twelve cepstral coefficients, and their first and second derivatives, using a sampling period of 10 ms. Thus, we represent each frame as a 39-dimensional vector. Then, we perform the S-DTW step using a particular set of movements: {(1,2), (1,1), (2,1)}. Also, in each step of the S-DTW algorithm, we have kept and maximized the accumulated distance normalized by the number of operations carried out up to that point. The set of candidate detections consists of all the hypotheses that reach any cell corresponding to the last frame of the query in the DP matrix. Furthermore, the set of movements used guarantees that the length of any detection is between 0.5 and 2 times the length of the query. These candidates are filtered using Algorithm 1. The idea of this algorithm is to find all the local minima that do not overlap any other local minimum with a better score, and then to fix a threshold according to a linear combination of the average and standard deviation of the scores of the "cleaned" set of local minima (the parameter λ of this linear combination is empirically adjusted). Also, a maximum number d of filtered detections per query is allowed. In this system, we have adjusted the parameters in order to obtain just a few definitive detections per query.</p><p>Algorithm 1 (steps 6 to 10)<lb/>6: Delete from SCD all the detections h′ such that timespan(h′) ∩ timespan(h) ≠ ∅<lb/>7: end while<lb/>8: t = avg + λ · sd, where avg and sd represent the average and the standard deviation of the elements in FD2<lb/>9: FD = first d elements of FD2 with a score ≥ t<lb/>10: return FD</p></div>
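<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filtering idea of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation. It assumes that detections are (start, end, score) tuples, that a higher score is better, that step 6 removes the detections whose time span overlaps that of h, and that the standard deviation is the population one. The name filter_detections is hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import statistics


def filter_detections(cd, d, lam):
    """Sketch of Algorithm 1 for a list cd of (start, end, score) tuples."""
    # Step 1: sort by score, best first (assumption: higher is better).
    scd = sorted(cd, key=lambda h: h[2], reverse=True)
    fd2 = []
    # Steps 3 to 7: keep the best remaining detection, drop what overlaps it.
    while scd:
        h = scd.pop(0)
        fd2.append(h)
        scd = [g for g in scd if g[1] <= h[0] or g[0] >= h[1]]
    if not fd2:
        return []
    scores = [h[2] for h in fd2]
    # Step 8: threshold from the mean and the standard deviation of the scores.
    t = statistics.mean(scores) + lam * statistics.pstdev(scores)
    # Step 9: at most d detections whose score reaches the threshold.
    return [h for h in fd2 if h[2] >= t][:d]
```
]]></code></p>
<p>In this sketch, a larger λ raises the threshold and a smaller d caps the output, so both parameters make the filter more restrictive.</p></div>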
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">System 2</head><p>This system is very similar to System 1, but the thresholds were adjusted in a less restrictive way. The number of hypotheses provided by this system is much larger than for System 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">System 3</head><p>This system uses the same parametrization as Systems 1 and 2. However, the allowed movements for the S-DTW are {(0,1), (1,0), (1,1)}. The algorithm used to filter the candidate detections is also slightly different (see Algorithm 2). In this algorithm, a local minimum is kept only if (i) its value is larger than a threshold and (ii) there is no other detection with a better score within a window of 2 seconds. Finally, at most n occurrences per query and k detections per document are allowed.</p><p>Algorithm 2 Another way of filtering a list of candidate detections<lb/>Require: A list of candidate detections CD, a maximum number of occurrences per query n, a maximum number of detections per document k<lb/>Ensure: A list of filtered detections FD<lb/>1: SCD = empty list<lb/>2: for all Query q do<lb/>3: for all Document d do<lb/>4: m = minimum score of a detection of q within d<lb/>5: M = maximum score of a detection of q within d<lb/>6: <formula xml:id="formula_1">t = m + 0.1(M − m)</formula><lb/>7: Add to SCD all the hypotheses from CD with a score larger than t and that do not overlap a better detection within a window of 2 seconds.<lb/>8: end for<lb/>9: end for<lb/>10: FDP = For each query, keep the at most n best occurrences in SCD<lb/>11: FD = For each document, keep the at most k best detections in FDP<lb/>12: return FD</p></div>
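<div xmlns="http://www.tei-c.org/ns/1.0"><p>A possible Python reading of Algorithm 2 is sketched below. It is only an illustration under several assumptions: detections are (query, document, time, score) tuples, a higher score is better, and two detections fall in the same window when their times differ by less than 2 seconds. The names filter_detections_2 and keep_best are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
from collections import defaultdict


def keep_best(dets, key_index, limit):
    """Keep the `limit` best-scoring detections for each value of dets[key_index]."""
    buckets = defaultdict(list)
    for h in dets:
        buckets[h[key_index]].append(h)
    kept = []
    for bucket in buckets.values():
        kept.extend(sorted(bucket, key=lambda h: h[3], reverse=True)[:limit])
    return kept


def filter_detections_2(cd, n, k, window=2.0):
    """Sketch of Algorithm 2 for a list cd of (query, doc, time, score) tuples."""
    groups = defaultdict(list)
    for h in cd:
        groups[(h[0], h[1])].append(h)
    scd = []
    for dets in groups.values():
        lo = min(h[3] for h in dets)
        hi = max(h[3] for h in dets)
        t = lo + 0.1 * (hi - lo)  # step 6
        for h in dets:
            # A detection is beaten if a better one lies within the window.
            beaten = any(g[3] > h[3] and abs(g[2] - h[2]) < window for g in dets)
            if h[3] > t and not beaten:  # step 7
                scd.append(h)
    fdp = keep_best(scd, key_index=0, limit=n)   # step 10: per query
    return keep_best(fdp, key_index=1, limit=k)  # step 11: per document
```
]]></code></p>
<p>Because the comparison with t is strict, a query and document pair with a single candidate yields no detection in this sketch, since its minimum and maximum scores coincide.</p></div>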
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">System 4</head><p>This system is similar to System 3, but the feature vectors are obtained in a different way, by using a Dissimilarity Space. First, 300 frames are selected from the development set by applying the Katsavounidis criterion with the cosine distance as the metric <ref type="bibr" target="#b2">[3]</ref>. Then each frame is mapped into the dissimilarity space, where each component of the new feature vector is the distance from the frame to one of the 300 frames taken as references. Thus, in this system the feature vectors have 300 dimensions. All the frames from both the documents and the queries are converted to this Dissimilarity Space, and the S-DTW is performed using these vectors.</p></div>
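<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dissimilarity-space mapping can be sketched in Python as follows. This is a minimal illustration, not the code used for System 4. The reference selection is a maximin procedure in the spirit of the Katsavounidis criterion, and starting from the frame with the largest Euclidean norm is an assumption. The names select_references and to_dissimilarity_space are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def select_references(frames, count, dist=cosine_distance):
    """Maximin selection of `count` reference frames from `frames`."""
    indices = range(len(frames))
    # Assumption: start from the frame with the largest Euclidean norm.
    first = max(indices, key=lambda i: sum(a * a for a in frames[i]))
    chosen = [first]
    # nearest[i] is the distance from frame i to its closest chosen reference.
    nearest = [dist(f, frames[first]) for f in frames]
    while len(chosen) < min(count, len(frames)):
        # Pick the frame that is farthest from all references chosen so far.
        nxt = max((i for i in indices if i not in chosen), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(nearest[i], dist(frames[i], frames[nxt])) for i in indices]
    return [frames[i] for i in chosen]


def to_dissimilarity_space(frame, references, dist=cosine_distance):
    """Map a frame to the vector of its distances to every reference frame."""
    return [dist(frame, r) for r in references]
```
]]></code></p>
<p>With 300 references, to_dissimilarity_space returns the 300-dimensional vectors described above, and the S-DTW then runs on those vectors instead of the original 39-dimensional ones.</p></div>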
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS AND RESULTS</head><p>For this MediaEval 2013 Spoken Web Search evaluation, we submitted one run for each of the four systems described above. The results we obtained are shown in Tables <ref type="table" target="#tab_2">1 and 2</ref>, where P stands for Precision and R for Recall. Table 2 also shows the Real Time factor (RT) obtained for the test set; its value for the development set is very similar. All the software for the systems presented here was developed entirely in our research group. All these systems were run on a standard PC with an i7 processor and 32 GB of RAM, using 8 threads. The memory peaks were around 12 GB for Systems 1 and 2, and around 1 GB for Systems 3 and 4.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Algorithm 1</head><label>1</label><figDesc>Algorithm to filter a list of candidate detections Require: A list of candidate detections CD, a maximum number of filtered detections d, a coefficient λ Ensure: A list of filtered detections FD 1: SCD = sort the hypotheses in CD by their score 2: FD2 = empty list 3: while SCD is not empty do 4: h = first element of SCD 5: Move h to FD2 6:</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1:</head><label>1</label><figDesc>Results obtained for the development set.</figDesc><table><row><cell>System</cell><cell>MTWV</cell><cell>ATWV</cell><cell>P(%)</cell><cell>R(%)</cell><cell>Cnxe</cell></row><row><cell>Sys. 1</cell><cell>0.1699</cell><cell>0.1697</cell><cell>3.47</cell><cell>15.69</cell><cell>2.45</cell></row><row><cell>Sys. 2</cell><cell>0.1296</cell><cell>0.1291</cell><cell>2.21</cell><cell>16.71</cell><cell>3.91</cell></row><row><cell>Sys. 3</cell><cell>0.1480</cell><cell>0.1478</cell><cell>3.18</cell><cell>14.37</cell><cell>1.03</cell></row><row><cell>Sys. 4</cell><cell>0.1463</cell><cell>0.1461</cell><cell>2.55</cell><cell>15.76</cell><cell>1.00</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2:</head><label>2</label><figDesc>Results obtained for the test set.</figDesc><table><row><cell>Sys.</cell><cell>MTWV</cell><cell>ATWV</cell><cell>P(%)</cell><cell>R(%)</cell><cell>Cnxe</cell><cell>RT</cell></row><row><cell>S. 1</cell><cell>0.1593</cell><cell>0.1591</cell><cell>3.29</cell><cell>14.89</cell><cell>2.53</cell><cell>3·10⁻³</cell></row><row><cell>S. 2</cell><cell>0.1016</cell><cell>0.1016</cell><cell>1.99</cell><cell>12.44</cell><cell>4.83</cell><cell>3·10⁻³</cell></row><row><cell>S. 3</cell><cell>0.1481</cell><cell>0.1475</cell><cell>3.03</cell><cell>13.66</cell><cell>1.03</cell><cell>5·10⁻⁴</cell></row><row><cell>S. 4</cell><cell>0.1462</cell><cell>0.1457</cell><cell>2.47</cell><cell>15.08</cell><cell>1.00</cell><cell>2·10⁻³</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">ACKNOWLEDGEMENTS</head><p>Work funded by the Spanish Government and the E.U. under contract TIN2011-28169-C05 and FPU Grant AP2010-4193.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Memory efficient subsequence DTW for Query-by-Example spoken term detection</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ferrarons</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Multimedia and Expo</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The Spoken Web Search Task</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szoke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2013 Workshop</title>
				<imprint>
			<date type="published" when="2013-10">October 18-19, 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A new initialization technique for generalized Lloyd iteration</title>
		<author>
			<persName><forename type="first">I</forename><surname>Katsavounidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C. Jay</forename><surname>Kuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Signal Processing Letters</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="144" to="146" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
	<note>IEEE</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
