<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">IIIT-H System for MediaEval 2014 QUESST</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Santosh</forename><surname>Kesiraju</surname></persName>
							<email>santosh.k@research.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gautam</forename><surname>Mantena</surname></persName>
							<email>gautam.mantena@research.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kishore</forename><surname>Prahallad</surname></persName>
							<email>kishore@iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">IIIT-H System for MediaEval 2014 QUESST</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">824823D945A4F236D99C19FC076ED97A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes our experiments and observations for the Query-by-Example Search on Speech Task (QUESST) at MediaEval 2014. We describe two different representations of speech that were explored for the task, and show the capabilities and limitations of the non-segmental dynamic time warping (NS-DTW) technique for searching various types of queries. The paper focuses on the analysis of the existing NS-DTW algorithm across query types; the observations show that, for a specific representation of speech, the algorithm is capable of detecting partial matches.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Some approaches to query-by-example spoken term detection rely on models built from resource-rich languages, which are then used to convert the speech data into a sequence of symbols. Building such models for multi-lingual data is challenging because phone classes are not universal across languages. An alternative is to rely on dynamic time warping (DTW) based techniques for matching two time series. Here, the speech data is usually represented as Gaussian posteriorgrams (GP) of various acoustic features.</p><p>For the MediaEval 2014 QUESST task <ref type="bibr" target="#b3">[2]</ref>, we explored unsupervised techniques involving various representations of the speech data. Initially, we represented the speech data using GP of acoustic and bottle-neck features. We also built a cross-lingual ASR system and decoded the speech data into sequences of symbols (phone sequences). Both representations rely on DTW to detect the queries in the audio references.</p></div>
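As background for the DTW-based matching mentioned above, a minimal textbook DTW between two feature-vector sequences can be sketched as follows. This is an illustration with our own function names, not the authors' NS-DTW implementation (which uses different local constraints, see Section 3):

```python
import numpy as np

def dtw_distance(query, ref):
    """Basic DTW between two sequences of feature vectors.

    query, ref: arrays of shape (T1, D) and (T2, D).
    Returns the accumulated alignment cost (lower means a better match).
    """
    t1, t2 = len(query), len(ref)
    # Pairwise frame distances (Euclidean here; posteriorgram-based
    # systems often use the negative log inner product instead).
    dist = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)

    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            # Symmetric local constraints: match, insertion, deletion.
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    return acc[t1, t2]
```

Identical sequences align along the diagonal with zero cost; mismatched or truncated sequences accumulate a positive cost.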
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">FEATURE REPRESENTATION</head><p>The features for the queries and the audio references were generated in a three-step process: (a) 39-dimensional frequency domain linear prediction (FDLP) features, along with delta and acceleration coefficients, were extracted for every 25 ms window with a shift of 10 ms; an all-pole model of order 160 poles/sec and 37 filter banks were used to extract the FDLP features. (b) Bottle-neck (BN) features were derived from a multi-layer perceptron (MLP) trained with articulatory features (AF). (c) Gaussian posteriorgrams were computed for the speech parameters (FDLP) in tandem with the articulatory bottle-neck features. Bottle-neck features are compressed features of lower dimension that also capture the classification properties of the target classes. These features were obtained from an MLP trained on 24 hours of a labeled Telugu database <ref type="bibr" target="#b4">[3]</ref>. The articulatory bottle-neck features were extracted as described in <ref type="bibr" target="#b6">[5]</ref>.</p><p>Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.</p></div>
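The Gaussian posteriorgram step (c) can be sketched as follows: each frame is replaced by its posterior probabilities over the components of a Gaussian mixture. This is a self-contained sketch under our own naming, not the authors' code; in practice the mixture parameters are trained once on all the data rather than supplied per utterance:

```python
import numpy as np

def gaussian_posteriorgram(frames, means, variances, weights):
    """Posterior of each diagonal-covariance Gaussian component per frame.

    frames: (T, D) feature frames (e.g. FDLP stacked with BN features).
    means, variances: (K, D) component parameters; weights: (K,) priors.
    Returns a (T, K) posteriorgram whose rows sum to 1.
    """
    # Log-density of each frame under each diagonal Gaussian.
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    log_det = np.sum(np.log(variances), axis=1)              # (K,)
    maha = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, K)
    d = frames.shape[1]
    log_pdf = -0.5 * (maha + log_det + d * np.log(2 * np.pi))
    log_post = np.log(weights)[None, :] + log_pdf
    # Normalise in the log domain for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```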
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">NS-DTW FOR SEARCH</head><p>We used a variant of DTW called non-segmental DTW (NS-DTW) <ref type="bibr" target="#b5">[4]</ref>, which differs from standard DTW in its local constraints. As a post-processing step, we pruned some of the results. The pruning criterion is based on the slope of the aligned path: if m is the slope of the aligned path, only paths satisfying (0.5 &lt; m &lt; 2) were considered. This helped eliminate some of the false alarms. We used the linear calibration function in the Bosaris toolkit<ref type="foot" target="#foot_0">1</ref> to calibrate the scores. Table <ref type="table" target="#tab_0">1</ref> shows the results on the development and evaluation datasets for different types of queries. All experiments were performed on a single HP SL230 node equipped with two Intel E5-2640 processors with 12 cores each and 64 GB of main memory. The peak memory usage (PMU) was approximately 12 GB, and the searching speed factor (SSF) was 3.46.</p><p>To increase the search speed, the distance computation was parallelized on a GPU (NVIDIA GT 610 with 48 cores and 2 GB of GPU memory), which reduced the SSF to 0.85.</p></div>
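The slope-based pruning described above can be sketched in a few lines (our own naming; the paper does not give code). A detection is kept only if the overall slope of its warping path lies strictly between 0.5 and 2, rejecting alignments that stretch or compress the query too much:

```python
def passes_slope_check(path, lo=0.5, hi=2.0):
    """Slope-based pruning of an NS-DTW detection.

    path: sequence of (query_frame, reference_frame) index pairs along
    the best alignment.  The overall slope m is the ratio of reference
    frames to query frames covered; only detections with m strictly
    between lo and hi survive, which prunes likely false alarms.
    """
    q_span = path[-1][0] - path[0][0]
    r_span = path[-1][1] - path[0][1]
    if q_span == 0:
        return False  # degenerate path, reject
    m = r_span / q_span
    return hi > m > lo
```

For example, a path from (0, 0) to (100, 100) has slope 1 and is kept, while a path from (0, 0) to (100, 300) has slope 3 and is pruned.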
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">ANALYSIS OF THE EXPERIMENTS</head><p>We analyzed the cases of false alarms and misses for all types of queries. The analysis of the false alarms led us to enforce the slope constraint on the aligned path described in Section 3. The results in Table <ref type="table" target="#tab_0">1</ref> show that the NS-DTW algorithm is able to detect some of the type 2 queries, but fails to detect type 3 queries. Fig. <ref type="figure" target="#fig_0">1(a)</ref> shows the similarity matrix for a multi-word query with filler content present in the reference. The dark bands represent matches between the query and the reference. In Fig. <ref type="figure" target="#fig_0">1</ref>(a) there are multiple dark bands, each showing a match between a part of the query (a word) and a specific location (word) in the reference. The peaks in the alignment scores in Fig. <ref type="figure" target="#fig_0">1</ref>(b) reflect the partial matches. This shows that, for this specific (FDLP + AF-BN) feature representation of speech, the algorithm is capable of detecting smaller/partial matches. Even though the scores reflect the partial matches, we observed that the poor performance of the system is due to the number of false alarms. Further investigation is required to find an approach that can penalize these false alarms. </p></div>
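Reading the partial matches off the alignment scores, as discussed for Fig. 1(b), amounts to finding local peaks in the per-ending-frame score curve. A minimal sketch under our own naming (the paper does not specify a peak-picking rule):

```python
def score_peaks(scores, threshold):
    """Find local maxima in the per-ending-frame alignment scores.

    scores: sequence where scores[t] is the best alignment score for a
    path ending at reference frame t.  Each strict local maximum above
    threshold is a candidate (possibly partial) match, mirroring the
    multiple dark bands in the similarity matrix of Fig. 1(a).
    """
    peaks = []
    for t in range(1, len(scores) - 1):
        if (scores[t] > scores[t - 1]
                and scores[t] > scores[t + 1]
                and scores[t] > threshold):
            peaks.append(t)
    return peaks
```

A query matching two separate words in the reference would then yield two peaks, one per matched region.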
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">USING PHONE DECODER</head><p>In this work, we also built a cross-lingual phone decoder and used NS-DTW for search. The cross-lingual decoder was built in a two-step process. First, we trained acoustic models on 24 hours of a Telugu database <ref type="bibr" target="#b4">[3]</ref>. These models were then used to decode the MediaEval 2013 SWS database <ref type="bibr" target="#b1">[1]</ref>. The decoded symbols were bootstrapped and the models were re-trained. This process was repeated 4 times, and the resulting acoustic models were used to obtain the hypotheses (global hypotheses).</p><p>We built a phone confusion matrix in an unsupervised way, as follows: (a) the SWS 2013 database was divided into 4 parts and 4 acoustic models were built; (b) 4 hypotheses (local hypotheses), each corresponding to a different part of the database, were obtained; (c) a string alignment was performed between the global hypotheses and each of the local hypotheses to obtain the phone confusions, with the global hypotheses taken as the reference. Next, the queries and the audio references were decoded using the bootstrapped models, and the search was performed using NS-DTW. The phone confusion matrix was used in the computation of the similarity matrix in the NS-DTW framework.</p><p>If i and j are the indices of phones and N is the number of phones in the dictionary, then the similarity between them is given by d(i, j) = c(i, j), ∀ 0 ≤ i, j ≤ N, where c(i, j) is the entry of the confusion matrix for i being the reference phone and j being the query phone.</p><p>The SSF in this case was 0.38 and the PMU was approximately 2 GB. The results for the various types of queries on the development dataset are shown in Table <ref type="table" target="#tab_1">2</ref>. </p></div>
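Building the NS-DTW similarity matrix from the phone confusion matrix, d(i, j) = c(i, j), is a simple table lookup over the two decoded phone sequences. A sketch under our own naming, assuming integer phone indices from the decoder:

```python
import numpy as np

def phone_similarity_matrix(query_phones, ref_phones, confusion):
    """Similarity matrix for NS-DTW over decoded phone sequences.

    query_phones, ref_phones: integer phone indices from the decoder.
    confusion: (N, N) array where confusion[r, q] measures how often
    reference phone r is confused with query phone q, so the (i, j)
    entry of the result is exactly c(i, j) from Section 5.
    """
    # np.ix_ selects the (ref, query) cross-product of rows and columns,
    # giving a matrix of shape (len(ref_phones), len(query_phones)).
    return confusion[np.ix_(ref_phones, query_phones)]
```

NS-DTW can then search for high-similarity alignment paths through this matrix exactly as in the posteriorgram case.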
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSION</head><p>In this work, we explored two different representations of speech and observed the capabilities and limitations of the NS-DTW algorithm for various types of queries. We also observed that the same algorithm is able to detect some of the type 2 queries in the reference documents. Future work focuses on improving the NS-DTW algorithm for detecting type 2 and type 3 queries, and on developing robust cross-lingual phone decoders.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An example similarity matrix obtained using NS-DTW, when a multi-word query with filler content is present in the reference.</figDesc><graphic coords="2,92.04,305.24,70.27,141.41" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Scores for various types of queries for (FDLP + AF-BN) feature representation on dev and eval datasets</figDesc><table><row><cell cols="5">dev dataset</cell></row><row><cell>Scores</cell><cell>All</cell><cell>Type 1</cell><cell>Type 2</cell><cell>Type 3</cell></row><row><cell>MinCnxe</cell><cell>0.8070</cell><cell>0.6734</cell><cell>0.8739</cell><cell>0.8986</cell></row><row><cell>Cnxe</cell><cell>0.9121</cell><cell>0.8032</cell><cell>1.0121</cell><cell>1.0235</cell></row><row><cell>MTWV</cell><cell>0.2263</cell><cell>0.3715</cell><cell>0.1472</cell><cell>0.0430</cell></row><row><cell>ATWV</cell><cell>0.2261</cell><cell>0.3662</cell><cell>0.1467</cell><cell>0.0425</cell></row><row><cell cols="5">eval dataset</cell></row><row><cell>Scores</cell><cell>All</cell><cell>Type 1</cell><cell>Type 2</cell><cell>Type 3</cell></row><row><cell>MinCnxe</cell><cell>0.8117</cell><cell>0.7006</cell><cell>0.8576</cell><cell>0.8936</cell></row><row><cell>Cnxe</cell><cell>0.9218</cell><cell>0.8115</cell><cell>1.0205</cell><cell>1.0012</cell></row><row><cell>MTWV</cell><cell>0.2062</cell><cell>0.3506</cell><cell>0.1188</cell><cell>0.0770</cell></row><row><cell>ATWV</cell><cell>0.2026</cell><cell>0.3475</cell><cell>0.1151</cell><cell>0.0655</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Scores for various types of queries for phone representation on dev dataset</figDesc><table><row><cell>Scores</cell><cell>All</cell><cell>Type 1</cell><cell>Type 2</cell><cell>Type 3</cell></row><row><cell>MinCnxe</cell><cell>0.9487</cell><cell>0.9331</cell><cell>0.9599</cell><cell>0.9641</cell></row><row><cell>MTWV</cell><cell>0.0477</cell><cell>0.0799</cell><cell>0.0308</cell><cell>0.0134</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://sites.google.com/site/bosaristoolkit/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The Spoken Web Search Task</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szoke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2013 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">October 18-19, 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Query by Example Search on Speech at Mediaeval</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szoke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the Mediaeval 2014 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-10-16">October 16-17, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Development of Indian language speech databases for LVCSR</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Anumanchipalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chitturi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sitaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kishore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SPECOM</title>
				<meeting>of SPECOM<address><addrLine>Patras, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mantena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Achanta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Prahallad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="946" to="955" />
			<date type="published" when="2014-05">May 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Use of articulatory bottle-neck features for query-by-example spoken term detection in low resource scenarios</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mantena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Prahallad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2014-05">May 2014</date>
			<biblScope unit="page" from="7128" to="7132" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
