<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck Features for Query-by-Example Spoken Term Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gautam</forename><surname>Mantena</surname></persName>
							<email>gautam.mantena@research.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kishore</forename><surname>Prahallad</surname></persName>
							<email>kishore@iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck Features for Query-by-Example Spoken Term Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">18CD8B11DA632D86EA1F0932CAC355E7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T17:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the experiments conducted for the spoken web search (SWS) task at the MediaEval 2013 evaluations. A conventional approach is to train a multi-layer perceptron (MLP) using high resource languages and then use it in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language.</p><p>In this paper, we use bottle-neck features derived from an MLP to generate Gaussian posteriorgrams. We also use a variant of a dynamic time warping (DTW) based technique which exploits the redundancy in the speech signal by averaging successive Gaussian posteriorgrams to reduce the lengths of the spoken query and the spoken reference.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Gaussian and phone posteriorgrams are popular feature representations for query-by-example spoken term detection (QbE-STD). Gaussian posteriorgrams are typically trained in an unsupervised manner, often referred to as a zero-resource scenario, whereas phone posteriorgrams are obtained by training a multi-layer perceptron (MLP) in a supervised manner. For low/zero resource languages, an MLP is trained on high resource languages and then used in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language. Although these MLP classifier outputs capture acoustic-phonetic properties of the speech signal, they are not sufficient as a feature representation, because the language used to train the MLP cannot capture the complete acoustic characteristics of the multi-lingual data. To utilize the complementary information the MLP does capture, we derive features from it for obtaining Gaussian posteriorgrams. A similar feature representation has been explored in <ref type="bibr" target="#b0">[1]</ref> for better search performance.</p><p>An alternative to phone posteriorgrams is articulatory features (AFs). AFs are a better representation as they are more language-universal than phones.</p><p>This paper describes the experiments conducted for the spoken web search (SWS) task at MediaEval 2013 <ref type="bibr" target="#b1">[2]</ref>. The primary focus of this work is to explore the use of bottle-neck (BN) features derived from phone and AF MLPs for QbE-STD.</p><p>Copyright is held by the author/owner(s).</p><p>MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">FEATURE EXTRACTION</head><p>We use a three-step process to generate the features for QbE-STD: (a) extract speech parameters such as frequency domain linear prediction (FDLP) <ref type="bibr" target="#b2">[3]</ref>, (b) train a phone or AF MLP and extract the bottle-neck features for each of the speech parameters, and (c) compute Gaussian posteriorgrams using the speech parameters in combination with the derived BN features.</p><p>In <ref type="bibr" target="#b3">[4]</ref>, we show that Gaussian posteriorgrams computed from FDLP perform better than those obtained from short-time spectral analysis such as Mel-frequency cepstral coefficients. In this paper, we use FDLP as the acoustic parameterization of the speech signal.</p><p>A 25 ms window with a 10 ms shift is used to extract 13-dimensional FDLP features along with delta and acceleration coefficients. An all-pole model of order 160 poles/sec and 37 filter banks are used to extract FDLP.</p></div>
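The 39-dimensional parameterization above (13 static coefficients plus delta and acceleration) can be sketched as follows. This is a generic regression-based delta computation, not the authors' exact FDLP front end; the `static` matrix is a random stand-in for real FDLP coefficients.

```python
import numpy as np

def deltas(feats, window=2):
    """Regression-based delta coefficients over a +/-window frame context."""
    T, _ = feats.shape
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(feats)
    for k in range(1, window + 1):
        # difference between frames t+k and t-k, weighted by k
        out += k * (padded[window + k:window + k + T]
                    - padded[window - k:window - k + T])
    return out / denom

# 13 static coefficients per 10 ms frame (toy random stand-in for FDLP)
static = np.random.randn(100, 13)
delta = deltas(static)          # first-order dynamics
accel = deltas(delta)           # second-order (acceleration)
feats39 = np.hstack([static, delta, accel])  # 39-dimensional vectors
```

Stacking statics with their first- and second-order dynamics is what yields the 39-dimensional input the MLPs consume in the next section.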
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Phone and AF Bottle-Neck Features</head><p>We train phone and AF MLPs using a labelled Telugu database (≈ 24 hours) consisting of 49 phones <ref type="bibr" target="#b4">[5]</ref>. The MLPs are trained to map 39-dimensional FDLP features to 49-dimensional phone posteriorgrams and 23-dimensional articulatory features (AFs). The AFs used in this work represent characteristics of the speech production process, including vowel properties, place of articulation, manner of articulation, etc. We modified the AFs described in <ref type="bibr" target="#b5">[6]</ref> to suit the available training data. We use nine articulatory properties, as shown in Table <ref type="table" target="#tab_0">1</ref>. Each property is further divided into sub-classes, resulting in a 23-dimensional AF vector.</p></div>
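A minimal sketch of how the nine properties of Table 1 assemble into a 23-dimensional binary target vector. The class inventories follow the table; the attribute dictionary passed in for a given phone is purely illustrative, not the authors' phone-to-attribute mapping.

```python
import numpy as np

# Articulatory properties and their sub-classes, following Table 1.
# A "±" property contributes a single bit; the others are one-hot.
AF_CLASSES = {
    "voicing":    ["+voicing"],                                   # 1 bit
    "length":     ["short", "long", "diphthong"],                 # 3 bits
    "height":     ["high", "mid", "low"],                         # 3 bits
    "frontness":  ["front", "central", "back"],                   # 3 bits
    "rounding":   ["+rounding"],                                  # 1 bit
    "manner":     ["stop", "fricative", "affricative",
                   "nasal", "approximant"],                       # 5 bits
    "place":      ["velar", "alveolar", "palatal",
                   "labial", "dental"],                           # 5 bits
    "aspiration": ["+aspiration"],                                # 1 bit
    "silence":    ["+silence"],                                   # 1 bit
}

def af_vector(attrs):
    """Encode a phone's articulatory attributes as a 23-dim binary target."""
    vec = []
    for prop, classes in AF_CLASSES.items():
        vec.extend(1.0 if attrs.get(prop) == c else 0.0 for c in classes)
    return np.array(vec)

# Hypothetical example: a voiced, unaspirated velar stop (e.g. /g/)
g = af_vector({"voicing": "+voicing", "manner": "stop", "place": "velar"})
```

The sub-class counts (1+3+3+3+1+5+5+1+1) sum to 23, matching the AF MLP's output layer.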
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS AND RESULTS</head><p>Gaussian posteriorgrams are computed by training a Gaussian mixture model (GMM) on the spoken data, and the posterior probability obtained from each Gaussian is used to represent the speech parameters. The number of Gaussians represents the approximate number of acoustic units present in the spoken data. We computed Gaussian posteriorgrams as described in <ref type="bibr" target="#b6">[7]</ref>, training the GMM with 128 Gaussians. Before performing the DTW search, we removed the Gaussian posteriorgrams corresponding to silence regions as described in <ref type="bibr" target="#b7">[8]</ref>. All the experiments were conducted on an HPC cluster with HP SL230s compute nodes. Each HP SL230s node is equipped with two Intel E5-2640 processors with 12 cores each.</p><p>We used a variant of the DTW-based approach, referred to as non-segmental DTW (NS-DTW), to obtain the search results <ref type="bibr" target="#b3">[4]</ref>. NS-DTW is similar to the DTW-based search given in <ref type="bibr" target="#b6">[7]</ref> but differs in the local constraints. Table <ref type="table" target="#tab_2">3</ref> shows the maximum term weighted values (MTWV) obtained using each of the features. From Table <ref type="table" target="#tab_2">3</ref>, it can be seen that the use of bottle-neck features improves the performance of the system. To perform the search, our algorithm requires approximately 10 GB of memory. To improve the computational performance, we reduce the query and reference Gaussian posteriorgram vectors before performing the search. Given a reduction factor α ∈ ℕ, a window of size α is considered over the posteriorgram features and a mean is computed. The window is then shifted by α and another mean vector is computed. The posteriorgram vectors are thus replaced with the reduced number of mean vectors.</p><p>The averaging of Gaussian posteriorgrams also reduces the amount of memory required to compute the similarity matrix. In a conventional approach, the space complexity of computing the similarity matrix between a query and a reference is of order O(mnd²), where m, n are the lengths of the reference and query and d is the dimension of the feature vector. The averaging of Gaussian posteriorgrams reduces the space complexity to an order of O(mnd²/α²). Table <ref type="table" target="#tab_3">4</ref> shows the MTWV and the runtime factor (RT) for various values of α using FDLP + AF-BN features. The results show an improvement in speed at the cost of search accuracy. We have considered α = 2 an optimum value based on MTWV and the speed improvements.</p></div>
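A minimal sketch of the two steps above: per-frame Gaussian posteriors from a trained GMM, followed by the α-windowed averaging. The diagonal-covariance GMM, its toy parameters, and the feature dimensions here are illustrative stand-ins, not the paper's 128-Gaussian model over 39-dimensional features.

```python
import numpy as np

def gaussian_posteriorgram(feats, weights, means, variances):
    """Per-frame posterior probability of each diagonal-covariance Gaussian."""
    T, d = feats.shape
    K = len(weights)
    logp = np.empty((T, K))
    for k in range(K):
        diff2 = ((feats - means[k]) ** 2 / variances[k]).sum(axis=1)
        logp[:, k] = (np.log(weights[k])
                      - 0.5 * (d * np.log(2 * np.pi)
                               + np.log(variances[k]).sum() + diff2))
    logp -= logp.max(axis=1, keepdims=True)   # stabilise before exponentiating
    post = np.exp(logp)
    return post / post.sum(axis=1, keepdims=True)

def reduce_posteriorgram(post, alpha):
    """Average non-overlapping windows of alpha successive frames.
    Any tail shorter than alpha is dropped."""
    T = (len(post) // alpha) * alpha
    return post[:T].reshape(-1, alpha, post.shape[1]).mean(axis=1)

# Toy model: 4 Gaussians over 3-dimensional features
rng = np.random.default_rng(0)
feats = rng.standard_normal((101, 3))
weights = np.full(4, 0.25)
means = rng.standard_normal((4, 3))
variances = np.ones((4, 3))

post = gaussian_posteriorgram(feats, weights, means, variances)
reduced = reduce_posteriorgram(post, alpha=2)   # 101 frames -> 50 frames
```

Since both query and reference shrink by a factor of α, the m×n similarity matrix shrinks by α², which is the source of the memory and runtime savings reported for FNS-DTW.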
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>In this work we have used bottle-neck features obtained from phone and articulatory MLPs. We have shown that these BN features perform better than conventional Gaussian posteriorgrams computed from FDLP. This motivates us to build models using high resource languages and use them in the low resource scenario.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Articulatory Features</figDesc><table><row><cell>Articulatory Property</cell><cell>Classes</cell><cell># bits</cell></row><row><cell>Voicing</cell><cell>±voicing</cell><cell>1</cell></row><row><cell>Vowel length</cell><cell>short, long, diphthong</cell><cell>3</cell></row><row><cell>Vowel height</cell><cell>high, mid, low</cell><cell>3</cell></row><row><cell>Vowel frontness</cell><cell>front, central, back</cell><cell>3</cell></row><row><cell>Lip rounding</cell><cell>±rounding</cell><cell>1</cell></row><row><cell>Manner of articulation</cell><cell>stop, fricative, affricative, nasal, approximant</cell><cell>5</cell></row><row><cell>Place of articulation</cell><cell>velar, alveolar, palatal, labial, dental</cell><cell>5</cell></row><row><cell>Aspiration</cell><cell>±aspiration</cell><cell>1</cell></row><row><cell>Silence</cell><cell>±silence</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2:</head><label>2</label><figDesc>Architecture of the MLPs trained to derive bottle-neck features</figDesc><table><row><cell></cell><cell>Architecture</cell></row><row><cell>PH MLP</cell><cell>39L 120N 13L 120N 49S</cell></row><row><cell>AF MLP</cell><cell>39L 120N 13L 120N 23S</cell></row><row><cell cols="2">Table 2 shows the architectures used to build phone and AF MLPs. The integer values in the MLP architecture indicate the number of nodes, and L (linear), N (non-linear) and S (sigmoid) represent the activation functions in each of the layers.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3:</head><label>3</label><figDesc>MTWV using Gaussian posteriorgrams computed from various features</figDesc><table><row><cell>Feats.</cell><cell>dev</cell><cell>eval</cell></row><row><cell>FDLP</cell><cell>0.1652</cell><cell>0.1557</cell></row><row><cell>PH-BN</cell><cell>0.2491</cell><cell>0.2133</cell></row><row><cell>AF-BN</cell><cell>0.2627</cell><cell>0.2122</cell></row><row><cell>FDLP + PH-BN</cell><cell>0.2741</cell><cell>0.2492</cell></row><row><cell>FDLP + AF-BN</cell><cell>0.2765</cell><cell>0.2413</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4:</head><label>4</label><figDesc>Evaluation using FNS-DTW for various values of α</figDesc><table><row><cell>α</cell><cell cols="2">dev</cell><cell cols="2">eval</cell></row><row><cell></cell><cell>MTWV</cell><cell>RT (10⁻⁴)</cell><cell>MTWV</cell><cell>RT (10⁻⁴)</cell></row><row><cell>1</cell><cell>0.2765</cell><cell>16.55</cell><cell>0.2413</cell><cell>15.67</cell></row><row><cell>2</cell><cell>0.2530</cell><cell>4.21</cell><cell>0.2236</cell><cell>4.16</cell></row><row><cell>3</cell><cell>0.2252</cell><cell>1.92</cell><cell>0.1995</cell><cell>1.85</cell></row><row><cell>4</cell><cell>0.2043</cell><cell>1.11</cell><cell>0.1773</cell><cell>1.11</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ICASSP</title>
				<meeting>of ICASSP</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The spoken web search task</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szoke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2013 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">October 18-19 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Recognition of reverberant speech using frequency domain linear prediction</title>
		<author>
			<persName><forename type="first">S</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganapathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hermansky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Letters</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="681" to="684" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mantena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Achanta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Prahallad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">submitted to IEEE Trans. Audio, Speech and Lang. Processing</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Development of Indian language speech databases for LVCSR</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Anumanchipalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chitturi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">N</forename><surname>Sitaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Kishore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SPECOM</title>
				<meeting>of SPECOM<address><addrLine>Patras, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Modelling a noisy-channel for voice conversion using articulatory features</title>
		<author>
			<persName><forename type="first">B</forename><surname>Bollepalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Black</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Prahallad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of INTERSPEECH</title>
				<meeting>of INTERSPEECH</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Speaker independent discriminant feature extraction for acoustic pattern-matching</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ICASSP</title>
				<meeting>of ICASSP</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="485" to="488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Telefonica Research system for the spoken web search task at MediaEval 2012</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2012 Workshop</title>
				<meeting><address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-10">October 2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
