=Paper=
{{Paper
|id=None
|storemode=property
|title=IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck Features for Query-by-Example Spoken Term Detection
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_67.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MantenaP13
}}
==IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck Features for Query-by-Example Spoken Term Detection==
Gautam Mantena, Kishore Prahallad
International Institute of Information Technology-Hyderabad, India
gautam.mantena@research.iiit.ac.in, kishore@iiit.ac.in
ABSTRACT

This paper describes the experiments conducted for the spoken web search (SWS) task at the MediaEval 2013 evaluations. A conventional approach is to train a multi-layer perceptron (MLP) on high resource languages and then use it in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language. In this paper, we use bottle-neck features derived from an MLP to generate Gaussian posteriorgrams. We also use a variant of a dynamic time warping (DTW) based technique which exploits the redundancy in the speech signal and averages successive Gaussian posteriorgrams to reduce the lengths of the spoken query and the spoken reference.
1. INTRODUCTION

Gaussian and phone posteriorgrams are a popular feature representation for query-by-example spoken term detection (QbE-STD). Gaussian posteriorgrams are typically trained in an unsupervised manner, often referred to as the zero-resource scenario, whereas phone posteriorgrams are obtained by training a multi-layer perceptron (MLP) in a supervised manner. For low/zero resource languages, an MLP is trained on high resource languages and then used in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language. These MLP classifier outputs, though they capture acoustic phonetic properties of a speech signal, are not sufficient as a feature representation, because the language used for training the MLP does not capture the complete acoustic characteristics of the multi-lingual data. To utilize the complementary information that is captured, we derive features from an MLP for obtaining Gaussian posteriorgrams. A similar feature representation has been explored in [1] for better search performance.

An alternative representation to phone posteriorgrams is the articulatory features (AFs). AFs are a better representation as they are more language universal than phones.

This paper describes the experiments conducted for spoken web search (SWS) at MediaEval 2013 [2]. The primary focus of this work is to explore the use of bottle-neck (BN) features derived from phone and AF MLPs for QbE-STD.
2. FEATURE EXTRACTION

We use a three step process to generate the features for QbE-STD: (a) extract speech parameters such as frequency domain linear prediction (FDLP) [3], (b) train a phone or AF MLP and extract the bottle-neck features for each of the speech parameters, and (c) compute Gaussian posteriorgrams using the speech parameters in combination with the derived BN features.

In [4], we show that Gaussian posteriorgrams computed from FDLP perform better than those obtained from short-time spectral analysis such as Mel-frequency cepstral coefficients. In this paper, we use FDLP as the acoustic parameters of the speech signal. A 25 ms window length with a 10 ms shift is used to extract 13 dimensional features along with delta and acceleration coefficients. An all-pole model of order 160 poles/sec and 37 filter banks are used to extract FDLP.
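As an illustration, the delta and acceleration coefficients can be obtained with the standard regression formula; the sketch below assumes a two-frame regression window, which the paper does not specify.

    import numpy as np

    def deltas(x, N=2):
        """Regression deltas d_t = sum_n n*(x[t+n]-x[t-n]) / (2*sum_n n^2)
        over a (T, D) feature matrix, repeating edge frames as padding."""
        T = x.shape[0]
        denom = 2.0 * sum(n * n for n in range(1, N + 1))
        pad = np.pad(x, ((N, N), (0, 0)), mode="edge")
        d = np.zeros_like(x)
        for n in range(1, N + 1):
            d += n * (pad[N + n:N + n + T] - pad[N - n:N - n + T])
        return d / denom

    def add_dynamics(static):
        """Stack static (T, 13) FDLP features with their delta and
        acceleration coefficients to form the (T, 39) feature matrix."""
        d = deltas(static)
        return np.hstack([static, d, deltas(d)])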
2.1 Phone and AF Bottle-Neck Features

We train phone and AF MLPs using a labelled Telugu database (≈ 24 hours) consisting of 49 phones [5]. The MLPs are trained to produce 49 dimensional phone posteriorgrams and 23 dimensional articulatory features (AFs) from the 39 dimensional FDLP features.

The articulatory features used in this work represent the characteristics of the speech production process, including vowel properties, place of articulation, manner of articulation, etc. We modified the AFs described in [6] to suit the available training data. We use nine different articulatory properties, as shown in Table 1. Each articulatory property is further divided into sub-classes, resulting in a 23 dimensional AF vector.

Table 1: Articulatory Features

Articulatory Property   Classes                          # bits
Voicing                 ±voicing                         1
Vowel length            short, long, diphthong           3
Vowel height            high, mid, low                   3
Vowel frontness         front, central, back             3
Lip rounding            ±rounding                        1
Manner of articulation  stop, fricative, affricative,    5
                        nasal, approximant
Place of articulation   velar, alveolar, palatal,        5
                        labial, dental
Aspiration              ±aspiration                      1
Silence                 ±silence                         1
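For illustration, one way to realise the 23-bit AF vector of Table 1 is to concatenate a single bit per binary (±) property with a one-hot field per multi-class property; the encoding below is a hypothetical sketch, since the exact bit layout is not given in the paper.

    import numpy as np

    # Properties and classes from Table 1; binary (±) properties
    # contribute one bit, multi-class properties a one-hot field.
    AF_PROPERTIES = [
        ("voicing",         None),                                  # 1 bit
        ("vowel_length",    ["short", "long", "diphthong"]),        # 3 bits
        ("vowel_height",    ["high", "mid", "low"]),                # 3 bits
        ("vowel_frontness", ["front", "central", "back"]),          # 3 bits
        ("lip_rounding",    None),                                  # 1 bit
        ("manner", ["stop", "fricative", "affricative",
                    "nasal", "approximant"]),                       # 5 bits
        ("place",  ["velar", "alveolar", "palatal",
                    "labial", "dental"]),                           # 5 bits
        ("aspiration",      None),                                  # 1 bit
        ("silence",         None),                                  # 1 bit
    ]                                                   # total: 23 bits

    def encode_af(values):
        """Map a dict of property -> class (or bool) to a 23-dim vector;
        properties that do not apply (e.g. vowel fields for consonants)
        are left all-zero."""
        bits = []
        for name, classes in AF_PROPERTIES:
            if classes is None:                    # binary property
                bits.append(1.0 if values.get(name) else 0.0)
            else:                                  # one-hot field
                field = [0.0] * len(classes)
                if values.get(name) in classes:
                    field[classes.index(values[name])] = 1.0
                bits.extend(field)
        return np.array(bits)

    # e.g. the phone /b/: a voiced, unaspirated labial stop
    af_b = encode_af({"voicing": True, "manner": "stop", "place": "labial"})
    assert af_b.shape == (23,)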
Table 2: Architecture of the MLPs trained to derive bottle-neck features

         Architecture
PH MLP   39L 120N 13L 120N 49S
AF MLP   39L 120N 13L 120N 23S

Table 2 shows the architectures used to build the phone and AF MLPs. The integer values in the MLP architecture indicate the number of nodes, and L (linear), N (non-linear) and S (sigmoid) represent the activation functions of the layers.
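Reading the PH MLP row, the topology is input (39, linear) → hidden (120, non-linear) → bottle-neck (13, linear) → hidden (120, non-linear) → output (49, sigmoid), and the BN features are the activations of the 13-node linear layer. A minimal forward-pass sketch, assuming tanh as the non-linearity (the paper does not name it) and already-trained weights:

    import numpy as np

    def extract_bn_features(x, params):
        """Forward (T, 39) FDLP features through the first two layers of
        a 39L 120N 13L 120N 49S MLP and return the (T, 13) bottle-neck
        activations. `params` holds trained (W, b) pairs per layer."""
        W1, b1 = params["hidden"]      # (39, 120), (120,)
        W2, b2 = params["bottleneck"]  # (120, 13), (13,)
        h = np.tanh(x @ W1 + b1)       # 39 -> 120, non-linear (N)
        return h @ W2 + b2             # 120 -> 13, linear (L): BN features

    # The remaining layers (13 -> 120N -> 49S phone or 23S AF outputs)
    # are needed only at training time to predict the targets.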
3. EXPERIMENTS AND RESULTS

Gaussian posteriorgrams are computed by training a Gaussian mixture model (GMM) on the spoken data; the posterior probability obtained from each Gaussian is used to represent the speech parameters. The number of Gaussians represents the approximate number of acoustic units present in the spoken data. We computed Gaussian posteriorgrams as described in [7], training the GMM with 128 Gaussians. Before performing the DTW search, we removed the Gaussian posteriorgrams corresponding to silence regions, as described in [8]. All the experiments were conducted on an HPC cluster with HP SL230s compute nodes, each equipped with two Intel E5-2640 processors with 12 cores each.
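A Gaussian posteriorgram is simply the vector of per-component posterior probabilities under the GMM; the sketch below uses scikit-learn's GaussianMixture as an assumed stand-in for the authors' tooling (diagonal covariances are likewise an assumption).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gaussian_posteriorgrams(train_feats, utt_feats, n_components=128):
        """Fit a 128-component GMM on pooled training features and
        return, per frame of an utterance, the posterior probability of
        every Gaussian: a (T, 128) posteriorgram whose rows sum to 1."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(train_feats)
        return gmm.predict_proba(utt_feats)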
We used a variant of the DTW-based approach, referred to as non-segmental DTW (NS-DTW), to obtain the search results [4]. NS-DTW is similar to the DTW-based search given in [7] but differs in the local constraints.
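The exact local constraints of NS-DTW are specified in [4]; purely as a reference point, a generic subsequence DTW over posteriorgram similarity, with a free start in the reference and the common negative-log inner-product frame distance, looks like this (not the authors' exact algorithm):

    import numpy as np

    def subsequence_dtw(query, ref):
        """Generic subsequence DTW between an (m, d) query and an (n, d)
        reference posteriorgram. Returns the best length-normalised
        alignment cost (lower = better match)."""
        m, n = len(query), len(ref)
        dist = -np.log(np.maximum(query @ ref.T, 1e-10))   # (m, n)
        D = np.full((m, n), np.inf)
        D[0, :] = dist[0, :]              # match may start at any ref frame
        for i in range(1, m):
            D[i, 0] = dist[i, 0] + D[i - 1, 0]
            for j in range(1, n):
                D[i, j] = dist[i, j] + min(D[i - 1, j],      # insertion
                                           D[i, j - 1],      # deletion
                                           D[i - 1, j - 1])  # match
        return D[m - 1].min() / m         # ...and end at any ref frame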
Table 3 shows the maximum term weighted values (MTWV) obtained using each of the features. From Table 3, it can be seen that the use of bottle-neck features has improved the performance of the system. To perform the search, our algorithm requires approximately 10 GB of memory.

Table 3: MTWV using Gaussian posteriorgrams computed from various features

Feats.          dev      eval
FDLP            0.1652   0.1557
PH-BN           0.2491   0.2133
AF-BN           0.2627   0.2122
FDLP + PH-BN    0.2741   0.2492
FDLP + AF-BN    0.2765   0.2413
To improve the computational performance, we reduce the query and reference Gaussian posteriorgram vectors before performing the search. Given a reduction factor α ∈ N, a window of size α is placed over the posteriorgram features and a mean is computed. The window is then shifted by α and another mean vector is computed. The posteriorgram vectors are replaced by this reduced set of mean vectors. The averaging of Gaussian posteriorgrams also reduces the amount of memory required to compute the similarity matrix. In a conventional approach, the space complexity of computing the similarity matrix between a query and a reference is of order O(mnd²), where m, n are the lengths of the reference and the query and d is the dimension of the feature vector. The averaging of Gaussian posteriorgrams reduces the space complexity to an order of O(mnd²/α²).
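The reduction step amounts to non-overlapping mean pooling with window α; a minimal sketch:

    import numpy as np

    def reduce_posteriorgrams(post, alpha=2):
        """Average non-overlapping windows of `alpha` successive frames
        of a (T, d) posteriorgram, giving a (ceil(T / alpha), d) matrix.
        With both query and reference reduced, the similarity matrix
        shrinks by a factor of alpha**2."""
        T, d = post.shape
        pad = (-T) % alpha                 # pad so T divides evenly
        if pad:
            post = np.vstack([post, np.repeat(post[-1:], pad, axis=0)])
        return post.reshape(-1, alpha, d).mean(axis=1)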
Table 4: Evaluation using FNS-DTW for various values of α

     dev                  eval
α    MTWV     RT (10⁻⁴)   MTWV     RT (10⁻⁴)
1    0.2765   16.55       0.2413   15.67
2    0.2530   4.21        0.2236   4.16
3    0.2252   1.92        0.1995   1.85
4    0.2043   1.11        0.1773   1.11

Table 4 shows the MTWV and the runtime factor (RT) for various values of α using the FDLP + AF-BN features. The results show an improvement in speed at the cost of search accuracy. We consider α = 2 an optimum value based on the MTWV and the speed improvement.

4. CONCLUSIONS

In this work we have used the bottle-neck features obtained from phone and articulatory MLPs. We have shown that these BN features perform better than the conventional Gaussian posteriorgrams computed from FDLP. This motivates us to build models using high resource languages and use them in the low resource scenario.
5. REFERENCES

[1] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection,” in Proc. of ICASSP, 2013.
[2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes, “The spoken web search task,” in MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of reverberant speech using frequency domain linear prediction,” IEEE Signal Processing Letters, vol. 15, pp. 681–684, 2008.
[4] G. Mantena, S. Achanta, and K. Prahallad, “Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping,” submitted to IEEE Trans. Audio, Speech and Lang. Processing, 2013.
[5] G. K. Anumanchipalli, R. Chitturi, S. Joshi, S. Singh, R. Kumar, R. N. V. Sitaram, and S. P. Kishore, “Development of Indian language speech databases for LVCSR,” in Proc. of SPECOM, Patras, Greece, 2005.
[6] B. Bollepalli, A. W. Black, and K. Prahallad, “Modelling a noisy-channel for voice conversion using articulatory features,” in Proc. of INTERSPEECH, 2012.
[7] X. Anguera, “Speaker independent discriminant feature extraction for acoustic pattern-matching,” in Proc. of ICASSP, 2012, pp. 485–488.
[8] X. Anguera, “Telefonica Research system for the spoken web search task at MediaEval 2012,” in MediaEval 2012 Workshop, Pisa, Italy, October 2012.