 IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck
   Features for Query-by-Example Spoken Term Detection

                                        Gautam Mantena, Kishore Prahallad
                              International Institute of Information Technology-Hyderabad, India
                                     gautam.mantena@research.iiit.ac.in, kishore@iiit.ac.in



ABSTRACT

This paper describes the experiments conducted for the spoken web search (SWS) task at the MediaEval 2013 evaluations. A conventional approach is to train a multi-layer perceptron (MLP) on high resource languages and then use it in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language.

In this paper, we use bottle-neck features derived from an MLP to generate Gaussian posteriorgrams. We also use a variant of a dynamic time warping (DTW) based technique which exploits the redundancy in the speech signal by averaging successive Gaussian posteriorgrams to reduce the lengths of the spoken query and the spoken reference.

1.   INTRODUCTION

Gaussian and phone posteriorgrams are a popular feature representation for query-by-example spoken term detection (QbE-STD). Gaussian posteriorgrams are typically trained in an unsupervised manner, often referred to as the zero-resource scenario, whereas phone posteriorgrams are obtained by training a multi-layer perceptron (MLP) in a supervised manner. For low/zero resource languages, an MLP is trained on high resource languages and then used in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language. These MLP classifier outputs, though they capture acoustic-phonetic properties of a speech signal, are not sufficient as a feature representation. This is because the language used for training the MLP is not enough to capture the complete acoustic characteristics of the multi-lingual data. To utilize this complementary information, we derive features from an MLP for obtaining Gaussian posteriorgrams. A similar feature representation has been explored in [1] for better search performance.

An alternative representation to phone posteriorgrams are articulatory features (AFs). AFs are a better representation as they are more language universal than phones.

This paper describes the experiments conducted for spoken web search (SWS) at MediaEval 2013 [2]. The primary focus of this work is to explore the use of bottle-neck (BN) features derived from phone and AF MLPs for QbE-STD.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

2.   FEATURE EXTRACTION

We use a three-step process to generate the features for QbE-STD: (a) extract speech parameters such as frequency domain linear prediction (FDLP) [3], (b) train a phone or AF MLP and extract the bottle-neck features for each of the speech parameters, and (c) compute Gaussian posteriorgrams using the speech parameters in combination with the derived BN features.

In [4], we show that Gaussian posteriorgrams computed from FDLP perform better than those obtained from short-time spectral analysis such as Mel-frequency cepstral coefficients. In this paper, we use FDLP as the acoustic parameters of the speech signal.

A 25 ms window length with a 10 ms shift was used to extract 13 dimensional features along with delta and acceleration coefficients for FDLP. An all-pole model of order 160 poles/sec and 37 filter banks were used to extract FDLP.

2.1    Phone and AF Bottle-Neck Features

In this paper, we train phone and AF MLPs using a labelled Telugu database (≈ 24 hours) consisting of 49 phones [5]. The MLPs are trained to obtain 49 dimensional phone posteriorgrams and 23 dimensional articulatory features (AFs) from 39 dimensional FDLP features.

                Table 1: Articulatory Features
  Articulatory Property    Classes                       # bits
  Voicing                  ±voicing                          1
  Vowel length             short, long, diphthong            3
  Vowel height             high, mid, low                    3
  Vowel frontness          front, central, back              3
  Lip rounding             ±rounding                         1
  Manner of articulation   stop, fricative, affricate,       5
                           nasal, approximant
  Place of articulation    velar, alveolar, palatal,         5
                           labial, dental
  Aspiration               ±aspiration                       1
  Silence                  ±silence                          1

The articulatory features (AFs) used in this work represent characteristics of the speech production process, which include vowel properties, place of articulation, manner of articulation, etc. We modified the AFs described in [6] to suit the training data available. We use nine different articulatory properties, as shown in Table 1. Each articulatory property is further divided into sub-classes, resulting in a 23 dimensional AF vector.

Table 2: Architecture of the MLPs trained to derive
bottle-neck features
            Architecture
  PH MLP    39L 120N 13L 120N 49S
  AF MLP    39L 120N 13L 120N 23S

Table 2 shows the architectures used to build the phone and AF MLPs. The integer values in the MLP architecture indicate the number of nodes, and L (linear), N (non-linear) and S (sigmoid) represent the activation functions in each of the layers.

3.   EXPERIMENTS AND RESULTS

Gaussian posteriorgrams are computed by training a Gaussian mixture model (GMM) on the spoken data; the posterior probability obtained from each Gaussian is used to represent the speech parameters. The number of Gaussians represents the approximate number of acoustic units present in the spoken data. We computed Gaussian posteriorgrams as described in [7]. We trained the GMMs using 128 Gaussians. Before performing the DTW search we removed the Gaussian posteriorgrams corresponding to silence regions as described in [8]. All the experiments were conducted on an HPC cluster with HP SL230s compute nodes. Each HP SL230s node is equipped with two Intel E5-2640 processors with 12 cores each.

We used a variant of a DTW-based approach, referred to as non-segmental DTW (NS-DTW), to obtain the search results [4]. NS-DTW is similar to the DTW-based search given in [7] but differs in the local constraints. Table 3 shows the maximum term weighted values (MTWV) obtained using each of the features. From Table 3, it can be seen that the use of bottle-neck features has improved the performance of the system. To perform the search, our algorithm requires approximately 10 GB of memory.

Table 3: MTWV using Gaussian posteriorgrams
computed from various features
      Feats.          dev      eval
      FDLP           0.1652   0.1557
      PH-BN          0.2491   0.2133
      AF-BN          0.2627   0.2122
      FDLP + PH-BN   0.2741   0.2492
      FDLP + AF-BN   0.2765   0.2413

To improve the computational performance, we reduce the query and reference Gaussian posteriorgram vectors before performing the search. Given a reduction factor α ∈ N, a window of size α is considered over the posteriorgram features and a mean is computed. The window is then shifted by α and another mean vector is computed. The posteriorgram vectors are replaced with the reduced number of posteriorgram features during this process. The averaging of Gaussian posteriorgrams also reduces the amount of memory required to compute the similarity matrix. In a conventional approach the space complexity required to compute the similarity matrix between a query and a reference is of order O(mnd²), where m, n are the lengths of the reference and query and d is the dimension of the feature vector. The averaging of Gaussian posteriorgrams reduces the space complexity to an order of O(mnd²/α²).

Table 4: Evaluation using FNS-DTW for various values of α
            dev                  eval
  α    MTWV   RT (10⁻⁴)     MTWV   RT (10⁻⁴)
  1    0.2765    16.55      0.2413    15.67
  2    0.2530     4.21      0.2236     4.16
  3    0.2252     1.92      0.1995     1.85
  4    0.2043     1.11      0.1773     1.11

Table 4 shows the MTWV and the runtime factor (RT) for various values of α using FDLP + AF-BN features. The results show an improvement in speed at the cost of search accuracy. We have considered α = 2 an optimum value based on the MTWV and the speed improvements.

4.   CONCLUSIONS

In this work we have used the bottle-neck features obtained from phone and articulatory MLPs. We have shown that these BN features perform better than the conventional Gaussian posteriorgrams computed from FDLP. This motivates us to build models using high resource languages and use them in the low resource scenario.

5.   REFERENCES

[1] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, "Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection," in Proc. of ICASSP, 2013.
[2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes, "The spoken web search task," in MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] S. Thomas, S. Ganapathy, and H. Hermansky, "Recognition of reverberant speech using frequency domain linear prediction," IEEE Signal Processing Letters, vol. 15, pp. 681-684, 2008.
[4] G. Mantena, S. Achanta, and K. Prahallad, "Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping," submitted to IEEE Trans. Audio, Speech and Lang. Processing, 2013.
[5] G. K. Anumanchipalli, R. Chitturi, S. Joshi, S. Singh, R. Kumar, R. N. V. Sitaram, and S. P. Kishore, "Development of Indian language speech databases for LVCSR," in Proc. of SPECOM, Patras, Greece, 2005.
[6] B. Bollepalli, A. W. Black, and K. Prahallad, "Modelling a noisy-channel for voice conversion using articulatory features," in Proc. of INTERSPEECH, 2012.
[7] X. Anguera, "Speaker independent discriminant feature extraction for acoustic pattern-matching," in Proc. of ICASSP, 2012, pp. 485-488.
[8] X. Anguera, "Telefonica Research system for the spoken web search task at MediaEval 2012," in MediaEval 2012 Workshop, Pisa, Italy, October 2012.
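The Gaussian posteriorgram representation of Section 3 maps each feature frame to the vector of posterior probabilities of the GMM components. A minimal sketch of that mapping is shown below; the function name, the diagonal-covariance assumption, and the toy parameters are our illustrative choices, not details taken from the paper or from [7].

```python
import numpy as np

def gaussian_posteriorgram(frames, weights, means, variances):
    """Posterior p(k | x_t) of each diagonal-covariance Gaussian for
    each feature frame; the rows form the Gaussian posteriorgram."""
    # log N(x_t | mu_k, diag(var_k)) for every frame/component pair
    diff = frames[:, None, :] - means[None, :, :]             # (T, K, D)
    log_det = np.sum(np.log(variances), axis=1)               # (K,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, K)
    D = frames.shape[1]
    log_lik = -0.5 * (D * np.log(2 * np.pi) + log_det[None, :] + mahal)
    log_post = np.log(weights)[None, :] + log_lik
    # normalise in the log domain for numerical stability
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

By construction each row is a proper distribution over the K Gaussians, which is what allows the silence-region rows to be pruned and the remaining rows to be compared during the DTW search.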
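The search in Section 3 aligns query and reference posteriorgrams with a DTW variant. The sketch below uses plain DTW with -log of the inner product between posterior vectors as the local distance (a common choice for posteriorgrams); the NS-DTW of [4] differs in its local path constraints, which are not reproduced here, so this is only a generic baseline illustration.

```python
import numpy as np

def dtw_distance(query, reference):
    """Plain DTW between two posteriorgram sequences.
    Local distance: -log(q_i . r_j), clipped away from zero."""
    eps = 1e-10
    dist = -np.log(np.clip(query @ reference.T, eps, None))  # (Tq, Tr)
    tq, tr = dist.shape
    acc = np.full((tq + 1, tr + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, tq + 1):
        for j in range(1, tr + 1):
            # standard symmetric local constraints
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[tq, tr]
```

With one-hot posterior vectors the distance of a sequence to itself is zero, and any misordering of frames raises the accumulated cost, which is the behaviour the search relies on.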
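The reduction step at the end of Section 3 can be sketched directly from its description: average the posteriorgram vectors in non-overlapping windows of size α, shrinking a sequence of T frames to roughly T/α and hence the similarity matrix by a factor of about α². The function name and the handling of a final partial window are our assumptions.

```python
import math
import numpy as np

def reduce_posteriorgram(posteriors, alpha):
    """Average successive posteriorgram vectors in non-overlapping
    windows of size alpha, reducing T frames to ceil(T / alpha)."""
    T = len(posteriors)
    reduced = [posteriors[i:i + alpha].mean(axis=0)
               for i in range(0, T, alpha)]
    return np.vstack(reduced)
```

Because each output row is a mean of probability vectors, the rows still sum to one, so the reduced sequence remains a valid posteriorgram for the DTW search.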