=Paper= {{Paper |id=Vol-1263/paper76 |storemode=property |title=IIIT-H System for MediaEval 2014 QUESST |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_76.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/KesirajuMP14 }} ==IIIT-H System for MediaEval 2014 QUESST== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_76.pdf
                   IIIT-H System for MediaEval 2014 QUESST

                            Santosh Kesiraju, Gautam Mantena, Kishore Prahallad
                              International Institute of Information Technology-Hyderabad, India
                              {santosh.k, gautam.mantena}@research.iiit.ac.in, kishore@iiit.ac.in




ABSTRACT
This paper describes the experiments and observations for the Query-by-Example Search on Speech Task (QUESST) at MediaEval 2014. We describe two different representations of speech that were explored for the task, and we show the capabilities and limitations of the non-segmental dynamic time warping (NS-DTW) technique for searching various types of queries. The paper focuses mainly on experiments with, and analysis of, the existing NS-DTW algorithm. The observations show that, for a specific representation of speech, the algorithm is capable of detecting partial matches.

1. INTRODUCTION
Some approaches to query-by-example spoken term detection rely on building models from resource-rich languages and use these models to convert the speech data into a sequence of symbols. Building models for multi-lingual data is a challenging task, as phone classes are not language universal. Another approach relies on dynamic time warping (DTW) based techniques for matching two time series of vectors. Here, speech data is usually represented as Gaussian posteriorgrams (GP) of various acoustic features.

For the MediaEval 2014 QUESST task [2], we explored unsupervised techniques involving various representations of the speech data. Initially, we represented the speech data using GP of acoustic and bottle-neck features. We also built a cross-lingual ASR and decoded the speech data into a sequence of symbols (phone sequences). Both representations rely on DTW to detect the queries in the audio references.

2. FEATURE REPRESENTATION
A three-step process was used to generate the features for the queries and the audio references. (a) 39-dimensional frequency domain linear prediction (FDLP) features along with delta and acceleration coefficients were extracted for every 25 ms window with a shift of 10 ms. An all-pole model of order 160 poles/sec and 37 filter banks were used to extract the FDLP features. (b) Bottle-neck (BN) features were derived from a multi-layer perceptron (MLP) trained with articulatory features (AF). (c) Gaussian posteriorgrams were computed for the speech parameters (FDLP) in tandem with the articulatory bottle-neck features. Bottle-neck features are a compressed, lower-dimensional representation that also captures the classification properties of the target classes. These features were obtained from the MLP trained on 24 hours of a labeled Telugu database [3]. The articulatory bottle-neck features were extracted as described in [5].

3. NS-DTW FOR SEARCH
We used a variant of DTW called non-segmental DTW (NS-DTW) [4], which differs in the local constraints. As a post-processing step, we pruned some of the results. The pruning criterion is based on the slope of the aligned path: if m is the slope of the aligned path, only paths satisfying 0.5 < m < 2 were considered. This helped eliminate some of the false alarms. We used the linear calibration function in the BOSARIS toolkit (footnote 1) to calibrate the scores. Table 1 shows the results on the development and evaluation datasets for different types of queries.

Table 1: Scores for various types of queries for (FDLP + AF-BN) feature representation on dev and eval datasets

dev dataset
Scores     All      Type 1   Type 2   Type 3
MinCnxe    0.8070   0.6734   0.8739   0.8986
Cnxe       0.9121   0.8032   1.0121   1.0235
MTWV       0.2263   0.3715   0.1472   0.0430
ATWV       0.2261   0.3662   0.1467   0.0425

eval dataset
Scores     All      Type 1   Type 2   Type 3
MinCnxe    0.8117   0.7006   0.8576   0.8936
Cnxe       0.9218   0.8115   1.0205   1.0012
MTWV       0.2062   0.3506   0.1188   0.0770
ATWV       0.2026   0.3475   0.1151   0.0655

All the experiments were performed on a single HP SL230 node equipped with two Intel E5-2640 processors with 12 cores each and 64 GB of main memory. The peak memory usage (PMU) was approximately 12 GB. The searching speed factor (SSF) was 3.46.

To increase the search speed, the distance computation was parallelized on a GPU (NVIDIA GT 610 with 48 cores and 2 GB of GPU memory). The SSF was reduced to 0.85.
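
The slope-based pruning of Section 3 can be illustrated with a minimal sketch. The paper does not state how the slope m is measured, so this sketch assumes it is the number of reference frames spanned per query frame spanned, computed from the two end points of the aligned path. The names `path_slope` and `prune_by_slope`, and the representation of a path as a list of (query frame, reference frame) pairs, are hypothetical and are not taken from the authors' system.

```python
from typing import List, Tuple

# An aligned path is assumed to be a list of (query_frame, reference_frame) pairs.
Path = List[Tuple[int, int]]


def path_slope(path: Path) -> float:
    """End-to-end slope of a path: reference frames per query frame (assumed definition)."""
    if len(path) < 2:
        # A path with fewer than two points has no slope; treat it as out of range.
        return float("inf")
    (q_start, r_start), (q_end, r_end) = path[0], path[-1]
    q_span = q_end - q_start
    if q_span <= 0:
        # No progress along the query axis: treat as infinitely steep.
        return float("inf")
    return (r_end - r_start) / q_span


def prune_by_slope(paths: List[Path], low: float = 0.5, high: float = 2.0) -> List[Path]:
    """Keep only the paths whose slope m satisfies low < m < high (strict bounds)."""
    return [p for p in paths if low < path_slope(p) < high]
```

Under these assumptions, a path running from (0, 100) to (10, 110) has slope 1 and is kept, while a path from (0, 0) to (10, 30) has slope 3 and is discarded as a likely false alarm.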
1 https://sites.google.com/site/bosaristoolkit/

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
4. ANALYSIS OF THE EXPERIMENTS
We analyzed the cases of false alarms and misses for all types of queries. The analysis of false alarms led us to enforce the slope constraint on the aligned path described in Section 3. The results in Table 1 show that the NS-DTW algorithm is able to detect some of the type 2 queries, but fails to detect type 3 queries. Fig. 1(a) shows the similarity matrix plot for a multi-word query with filler content present in the reference. The dark bands represent the match between the query and the reference. In Fig. 1(a) there are multiple dark bands, each showing a match between a part of the query (a word) and a specific location (a word) in the reference. The peaks in the alignment scores in Fig. 1(b) reflect the partial matches. This shows that, for this specific (FDLP + AF-BN) feature representation of speech, the algorithm is capable of detecting smaller/partial matches. Even though the scores reflect the partial matches, we observed that the poor performance of the system is due to the number of false alarms. Further investigation is required to find methods that can penalize the false alarms.

[Figure 1, panel (a): similarity matrix, Query (frames) vs. Reference (frames); panel (b): Alignment scores vs. Reference (frames).]
Figure 1: An example similarity matrix obtained using NS-DTW, when a multi-word query with filler content is present in the reference.

5. USING PHONE DECODER
In this work, we also built a cross-lingual phone decoder and used NS-DTW for search. The cross-lingual decoder was built in a two-step process. As the first step, we trained acoustic models on 24 hours of a Telugu database [3]. These models were then used to decode the MediaEval 2013 SWS database [1]. The decoded symbols were bootstrapped and the models were re-trained. This process was repeated 4 times and the resulting acoustic models were used to obtain the hypotheses (global hypotheses).

We built a phone confusion matrix in an unsupervised way, as follows: (a) we divided the SWS 2013 database into 4 parts and built 4 acoustic models; (b) 4 hypotheses (local hypotheses), each corresponding to a different part of the database, were obtained; (c) a string alignment was done between the global hypotheses and each of the local hypotheses to obtain the phone confusions. The global hypotheses were considered as the reference in computing the phone confusions. Next, the queries and the audio references were decoded using the bootstrapped models, and the search was performed using NS-DTW. The phone confusion matrix was used in the computation of the similarity matrix in the NS-DTW framework.

If i and j are the indices of phones and N is the number of phones in the dictionary, then the similarity between them is given by

    d(i, j) = c(i, j)   ∀   0 ≤ i, j ≤ N

where c(i, j) is the confusion matrix entry with i as the reference phone and j as the query phone.

The SSF in this case was 0.38 and the PMU was approximately 2 GB. The results for various types of queries on the development dataset are shown in Table 2.

Table 2: Scores for various types of queries for phone representation on dev dataset

Scores     All      Type 1   Type 2   Type 3
MinCnxe    0.9487   0.9331   0.9599   0.9641
MTWV       0.0477   0.0799   0.0308   0.0134

6. CONCLUSION
In this work, we explored two different representations of speech. We observed the capabilities and limitations of the NS-DTW algorithm for various types of queries. We also observed that the same algorithm is able to detect some of the type 2 queries in the reference documents. Future work will focus on improving the NS-DTW algorithm for detecting type 2 and type 3 queries, and on developing robust cross-lingual phone decoders.

7. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In Working Notes Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szoke, A. Buzo, and F. Metze. Query by Example Search on Speech at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17 2014.
[3] G. K. Anumanchipalli, R. Chitturi, S. Joshi, S. S. R. Kumar, R. Sitaram, and S. Kishore. Development of Indian language speech databases for LVCSR. In Proc. of SPECOM, Patras, Greece, 2005.
[4] G. Mantena, S. Achanta, and K. Prahallad. Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(5):946–955, May 2014.
[5] G. Mantena and K. Prahallad. Use of articulatory bottle-neck features for query-by-example spoken term detection in low resource scenarios. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7128–7132, May 2014.
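
As an illustration of the confusion-based similarity d(i, j) = c(i, j) used with the phone decoder in Section 5, the following is a minimal sketch under stated assumptions. It assumes the string alignment has already produced (reference phone, hypothesis phone) pairs, and it row-normalizes the confusion counts; the paper does not say whether the counts were normalized, so that step is an assumption. The function names `confusion_matrix` and `similarity_matrix` are hypothetical.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(aligned_pairs: Sequence[Tuple[str, str]],
                     phones: Sequence[str]) -> List[List[float]]:
    """Build c[i][j] from aligned (reference phone, hypothesis phone) pairs.

    Rows are reference phones, columns are hypothesis phones. Each non-empty
    row is normalized to sum to 1 (an assumption, not stated in the paper).
    """
    index = {p: k for k, p in enumerate(phones)}
    n = len(phones)
    counts = [[0.0] * n for _ in range(n)]
    for ref, hyp in aligned_pairs:
        counts[index[ref]][index[hyp]] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    return counts


def similarity_matrix(query: Sequence[str], reference: Sequence[str],
                      c: List[List[float]],
                      phones: Sequence[str]) -> List[List[float]]:
    """Similarity between decoded phone sequences: entry [r][q] is
    c(reference phone, query phone), i.e. d(i, j) = c(i, j)."""
    index = {p: k for k, p in enumerate(phones)}
    return [[c[index[r]][index[q]] for q in query] for r in reference]
```

The resulting matrix has one row per reference phone and one column per query phone, matching the orientation of Fig. 1(a), and would take the place of the frame-level similarity matrix that a DTW-style search operates on.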