=Paper= {{Paper |id=Vol-1263/paper76 |storemode=property |title=IIIT-H System for MediaEval 2014 QUESST |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_76.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/KesirajuMP14 }} ==IIIT-H System for MediaEval 2014 QUESST== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_76.pdf

IIIT-H System for MediaEval 2014 QUESST

Santosh Kesiraju, Gautam Mantena, Kishore Prahallad
International Institute of Information Technology-Hyderabad, India
{santosh.k, gautam.mantena}@research.iiit.ac.in, kishore@iiit.ac.in

ABSTRACT
This paper describes the experiments and observations for the Query-by-Example Search on Speech Task (QUESST) at MediaEval 2014. We describe two different representations of speech that were explored for the task. We also show the capabilities and limitations of the non-segmental dynamic time warping (NS-DTW) technique for searching various types of queries. The paper mainly focuses on the experiments and analysis of the existing NS-DTW algorithm for various types of queries. The observations show that, for a specific representation of speech, the algorithm is capable of detecting partial matches.

1. INTRODUCTION
Some of the approaches for query-by-example spoken term detection rely on building models from resource-rich languages, and use these models to convert the speech data into a sequence of symbols. Building models for multi-lingual data is a challenging task, as phone classes are not language universal. Another way is to rely on dynamic time warping (DTW) based techniques for matching two time series of vectors. Here, speech data is usually represented as Gaussian posteriorgrams (GP) of various acoustic features.

For the MediaEval 2014 QUESST task [2], we have explored unsupervised techniques involving various representations for the speech data. Initially, we represented the speech data using GP of acoustic and bottle-neck features. We have also built a cross-lingual ASR and decoded the speech data into a sequence of symbols (phone sequences). Both representations rely on DTW to detect the queries in the audio references.

2. FEATURE REPRESENTATION
A three-step process to generate the features for the queries and the audio references is described here. (a) 39-dimensional frequency domain linear prediction (FDLP) features, along with delta and acceleration coefficients, were extracted for every 25 ms window with a shift of 10 ms. An all-pole model of order 160 poles/sec and 37 filter banks were considered to extract the FDLP features. (b) Bottle-neck (BN) features were derived from a multi-layer perceptron (MLP) trained with articulatory features (AF). (c) Gaussian posteriorgrams were computed for the speech parameters (FDLP) in tandem with the articulatory bottle-neck features.

Bottle-neck features are a form of compressed features which are of lower dimension and also capture the classification properties of the target classes. These features were obtained from the MLP trained on 24 hours of a labeled Telugu database [3]. The articulatory bottle-neck features were extracted as described in [5].

3. NS-DTW FOR SEARCH
We used a variant of DTW called non-segmental DTW (NS-DTW) [4], which differs in the local constraints. As a post-processing method, we pruned out some of the results. The pruning criterion is based on the slope of the aligned path. If m is the slope of the aligned path, then only the paths satisfying 0.5 < m < 2 were considered. This helped us in eliminating some of the false alarms. We used the linear calibration function in the bosaris toolkit (see footnote 1) to calibrate the scores. Table 1 shows the results on the development and evaluation datasets for different types of queries.

Table 1: Scores for various types of queries for the (FDLP + AF-BN) feature representation on the dev and eval datasets

dev dataset (columns are types of queries)
Scores    All      Type 1   Type 2   Type 3
MinCnxe   0.8070   0.6734   0.8739   0.8986
Cnxe      0.9121   0.8032   1.0121   1.0235
MTWV      0.2263   0.3715   0.1472   0.0430
ATWV      0.2261   0.3662   0.1467   0.0425

eval dataset (columns are types of queries)
Scores    All      Type 1   Type 2   Type 3
MinCnxe   0.8117   0.7006   0.8576   0.8936
Cnxe      0.9218   0.8115   1.0205   1.0012
MTWV      0.2062   0.3506   0.1188   0.0770
ATWV      0.2026   0.3475   0.1151   0.0655

All the experiments were performed on a single HP SL230 node equipped with two Intel E5-2640 processors with 12 cores each and 64 GB of main memory. The peak memory usage (PMU) was approximately 12 GB. The searching speed factor (SSF) was 3.46.

To increase the search speed, the distance computation was parallelized on a GPU (NVIDIA GT 610 with 48 cores and 2 GB of GPU memory). The SSF was reduced to 0.85.
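
The slope-based pruning described in Section 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not say how the slope m is measured, so the sketch assumes m is the number of reference frames advanced per query frame between the first and last points of an aligned path. The names `path_slope` and `prune_by_slope` are hypothetical.

```python
from typing import List, Sequence, Tuple

# An aligned path is a sequence of (query_frame, reference_frame) index pairs.
Path = Sequence[Tuple[int, int]]


def path_slope(path: Path) -> float:
    """Return the end-to-end slope of a path (assumed definition).

    Slope = reference frames advanced / query frames advanced.
    Empty or degenerate paths get an infinite slope so they are pruned.
    """
    if not path:
        return float("inf")
    (q_start, r_start), (q_end, r_end) = path[0], path[-1]
    query_span = q_end - q_start
    if query_span <= 0:
        return float("inf")  # no progress along the query axis
    return (r_end - r_start) / query_span


def prune_by_slope(paths: List[Path], low: float = 0.5,
                   high: float = 2.0) -> List[Path]:
    """Keep only the paths whose slope m satisfies low < m < high."""
    return [p for p in paths if low < path_slope(p) < high]


if __name__ == "__main__":
    diagonal = [(0, 100), (50, 150), (100, 200)]  # slope 1.0, kept
    steep = [(0, 0), (10, 50)]                    # slope 5.0, pruned
    flat = [(0, 0), (100, 20)]                    # slope 0.2, pruned
    kept = prune_by_slope([diagonal, steep, flat])
    print(len(kept))
```

With the default bounds, a path that stretches the query by more than a factor of two, or compresses it to less than half, is discarded as a likely false alarm.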
Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
Footnote 1: https://sites.google.com/site/bosaristoolkit/
4. ANALYSIS OF THE EXPERIMENTS
We have analyzed the cases of false alarms and misses for all types of queries. The analysis of false alarms helped us in enforcing the slope constraint on the aligned path described in Section 3. The results in Table 1 show that the NS-DTW algorithm is able to detect some of the type 2 queries, but fails in detecting type 3 queries. Fig. 1(a) shows the similarity matrix plot for a multi-word query with filler content present in the reference. The dark bands represent the match between the query and the reference. In Fig. 1(a) there are multiple dark bands, each showing a match between a part of the query (a word) and a specific location (a word) in the reference. The peaks in the alignment scores in Fig. 1(b) reflect the partial matches. This shows that, for this specific (FDLP + AF-BN) feature representation of speech, the algorithm is capable of detecting smaller/partial matches. Even though the scores reflect the partial matches, we have observed that the poor performance of the system is due to the number of false alarms. Further investigation is required to find methods that can penalize the false alarms.

[Figure 1, panel (a): similarity matrix, Query (frames) against Reference (frames). Panel (b): Alignment scores against Reference (frames).]

Figure 1: An example similarity matrix obtained using NS-DTW, when a multi-word query with filler content is present in the reference.

5. USING PHONE DECODER
In this work, we have also built a cross-lingual phone decoder and used NS-DTW for search. The cross-lingual decoder was built in a two-step process. As the first step, we trained acoustic models on 24 hours of a Telugu database [3]. Then these models were used to decode the MediaEval 2013 SWS database [1]. The decoded symbols were bootstrapped and the models were re-trained. This process was repeated 4 times and the resulting acoustic models were used to obtain the hypotheses (global hypotheses).

We have built a phone confusion matrix in an unsupervised way, as follows: (a) We divided the SWS 2013 database into 4 parts and built 4 acoustic models. (b) 4 hypotheses (local hypotheses), each corresponding to a different part of the database, were obtained. (c) A string alignment was done between the global hypotheses and each of the local hypotheses to obtain the phone confusions. The global hypotheses were considered as the reference in computing the phone confusions. Next, the queries and the audio references were decoded using the bootstrapped models, and the search was performed using NS-DTW. The phone confusion matrix was used in the computation of the similarity matrix in the NS-DTW framework.

If i and j are the indices of phones and N is the number of phones in the dictionary, then the similarity between them is given by,

d(i, j) = c(i, j) ∀ 0 ≤ i, j ≤ N

where c(i, j) is the confusion matrix entry with i being the reference phone and j being the query phone.

The SSF in this case was 0.38 and the PMU was approximately 2 GB. The results for various types of queries on the development dataset are shown in Table 2.

Table 2: Scores for various types of queries for phone representation on dev dataset

Phone representation (columns are types of queries)
Scores    All      Type 1   Type 2   Type 3
MinCnxe   0.9487   0.9331   0.9599   0.9641
MTWV      0.0477   0.0799   0.0308   0.0134

6. CONCLUSION
In this work, we have explored two different representations of speech. We have observed the capabilities and limitations of the NS-DTW algorithm for various types of queries. We have also observed that the same algorithm is able to detect some of the type 2 queries in the reference documents. Future work is focused on improving the NS-DTW algorithm for detecting type 2 and type 3 queries, and on developing robust cross-lingual phone decoders.

7. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In Working Notes Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szoke, A. Buzo, and F. Metze. Query by Example Search on Speech at Mediaeval 2014. In Working Notes Proceedings of the Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17 2014.
[3] G. K. Anumanchipalli, R. Chitturi, S. Joshi, S. S. R. Kumar, R. Sitaram, and S. Kishore. Development of Indian language speech databases for LVCSR. In Proc. of SPECOM, Patras, Greece, 2005.
[4] G. Mantena, S. Achanta, and K. Prahallad. Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(5):946–955, May 2014.
[5] G. Mantena and K. Prahallad. Use of articulatory bottle-neck features for query-by-example spoken term detection in low resource scenarios. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7128–7132, May 2014.
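
The confusion-based similarity of Section 5, d(i, j) = c(i, j), can be illustrated with the short sketch below. It is a hedged example rather than the authors' code: the paper does not state how the confusion counts are normalised, so the sketch assumes row normalisation (each reference phone's counts sum to 1) and assumes that only aligned phone pairs (substitutions and matches) are counted. The function names and the toy phone set are hypothetical.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(aligned_pairs: Sequence[Tuple[str, str]],
                     phones: Sequence[str]) -> List[List[float]]:
    """Build c[i][j]: how often reference phone i was decoded as phone j.

    aligned_pairs holds (reference_phone, hypothesis_phone) pairs taken from
    a string alignment. Rows are normalised to sum to 1 (an assumption).
    """
    index = {p: k for k, p in enumerate(phones)}
    n = len(phones)
    counts = [[0.0] * n for _ in range(n)]
    for ref_phone, hyp_phone in aligned_pairs:
        counts[index[ref_phone]][index[hyp_phone]] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            row[:] = [v / total for v in row]
    return counts


def similarity_matrix(query: Sequence[str], reference: Sequence[str],
                      c: List[List[float]],
                      phones: Sequence[str]) -> List[List[float]]:
    """Return S[q][r] = c(reference phone, query phone) for every phone pair.

    Rows follow the query phone sequence, columns the reference sequence.
    """
    index = {p: k for k, p in enumerate(phones)}
    return [[c[index[r]][index[q]] for r in reference] for q in query]


if __name__ == "__main__":
    phones = ["a", "b"]
    pairs = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "b")]
    c = confusion_matrix(pairs, phones)
    sim = similarity_matrix(["a", "b"], ["a", "b", "b"], c, phones)
    for row in sim:
        print(row)
```

The resulting matrix plays the role that the frame-level similarity matrix plays for the posteriorgram representation, so the same NS-DTW search can run over decoded phone sequences.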