UNIZA System for the Spoken Web Search Task at MediaEval 2013

Roman Jarina, Michal Kuba, Róbert Gubka, Michal Chmulik, Martin Paralič
Audiolab, Department of Telecommunications and Multimedia,
University of Žilina, Univerzitná 1, 010 26 Žilina, Slovakia
{roman.jarina, michal.kuba, robert.gubka, michal.chmulik, martin.paralic}@fel.uniza.sk

ABSTRACT
In this paper we present an approach to detecting spoken keywords according to a given query, as part of the MediaEval benchmark. The proposed approach is based on the concept of modelling the speech query as a concatenation of language-independent quasi-phoneme models, which are derived by unsupervised clustering on varied audio data. Since only an initial version of the system is presented, issues concerning further system improvements are also discussed.


1. INTRODUCTION
Details of the UNIZA submission for the Spoken Web Search (SWS) task within the MediaEval 2013 benchmark initiative are described below. SWS requires the development of a language-independent audio search system that, given an audio query, is able to find the same speech phrases in audio content [1]. Our proposed method is motivated by the generally accepted approach to keyword spotting that relies on concatenation of probabilistic models (usually Hidden Markov Models, HMMs) of speech units [2]. Such an approach implicitly assumes that the language structure of the speech is known a priori, so that acoustic models of speech units can be developed in advance from labelled training speech data. This is not the case for the SWS task, where neither language information nor any transcription is provided [1]. Hence the objective is to find a similar but more general language-independent, low-resource approach to the acoustic modelling of speech.

The developed system generates stochastic models of "elementary sounds" (ES) derived from the provided speech data in various languages. These ESs are used as building blocks for speech modelling instead of conventional phoneme-based models. We recently applied this approach to generic sound modelling and retrieval [3] with promising results. The approach in [3] is built on the assumption that, in general, many types of generic sounds can be modelled as sequences of ES units picked from a sufficiently large (much greater than the number of speech units), though finite, inventory. Due to the diverse nature of generic sounds, it is infeasible to define an acoustical form for the elementary units; instead, the ESs can be defined only by their stochastic models.
2. PROPOSED APPROACH
The system layout is depicted in Figure 1. Each utterance (query) is modelled by an HMM-based statistical model. Due to the lack of information about the language and linguistic structure of the utterances, models of phonemes (or any other linguistic units) obviously could not be built in advance. Instead, we propose to build models of language-independent quasi-phoneme-like ES units with the aid of unsupervised clustering. The HMM for each query is then built up by concatenating such ES models (or fillers). Searching for the query over the database is performed by Viterbi decoding [2]. The decoder outputs cumulative probabilities (confidences) together with the starting positions from which the cumulative probabilities were computed. Candidates for search results are obtained by thresholding the confidence curve. We used only the provided SWS2013 development data during system development.

Figure 1. System layout
Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

2.1 Feature extraction
We applied a 30 ms sliding window with a 10 ms shift for the computation of MFCCs (Mel-Frequency Cepstral Coefficients), which may be considered a standard in speech recognition. Each frame is represented by a 39-dimensional feature vector composed of 13 MFCCs (incl. the 0th cepstral coefficient) together with their delta and delta-delta derivatives, followed by Cepstral Mean Normalisation. The feature vectors form the observations for the subsequent acoustic modelling by HMMs.
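The front end described above can be sketched compactly. The authors implemented their system in Matlab, so the following Python/librosa version, with an assumed 8 kHz sampling rate, only illustrates the described pipeline rather than reproducing their code.

import librosa
import numpy as np

def extract_features(wav_path, sr=8000):
    # 13 MFCCs (librosa includes the 0th coefficient as the first row)
    # over a 30 ms analysis window with a 10 ms shift
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.030 * sr),       # 30 ms window
        hop_length=int(0.010 * sr),  # 10 ms shift
    )
    delta = librosa.feature.delta(mfcc)            # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivatives
    feats = np.vstack([mfcc, delta, delta2])       # 39 x T
    feats -= feats.mean(axis=1, keepdims=True)     # cepstral mean normalisation
    return feats.T                                 # T observation vectors of dim 39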
2.2 Concept of quasi-phoneme-like models
The overall performance of our proposed approach on the SWS task relies on how precisely the developed models of ES units mimic real phoneme-based units. They should therefore preserve linguistic information as much as possible, while non-linguistic features (related to speaker/gender variability, emotion, background noise, channel characteristics, etc.) should be suppressed. This is a very challenging problem and can be solved only to a certain extent. We are aware that the developed ES units are only a rough approximation of linguistic units if no information about language structure is taken into account.

In these initial experiments, we developed the ES models by the following procedure:
1) All 20 hours of audio of the SWS2013 dataset were parameterized as a sequence of MFCC vectors.
2) The means and variances of the Gaussian components of a semi-continuous Gaussian mixture PDF were estimated by unsupervised clustering of all MFCC vectors from the dataset. The clustering was performed by the K-variable K-means procedure [4] (a sketch follows the list below). During clustering, small clusters (containing fewer than 30 vectors) were discarded. This iterative procedure converged to 257 clusters, whose centroids defined the means of the PDF's Gaussian components.
3) In the next step, the sequence of MFCC vectors computed from the SWS2013 dataset was divided into segments of 20 vectors with 50% overlap, each spanning 0.1 seconds of audio, which we initially consider to be the average duration of an ES unit. A 1-state semi-continuous density HMM (SD-HMM) was then estimated for each segment (a second sketch, after Figure 2, illustrates the weight estimation). Since all SD-HMMs utilize PDFs with the same Gaussian components, only the weights of the Gaussians in the mixtures had to be computed during ES model estimation. Thus each HMM state is defined by a PDF composed of a Gaussian mixture described by a 257-dimensional weight vector. This approach is much less computationally demanding than conventional HMM training. Since it results in the creation of about 1.5 million models, the set of weight vectors was massively reduced, again by K-variable K-means clustering (only the 207 clusters with the highest number of vectors were preserved). This procedure yields 207 models of ES units.
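As referenced in step 2, the shrinking behaviour of the K-variable K-means can be sketched as follows. The initial K, iteration count, and seeding are our assumptions (the exact procedure is given in [4]), and the full distance matrix is used for clarity, not memory efficiency; for 20 hours of frames one would process the data in chunks.

import numpy as np

def k_variable_kmeans(X, k_init=512, min_count=30, n_iter=50, seed=0):
    # K-means whose K shrinks: clusters attracting fewer than min_count
    # vectors are discarded between iterations (cf. [4])
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k_init, replace=False)]
    for _ in range(n_iter):
        # nearest-centroid assignment under squared Euclidean distance
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centroids))
        keep = np.flatnonzero(counts >= min_count)
        # re-estimate the surviving centroids only
        centroids = np.stack([X[labels == k].mean(axis=0) for k in keep])
    # diagonal variances of the surviving clusters complete the Gaussian pool
    variances = np.stack([X[labels == k].var(axis=0) for k in keep])
    return centroids, variances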
Figure 2. ES models development (feature extraction and cluster analysis to estimate the PDF's Gaussian components from the audio data; data stream segmentation, ES model estimation, and a second cluster analysis to reduce the number of models)
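For a single state drawing on a shared Gaussian pool, the weight estimation of step 3 reduces to averaging the posterior responsibilities of the pool's components over the segment's frames. The sketch below makes this concrete; the function name and the diagonal-covariance choice are ours.

import numpy as np
from scipy.stats import multivariate_normal

def segment_weights(segment, means, variances):
    # segment: (20, 39) block of MFCC vectors; means, variances: (257, 39)
    # shared Gaussian pool estimated by the clustering in step 2
    lik = np.stack(
        [multivariate_normal.pdf(segment, mean=m, cov=np.diag(v))
         for m, v in zip(means, variances)],
        axis=1,
    )                                             # (20, 257) frame likelihoods
    resp = lik / lik.sum(axis=1, keepdims=True)   # per-frame responsibilities
    return resp.mean(axis=0)                      # 257-dim mixture weight vector

The roughly 1.5 million weight vectors obtained this way are then reduced to 207 representatives by the same K-variable K-means.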
The queries were represented by concatenations of these ES models, as shown in Figure 1. For each query, the best sequence of ES models was obtained by Viterbi decoding according to [2]. It should be noted that during decoding, each 1-state HMM was replaced by a 10-equal-state HMM (obtained by concatenating copies of the same state), with the aim of avoiding the detection of very short segments. This ensured that the decoded sequence passed through the same state at least 10 times, i.e. the process remained in the same state for at least 0.1 seconds.
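The 10-state replication can be read as a minimum-duration constraint. A small sketch of the resulting left-to-right transition structure follows; the self-loop probability is an illustrative assumption, and all ten copies share the single state's emission PDF.

import numpy as np

def min_duration_chain(n_copies=10, self_loop=0.5):
    # Ten copies of one ES state chained left to right without skips:
    # any Viterbi path must visit each copy, so the unit consumes at
    # least n_copies frames (0.1 s at a 10 ms frame shift)
    A = np.zeros((n_copies, n_copies + 1))  # last column = exit transition
    for i in range(n_copies):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    return A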
3. RESULTS AND DISCUSSION
The maximal confidence obtained by Viterbi decoding was chosen as the score in the submission. Segments with a score above the given threshold (the threshold was tuned on the development queries) were labelled with a YES decision. In an effort to decrease the high false alarm rate, segments with duration D > 2Q or D < 0.5Q (where Q is the duration of the query) were filtered out.
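This decision rule amounts to a score threshold plus a duration gate. A minimal sketch, assuming candidates arrive as (start, duration, score) triples (a convention of ours):

def decide(candidates, query_dur, threshold):
    # Keep YES hits: score above the tuned threshold and duration D
    # within [0.5 * Q, 2 * Q], where Q is the query duration
    return [
        (start, dur, score)
        for (start, dur, score) in candidates
        if score > threshold and 0.5 * query_dur <= dur <= 2.0 * query_dur
    ]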
The performance of the submitted system was evaluated in terms of several measures defined by the SWS2013 task [1, 5]: the Actual/Maximum Term-Weighted Values ATWV/MTWV (a weighted combination of miss and false alarm error rates), and the normalized cross-entropy metric Cnxe. The official results (late submission) are summarized in Table 1. In addition, the amounts of processing resources, namely the Indexing and Searching Speed Factors (ISF/SSF) as defined in [5], are estimated.

We reckon that the following bottlenecks caused the very low performance of this first version of the submitted system:
- Since the ES models are obtained by unsupervised clustering, a very important issue is that features belonging to the same linguistic unit should be grouped together in the feature vector space. The problem with MFCC-based features is that they are suitable for both speech recognition and speaker discrimination tasks, which is disadvantageous in our case. A discriminant analysis transform of the MFCCs, yielding speaker-independent features, might be applied prior to clustering;
- We have not investigated the impact of the size of the ES inventory on the search performance. Due to lack of time, only a simplified HMM training without re-estimation was applied for ES model development;
- After inspecting the results, we noticed that ES models of noise and silence were found very often in the decoded sequences, and the overall confidence was affected by these "non-linguistic" models. Proper prior speech activity detection on the queries as well as the audio content (to avoid modelling non-speech events) would also help.

Table 1. The submission results
          ATWV      MTWV     Cnxe     Cnxe_min
Devel.    -0.091    0        1.011    0.951
Eval.     -0.027    0.001    1.011    0.945

The system was run on a workstation with a 32-core CPU. All the programming was done in Matlab and was not optimized for speed. In the pre-processing (indexing) phase, only the MFCC features were precomputed. This took about 1/60 of real time, i.e. ISF = 1205 / (71839 + 696 sec.) = 0.017 for the evaluation data. Note that the computation of the ES models is not included in the ISF; in the submitted version of the system, audio content-based adaptation of the ES models is not considered, so the development of the ES models may be seen as a separate stage. The processing time spent on searching (recomputed for a single CPU) is approximately SSF = 1.08 x 10^6 / (71839 x 696 sec.) = 0.022. The peak memory usage during searching is very low, roughly 1-10 MBytes, because only the Viterbi algorithm is performed. Hence the proposed system, if tuned (and compiled in another, more suitable programming environment), might be computationally very efficient.
4. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes, "The spoken web search task," MediaEval 2013 Workshop, Oct. 18-19, 2013, Barcelona, Spain.
[2] J. Nouza and J. Silovsky, "Fast keyword spotting in telephone speech," Radioengineering, 18(4), 2009, 665-670.
[3] R. Gubka and M. Kuba, "Elementary sound based audio pattern searching," 23rd Int. Conf. Radioelektronika 2013, April 16-17, 2013, 325-328. DOI = http://dx.doi.org/10.1109/RadioElek.2013.6530940
[4] M. J. Reyes-Gomez and D. P. W. Ellis, "Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling," Int. Conf. on Multimedia and Expo, ICME '03, July 2003, I-73-76.
[5] L. J. Rodriguez-Fuentes and M. Penagarikano, "MediaEval 2013 Spoken Web Search Task: System Performance Measures," TR-2013-1, Dept. of Electricity and Electronics, Univ. of the Basque Country, 2013, http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf