UNIZA System for the Spoken Web Search Task at MediaEval 2013

Roman Jarina, Michal Kuba, Róbert Gubka, Michal Chmulik, Martin Paralič
Audiolab, Department of Telecommunications and Multimedia,
University of Žilina, Univerzitná 1, 010 26 Žilina, Slovakia
{roman.jarina, michal.kuba, robert.gubka, michal.chmulik, martin.paralic}@fel.uniza.sk

ABSTRACT
In this paper we present an approach to detecting spoken keywords according to a given query, as part of the MediaEval benchmark. The proposed approach is based on the concept of modelling the speech query as a concatenation of language-independent quasi-phoneme models, which are derived by unsupervised clustering on varied audio data. Since only an initial version of the system is presented, issues concerning further system improvements are also discussed.


1. INTRODUCTION
Details of the UNIZA submission for the Spoken Web Search (SWS) task within the MediaEval 2013 benchmark initiative are described below. SWS requires the development of a language-independent audio search system that, given an audio query, is able to find the same speech phrases in audio content [1]. Our proposed method is motivated by the generally accepted approach to keyword spotting that relies on concatenation of probabilistic models (usually Hidden Markov Models, HMMs) of speech units [2]. Such an approach implicitly assumes that the language structure of the speech is known a priori, so that acoustic models of speech units can be developed in advance from labelled training speech data. This is not the case for the SWS task, where neither language information nor any transcription is provided [1]. Hence the objective is to find a similar but more general language-independent, low-resource approach to the acoustic modelling of speech.

The developed system generates stochastic models of "elementary sounds" (ES) derived from the provided speech data in various languages. These ESs are used as building blocks for speech modelling instead of conventional phoneme-based models. We recently applied this approach to generic sound modelling and retrieval [3] with promising results. The approach in [3] is built on the assumption that, in general, many types of generic sounds can be modelled as sequences of ES units picked from a sufficiently large (much greater than the number of speech units), though finite, inventory. Due to the diverse nature of generic sounds, it is infeasible to define an acoustical form for the elementary units; instead, the ESs can be defined only by their stochastic models.
2. PROPOSED APPROACH
The system layout is depicted in Figure 1. Each utterance (query) is modelled by an HMM-based statistical model. Due to the lack of information about the language and linguistic structure of the utterances, models of phonemes (or any other linguistic units) obviously could not be built in advance. Instead, we propose to build models of language-independent quasi-phoneme-like ES units with the aid of unsupervised clustering. The HMM for each query is then built up by concatenating such ES models (or fillers). Searching for the query over the database is performed by Viterbi decoding [2]. The decoder outputs cumulative probabilities (confidences) together with the starting positions from which the cumulative probabilities were computed. Candidates for search results are obtained by thresholding the confidence curve. We used only the provided SWS2013 development data during system development.

Figure 1. System layout
Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

2.1 Feature extraction
We applied a 30 ms sliding window with a 10 ms shift for the computation of MFCCs (Mel-Frequency Cepstral Coefficients), which may be considered a standard in speech recognition. Each frame is represented by a 39-dimensional feature vector composed of 13 MFCCs (incl. the 0th cepstral coefficient) together with their delta and delta-delta derivatives, followed by Cepstral Mean Normalisation. The feature vectors form the observations for the subsequent acoustic modelling by HMMs.
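The front end described above can be sketched compactly. The authors implemented their system in Matlab, so the following Python/librosa version, with an assumed 8 kHz sampling rate, only illustrates the described pipeline rather than reproducing their code.

import librosa
import numpy as np

def extract_features(wav_path, sr=8000):
    # 13 MFCCs (librosa includes the 0th coefficient as the first row)
    # over a 30 ms analysis window with a 10 ms shift
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.030 * sr),       # 30 ms window
        hop_length=int(0.010 * sr),  # 10 ms shift
    )
    delta = librosa.feature.delta(mfcc)            # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivatives
    feats = np.vstack([mfcc, delta, delta2])       # 39 x T
    feats -= feats.mean(axis=1, keepdims=True)     # cepstral mean normalisation
    return feats.T                                 # T observation vectors of dim 39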
2.2 Concept of quasi-phoneme-like models
The overall performance of our proposed approach on the SWS task relies on how precisely the developed models of ES units mimic real phoneme-based units. They should therefore preserve linguistic information as much as possible, while non-linguistic features (related to speaker/gender variability, emotion, background noise, channel characteristics, etc.) should be suppressed. This is a very challenging problem and can be solved only to a certain extent. We are aware that the developed ES units are only a rough approximation of linguistic units if no information about language structure is taken into account.

In these initial experiments, we developed the ES models by the following procedure:
1) All 20 hours of audio of the SWS2013 dataset were parameterized as a sequence of MFCC vectors.
2) The means and variances of the Gaussian components of a semi-continuous Gaussian mixture PDF were estimated by unsupervised clustering of all MFCC vectors from the dataset. The clustering was performed by the K-variable K-means procedure [4] (a sketch follows the list below). During clustering, small clusters (containing fewer than 30 vectors) were discarded. This iterative procedure converged to 257 clusters, whose centroids defined the means of the PDF's Gaussian components.
3) In the next step, the sequence of MFCC vectors computed from the SWS2013 dataset was divided into segments of 20 vectors with 50% overlap, each spanning 0.1 seconds of audio, which we initially consider to be the average duration of an ES unit. A 1-state semi-continuous density HMM (SD-HMM) was then estimated for each segment (a second sketch, after Figure 2, illustrates the weight estimation). Since all SD-HMMs utilize PDFs with the same Gaussian components, only the weights of the Gaussians in the mixtures had to be computed during ES model estimation. Thus each HMM state is defined by a PDF composed of a Gaussian mixture described by a 257-dimensional weight vector. This approach is much less computationally demanding than conventional HMM training. Since it results in the creation of about 1.5 million models, the set of weight vectors was massively reduced, again by K-variable K-means clustering (only the 207 clusters with the highest number of vectors were preserved). This procedure yields 207 models of ES units.
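As referenced in step 2, the shrinking behaviour of the K-variable K-means can be sketched as follows. The initial K, iteration count, and seeding are our assumptions (the exact procedure is given in [4]), and the full distance matrix is used for clarity, not memory efficiency; for 20 hours of frames one would process the data in chunks.

import numpy as np

def k_variable_kmeans(X, k_init=512, min_count=30, n_iter=50, seed=0):
    # K-means whose K shrinks: clusters attracting fewer than min_count
    # vectors are discarded between iterations (cf. [4])
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k_init, replace=False)]
    for _ in range(n_iter):
        # nearest-centroid assignment under squared Euclidean distance
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centroids))
        keep = np.flatnonzero(counts >= min_count)
        # re-estimate the surviving centroids only
        centroids = np.stack([X[labels == k].mean(axis=0) for k in keep])
    # diagonal variances of the surviving clusters complete the Gaussian pool
    variances = np.stack([X[labels == k].var(axis=0) for k in keep])
    return centroids, variances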
Figure 2. ES models development (feature extraction and cluster analysis to estimate the PDF's Gaussian components from the audio data; data stream segmentation, ES model estimation, and a second cluster analysis to reduce the number of models)
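For a single state drawing on a shared Gaussian pool, the weight estimation of step 3 reduces to averaging the posterior responsibilities of the pool's components over the segment's frames. The sketch below makes this concrete; the function name and the diagonal-covariance choice are ours.

import numpy as np
from scipy.stats import multivariate_normal

def segment_weights(segment, means, variances):
    # segment: (20, 39) block of MFCC vectors; means, variances: (257, 39)
    # shared Gaussian pool estimated by the clustering in step 2
    lik = np.stack(
        [multivariate_normal.pdf(segment, mean=m, cov=np.diag(v))
         for m, v in zip(means, variances)],
        axis=1,
    )                                             # (20, 257) frame likelihoods
    resp = lik / lik.sum(axis=1, keepdims=True)   # per-frame responsibilities
    return resp.mean(axis=0)                      # 257-dim mixture weight vector

The roughly 1.5 million weight vectors obtained this way are then reduced to 207 representatives by the same K-variable K-means.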
The queries were represented by concatenations of these ES models, as shown in Figure 1. For each query, the best sequence of ES models was obtained by Viterbi decoding according to [2]. It should be noted that during decoding, each 1-state HMM was replaced by a 10-equal-state HMM (obtained by concatenating copies of the same state), with the aim of avoiding the detection of very short segments. This ensured that the decoded sequence passed through the same state at least 10 times, i.e. the process remained in the same state for at least 0.1 seconds.
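The 10-state replication can be read as a minimum-duration constraint. A small sketch of the resulting left-to-right transition structure follows; the self-loop probability is an illustrative assumption, and all ten copies share the single state's emission PDF.

import numpy as np

def min_duration_chain(n_copies=10, self_loop=0.5):
    # Ten copies of one ES state chained left to right without skips:
    # any Viterbi path must visit each copy, so the unit consumes at
    # least n_copies frames (0.1 s at a 10 ms frame shift)
    A = np.zeros((n_copies, n_copies + 1))  # last column = exit transition
    for i in range(n_copies):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    return A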
3. RESULTS AND DISCUSSION
The maximal confidence obtained by Viterbi decoding was chosen as the score in the submission. Segments with a score above the given threshold (the threshold was tuned on the development queries) were labelled with a YES decision. In an effort to decrease the high false alarm rate, segments with duration D > 2Q or D < 0.5Q (where Q is the duration of the query) were filtered out.
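This decision rule amounts to a score threshold plus a duration gate. A minimal sketch, assuming candidates arrive as (start, duration, score) triples (a convention of ours):

def decide(candidates, query_dur, threshold):
    # Keep YES hits: score above the tuned threshold and duration D
    # within [0.5 * Q, 2 * Q], where Q is the query duration
    return [
        (start, dur, score)
        for (start, dur, score) in candidates
        if score > threshold and 0.5 * query_dur <= dur <= 2.0 * query_dur
    ]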
The performance of the submitted system was evaluated in terms of several measures defined by the SWS2013 task [1, 5]: the Actual/Maximum Term-Weighted Values ATWV/MTWV (a weighted combination of miss and false alarm error rates), and the normalized cross-entropy metric Cnxe. The official results (late submission) are summarized in Table 1. In addition, the amounts of processing resources, namely the Indexing and Searching Speed Factors (ISF/SSF) as defined in [5], are estimated.

We reckon that the following bottlenecks caused the very low performance of this first version of the submitted system:
- Since the ES models are obtained by unsupervised clustering, a very important issue is that features belonging to the same linguistic unit should be grouped together in the feature vector space. The problem with MFCC-based features is that they are suitable for both speech recognition and speaker discrimination tasks, which is disadvantageous in our case. A discriminant analysis transform of the MFCCs, yielding speaker-independent features, might be applied prior to clustering;
- We have not investigated the impact of the size of the ES inventory on the search performance. Due to lack of time, only a simplified HMM training without re-estimation was applied for ES model development;
- After inspecting the results, we noticed that ES models of noise and silence were found very often in the decoded sequences, and the overall confidence was affected by these "non-linguistic" models. Proper prior speech activity detection on the queries as well as the audio content (to avoid modelling non-speech events) would also help.

Table 1. The submission results
          ATWV      MTWV     Cnxe     Cnxe_min
Devel.    -0.091    0        1.011    0.951
Eval.     -0.027    0.001    1.011    0.945

The system was run on a workstation with a 32-core CPU. All the programming was done in Matlab and was not optimized for speed. In the pre-processing (indexing) phase, only the MFCC features were precomputed. This took about 1/60 of real time, i.e. ISF = 1205 / (71839 + 696 sec.) = 0.017 for the evaluation data. Note that the computation of the ES models is not included in the ISF; in the submitted version of the system, audio content-based adaptation of the ES models is not considered, so the development of the ES models may be seen as a separate stage. The processing time spent on searching (recomputed for a single CPU) is approximately SSF = 1.08 x 10^6 / (71839 x 696 sec.) = 0.022. The peak memory usage during searching is very low, roughly 1-10 MBytes, because only the Viterbi algorithm is performed. Hence the proposed system, if tuned (and compiled in another, more suitable programming environment), might be computationally very efficient.
4. REFERENCES
[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes, "The spoken web search task," MediaEval 2013 Workshop, Oct. 18-19, 2013, Barcelona, Spain.
[2] J. Nouza and J. Silovsky, "Fast keyword spotting in telephone speech," Radioengineering, 18(4), 2009, 665-670.
[3] R. Gubka and M. Kuba, "Elementary sound based audio pattern searching," 23rd Int. Conf. Radioelektronika 2013, April 16-17, 2013, 325-328. DOI = http://dx.doi.org/10.1109/RadioElek.2013.6530940
[4] M. J. Reyes-Gomez and D. P. W. Ellis, "Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling," Int. Conf. on Multimedia and Expo, ICME '03, July 2003, I-73-76.
[5] L. J. Rodriguez-Fuentes and M. Penagarikano, "MediaEval 2013 Spoken Web Search Task: System Performance Measures," TR-2013-1, Dept. of Electricity and Electronics, Univ. of the Basque Country, 2013, http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf