LIA @ MediaEval 2013 Spoken Web Search Task: An I-Vector based Approach

Mohamed Bouallegue, Grégory Senay, Mohamed Morchid, Driss Matrouf, Georges Linarès and Richard Dufour
LIA - University of Avignon (France)
{firstname.lastname}@univ-avignon.fr

ABSTRACT

In this paper, we describe the LIA system proposed for the MediaEval 2013 Spoken Web Search task. This multi-language task involves searching for an audio content query in a database, with no training resources available. Participants must find the locations of each given query term within a large database of untranscribed audio files. For this task, we propose to build a language-independent audio search system using an i-vector based approach [2].

1. INTRODUCTION

The Spoken Web Search (SWS) task is characterized by two major difficulties. Firstly, the reference set is composed of audio files coming from different languages, accents and acoustic conditions. Secondly, no transcription or language resources are provided. Systems should therefore be built as generically as possible to succeed in finding queries appearing in these multiple condition sources.

In this work, a language-independent audio search system based on an i-vector approach is proposed. Inspired by the success of i-vectors in speaker recognition [2], we apply the same idea to this audio search task. To identify the locations of each query term within the audio files, our idea is to model each file and each query by a set of i-vectors and then align them.

2. PROPOSED APPROACH

Initially introduced for speaker recognition, i-vectors [2] have become very popular in the field of speech processing, and recent publications show that they are also reliable for language recognition [5] and speaker diarization [3]. I-vectors are an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The technique was originally inspired by the Joint Factor Analysis framework [4]. Hence, i-vectors convey the speaker characteristics among other information such as the transmission channel, the acoustic environment or the phonetic content of the speech segment.

2.1 I-vector extraction

The i-vector extraction can be seen as a probabilistic compression process that reduces the dimensionality of speech super-vectors according to a linear-Gaussian model. The super-vector m_s of a given speech recording, built by concatenating the Gaussian Mixture Model (GMM) means, is projected into a low-dimensional space, named the Total Variability space:

    m_s = m + T x_s    (1)

where m is the mean super-vector of the Universal Background Model (UBM), a GMM that represents all the possible observations, sometimes also called the world model. T is a low-rank matrix (MD × R), where M is the number of Gaussians in the UBM and D is the cepstral feature size (39 in our case), which represents a basis of the reduced total variability space. T is named the Total Variability matrix; the components of x_s are the total factors, and they represent the coordinates of the speech recording in the reduced total variability space.

The proposed approach uses i-vectors to model speech segments. These short segments are considered as a basic language unit. Indeed, each file and each query is segmented into short segments of 20 frames. In our model, the segment super-vector m_seg is modeled as follows:

    m_seg = m + T x_seg    (2)
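As an illustration of this linear-Gaussian model, the following minimal sketch (ours, not part of the original system) generates a segment super-vector according to Equation (2) with toy random data and recovers the total factors by a least-squares inversion. A real i-vector extractor instead estimates x as a posterior mean from zeroth- and first-order Baum-Welch statistics [2]; all names and dimensions below are illustrative.

import numpy as np

M, D, R = 512, 39, 20              # Gaussians, cepstral size, i-vector size
rng = np.random.default_rng(0)

m = rng.normal(size=M * D)         # UBM mean super-vector (MD = 19,968)
T = rng.normal(size=(M * D, R))    # Total Variability matrix (MD x R)

# A segment super-vector generated by the model: m_seg = m + T x  (Eq. 2)
x_true = rng.normal(size=R)
m_seg = m + T @ x_true

# Recover the total factors: coordinates of the segment in the reduced space
x_hat, *_ = np.linalg.lstsq(T, m_seg - m, rcond=None)
print(np.allclose(x_hat, x_true))  # True: the 20-dim i-vector is recovered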
2.2 System overview

In this section, the different steps used to build the proposed language-independent audio search system are detailed.

Step 0: Parametrization

In this first step, the MFCCs (39-dimensional feature vectors) are computed for all the database audio files. Each vector represents 30 ms of Hamming-windowed speech signal (the window is shifted every 10 ms).

Step 1: Segmentation of database files

This step consists of segmenting the database files into short segments of 20 frames. A sliding window of 200 ms with an offset of 100 ms is used in order to avoid information loss. 713,315 short segments are obtained after the segmentation of the 10,762 audio files. The same procedure is applied to the sets of queries (development and evaluation). The 505 evaluation queries are segmented into 6,602 short segments (nearly the same for the development queries).

Step 2: Estimation of the matrix T and the i-vectors of database files

The i-vectors of each of the 713,315 segments of the database files (see Step 1) are estimated based on Equation (2). An i-vector x_seg of size 20 is then obtained for each segment. The UBM used (512 Gaussians) is estimated on all the database files. The Total Variability matrix T (dimension 19,968 × 20) is estimated using all segments of the audio files.

Step 3: Estimation of the i-vectors of queries

In this step, the i-vectors for the queries (development and evaluation) are estimated using Equation (2). We use the same UBM and Total Variability matrix T obtained in the previous step. Finally, 6,602 i-vectors (size 20) are obtained for the evaluation queries (nearly the same for the development queries).

Step 4: Alignment

In order to identify the locations of each query term within the database audio files, an alignment is performed between the i-vectors of each query and the i-vectors of all database audio files, using an adapted Dynamic Time Warping (DTW) algorithm. The dissimilarity between two i-vectors is computed with the Mahalanobis distance (see http://classifion.sicyon.com/References/M_distance.pdf). The matrix used in the Mahalanobis distance is the total covariance matrix, estimated on all the i-vectors of the database audio files.

In order to find the query start, the alignment can start at any point of the audio file, but it has to last at most two times the size of the query. The start and end times of a matching query are given by the start time of the first i-vector and the end time of the last i-vector of the alignment, respectively. The query score is the cost of the best path (more precisely, minus the cost). For each database file, the system searches for the best costs of all queries, although certain files do not contain any query. Only the n-best alignments for a document are kept (2, 3, 4, 6 or 8).
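The sketch below (again ours, with hypothetical names and toy data) approximates this alignment step: a subsequence DTW with free start and end points in the file, over a Mahalanobis distance matrix. For brevity it omits the paper's constraint that a path lasts at most twice the query length, and it estimates the covariance from one file's i-vectors rather than from the whole database.

import numpy as np

def mahalanobis_matrix(Q, F, cov):
    """Pairwise Mahalanobis distances between query and file i-vectors."""
    inv = np.linalg.inv(cov)
    diff = Q[:, None, :] - F[None, :, :]          # shape (n_q, n_f, R)
    return np.sqrt(np.einsum('qfr,rs,qfs->qf', diff, inv, diff))

def subsequence_dtw(dist):
    """DTW where the match may start and end at any file position."""
    n_q, n_f = dist.shape
    acc = np.full((n_q, n_f), np.inf)
    acc[0, :] = dist[0, :]                        # free start point
    for i in range(1, n_q):
        for j in range(n_f):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1]))                 # free end point
    return -float(acc[-1, end]), end              # score = minus the cost

# Toy usage: 20-dim i-vectors from 200 ms segments shifted by 100 ms,
# so segment k spans [0.1 * k, 0.1 * k + 0.2] seconds.
rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 20))                      # query i-vectors
F = rng.normal(size=(120, 20))                    # one database file
cov = np.cov(F, rowvar=False)                     # stand-in covariance
score, end_seg = subsequence_dtw(mahalanobis_matrix(Q, F, cov))
print(score, 0.1 * end_seg + 0.2)                 # score and end time (s)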
3. EXPERIMENTS

The proposed system is evaluated in the MediaEval 2013 SWS benchmark [1]. The number of queries is 500 and the number of audio files (dataset) is 10,000 (around 20 hours). The sets of audio files include many languages: non-native English, Albanian, Czech, Basque, Romanian and 4 African languages. The main evaluation metric used is the Actual Term Weighted Value (ATWV) [1]. Table 1 presents the results obtained on the development and the evaluation data in terms of ATWV and Cnxe scores. In the Primary system, the number of occurrences of each query in the audio files has been fixed to 2 (i.e. each query is detected twice in the database). In the four contrastive systems, we increased the number of occurrences to 3, 4, 6 and 8. The best results are obtained with the primary system, whether on the development or the evaluation data. While the system applied on the development data reached a better performance than the baseline system provided by the organizers, the results are surprisingly low on the evaluation data.

Table 1: Results obtained on the dev and the evaluation data in terms of ATWV and Cnxe scores.

                         Dev                 Evaluation
                    ATWV     Cnxe        ATWV      Cnxe
Primary (2)         0.0045   7.22683     -0.0013   93.2051
Contrastive 1 (3)   0.0040   7.23728     -0.0021   93.2154
Contrastive 2 (4)   0.0040   7.24818     -0.0029   93.2255
Contrastive 3 (6)   0.0029   7.26928     -0.0043   93.2461
Contrastive 4 (8)   0.0014   7.29037     -0.0055   93.2673

The indexing and searching modules have been run on a 48-core cluster (Intel Xeon processor, 2.6 GHz). The memory peak reached 1.2 GB. The real-time ratio of the searching module has been computed as (time of steps 1 + 3 + 4) / (total duration of audio files × total duration of queries) = 6,600 / (71,839 × 696) = 0.000132.

4. CONCLUSIONS

In this paper, we proposed a language-independent audio search system based on an i-vector approach. Although the results on the evaluation queries are poor, the encouraging results obtained on the development data show that i-vectors are an interesting and original unsupervised way to search audio content using an audio content query. In the future, we plan to investigate in detail the performance mismatch between the development and the evaluation data. We will also explore the use of Voice Activity Detection (VAD): it could help to discard the silence sections that contaminate the audio queries.

5. ACKNOWLEDGMENTS

This work was funded by the ContNomina project supported by the French National Research Agency (ANR) under contract ANR-12-BS02-0009.

6. REFERENCES

[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The spoken web search task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE TASLP, 2009.
[3] J. Franco-Pedroso, I. Lopez-Moreno, and D. T. Toledano. ATVS-UAM system description for the audio segmentation and speaker diarization Albayzin 2010 evaluation. In SLTech Workshop, 2010.
[4] P. Kenny. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[5] D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka. Language recognition in i-vectors space. In Interspeech, 2011.