LIA @ MediaEval 2013 Spoken Web Search Task: An I-Vector based Approach

Mohamed Bouallegue, Grégory Senay, Mohamed Morchid, Driss Matrouf, Georges Linarès and Richard Dufour
LIA - University of Avignon (France)
{firstname.lastname}@univ-avignon.fr

ABSTRACT

In this paper, we describe the LIA system proposed for the MediaEval 2013 Spoken Web Search task. This multi-language task involves searching for an audio content query in a database, with no training resources available. Participants must find the locations of each given query term within a large database of untranscribed audio files. For this task, we propose to build a language-independent audio search system using an i-vector based approach [2].

1. INTRODUCTION

The Spoken Web Search (SWS) task is characterized by two major difficulties. Firstly, the reference set is composed of audio files coming from different languages, accents and acoustic conditions. Secondly, no transcription or language resources are provided. Systems should therefore be built as generically as possible to succeed in finding queries appearing in these multiple condition sources.

In this work, a language-independent audio search system based on an i-vector approach is proposed. Inspired by the success of i-vectors in speaker recognition [2], we apply the same idea to this audio search task. To identify the locations of each query term within the audio files, our idea is to model each file and each query by a set of i-vectors and then align them.

2. PROPOSED APPROACH

Initially introduced for speaker recognition, i-vectors [2] have become very popular in the field of speech processing, and recent publications show that they are also reliable for language recognition [5] and speaker diarization [3]. I-vectors are an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The technique was originally inspired by the Joint Factor Analysis framework [4]. Hence, i-vectors convey the speaker characteristics among other information such as the transmission channel, the acoustic environment or the phonetic content of the speech segment.

2.1 I-vector extraction

The i-vector extraction can be seen as a probabilistic compression process that reduces the dimensionality of speech super-vectors according to a linear-Gaussian model. The super-vector m_s of a given speech recording, built by concatenating the Gaussian Mixture Model (GMM) means, is projected into a low-dimensional space, named the Total Variability space:

    m_s = m + T x_s    (1)

where m is the mean super-vector of the Universal Background Model (UBM), a GMM that represents all the possible observations, sometimes also called the world model. T is a low-rank matrix (MD × R), where M is the number of Gaussians in the UBM and D is the cepstral feature size (39 in our case), which represents a basis of the reduced total variability space. T is named the Total Variability matrix; the components of x_s are the total factors, and they represent the coordinates of the speech recording in the reduced total variability space.

The proposed approach uses i-vectors to model speech segments. These short segments are considered as a basic language unit. Indeed, each file and each query is segmented into short segments of 20 frames. In our model, the segment super-vector m_seg is modeled as follows:

    m_seg = m + T x_seg    (2)
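As an illustration of this linear-Gaussian model, the following minimal sketch (ours, not part of the original system) generates a segment super-vector according to Equation (2) with toy random data and recovers the total factors by a least-squares inversion. A real i-vector extractor instead estimates x as a posterior mean from zeroth- and first-order Baum-Welch statistics [2]; all names and dimensions below are illustrative.

import numpy as np

M, D, R = 512, 39, 20              # Gaussians, cepstral size, i-vector size
rng = np.random.default_rng(0)

m = rng.normal(size=M * D)         # UBM mean super-vector (MD = 19,968)
T = rng.normal(size=(M * D, R))    # Total Variability matrix (MD x R)

# A segment super-vector generated by the model: m_seg = m + T x  (Eq. 2)
x_true = rng.normal(size=R)
m_seg = m + T @ x_true

# Recover the total factors: coordinates of the segment in the reduced space
x_hat, *_ = np.linalg.lstsq(T, m_seg - m, rcond=None)
print(np.allclose(x_hat, x_true))  # True: the 20-dim i-vector is recovered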
2.2 System overview

In this section, the different steps used to build the proposed language-independent audio search system are detailed.

Step 0: Parametrization

In this first step, the MFCCs (39-dimensional feature vectors) are computed for all the database audio files. Each vector represents 30 ms of Hamming-windowed speech signal (the window is shifted every 10 ms).

Step 1: Segmentation of database files

This step consists of segmenting the database files into short segments of 20 frames. A sliding window of 200 ms with an offset of 100 ms is used in order to avoid information loss. 713,315 short segments are obtained after the segmentation of the 10,762 audio files. The same procedure is applied to the sets of queries (development and evaluation). The 505 evaluation queries are segmented into 6,602 short segments (nearly the same for the development queries).

Step 2: Estimation of the matrix T and the i-vectors of database files

The i-vectors of each of the 713,315 segments of the database files (see Step 1) are estimated based on Equation (2). An i-vector x_seg of size 20 is then obtained for each segment. The UBM used (512 Gaussians) is estimated on all the database files. The Total Variability matrix T (dimension 19,968 × 20) is estimated using all segments of the audio files.

Step 3: Estimation of the i-vectors of queries

In this step, the i-vectors for the queries (development and evaluation) are estimated using Equation (2). We use the same UBM and Total Variability matrix T obtained in the previous step. Finally, 6,602 i-vectors (size 20) are obtained for the evaluation queries (nearly the same for the development queries).

Step 4: Alignment

In order to identify the locations of each query term within the database audio files, an alignment is performed between the i-vectors of each query and the i-vectors of all database audio files, using an adapted Dynamic Time Warping (DTW) algorithm. The dissimilarity between two i-vectors is computed with the Mahalanobis distance (see http://classifion.sicyon.com/References/M_distance.pdf). The matrix used in the Mahalanobis distance is the total covariance matrix, estimated on all the i-vectors of the database audio files.

In order to find the query start, the alignment can start at any point of the audio file, but it has to last at most two times the size of the query. The start and end times of a matching query are given by the start time of the first i-vector and the end time of the last i-vector of the alignment, respectively. The query score is the cost of the best path (more precisely, minus the cost). For each database file, the system searches for the best costs of all queries, although certain files do not contain any query. Only the n-best alignments for a document are kept (2, 3, 4, 6 or 8).
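The sketch below (again ours, with hypothetical names and toy data) approximates this alignment step: a subsequence DTW with free start and end points in the file, over a Mahalanobis distance matrix. For brevity it omits the paper's constraint that a path lasts at most twice the query length, and it estimates the covariance from one file's i-vectors rather than from the whole database.

import numpy as np

def mahalanobis_matrix(Q, F, cov):
    """Pairwise Mahalanobis distances between query and file i-vectors."""
    inv = np.linalg.inv(cov)
    diff = Q[:, None, :] - F[None, :, :]          # shape (n_q, n_f, R)
    return np.sqrt(np.einsum('qfr,rs,qfs->qf', diff, inv, diff))

def subsequence_dtw(dist):
    """DTW where the match may start and end at any file position."""
    n_q, n_f = dist.shape
    acc = np.full((n_q, n_f), np.inf)
    acc[0, :] = dist[0, :]                        # free start point
    for i in range(1, n_q):
        for j in range(n_f):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1]))                 # free end point
    return -float(acc[-1, end]), end              # score = minus the cost

# Toy usage: 20-dim i-vectors from 200 ms segments shifted by 100 ms,
# so segment k spans [0.1 * k, 0.1 * k + 0.2] seconds.
rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 20))                      # query i-vectors
F = rng.normal(size=(120, 20))                    # one database file
cov = np.cov(F, rowvar=False)                     # stand-in covariance
score, end_seg = subsequence_dtw(mahalanobis_matrix(Q, F, cov))
print(score, 0.1 * end_seg + 0.2)                 # score and end time (s)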
3. EXPERIMENTS

The proposed system is evaluated in the MediaEval 2013 SWS benchmark [1]. The number of queries is 500 and the number of audio files (dataset) is 10,000 (around 20 hours). The sets of audio files include many languages: non-native English, Albanian, Czech, Basque, Romanian and 4 African languages. The main evaluation metric used is the Actual Term Weighted Value (ATWV) [1]. Table 1 presents the results obtained on the development and the evaluation data in terms of ATWV and Cnxe scores. In the Primary system, the number of occurrences of each query in the audio files has been fixed to 2 (i.e. each query is detected twice in the database). In the four contrastive systems, we increased the number of occurrences to 3, 4, 6 and 8. The best results are obtained with the primary system, whether on the development or the evaluation data. While the system applied on the development data reached a better performance than the baseline system provided by the organizers, the results are surprisingly low on the evaluation data.

Table 1: Results obtained on the dev and the evaluation data in terms of ATWV and Cnxe scores.

                         Dev                 Evaluation
                    ATWV     Cnxe        ATWV      Cnxe
Primary (2)         0.0045   7.22683     -0.0013   93.2051
Contrastive 1 (3)   0.0040   7.23728     -0.0021   93.2154
Contrastive 2 (4)   0.0040   7.24818     -0.0029   93.2255
Contrastive 3 (6)   0.0029   7.26928     -0.0043   93.2461
Contrastive 4 (8)   0.0014   7.29037     -0.0055   93.2673

The indexing and searching modules have been run on a 48-core cluster (Intel Xeon processor, 2.6 GHz). The memory peak reached 1.2 GB. The real-time ratio of the searching module has been computed as (time of steps 1 + 3 + 4) / (total duration of audio files × total duration of queries) = 6,600 / (71,839 × 696) = 0.000132.

4. CONCLUSIONS

In this paper, we proposed a language-independent audio search system based on an i-vector approach. Although the results on the evaluation queries are poor, the encouraging results obtained on the development data show that i-vectors are an interesting and original unsupervised way to search audio content using an audio content query. In the future, we plan to investigate in detail the performance mismatch between the development and the evaluation data. We will also explore the use of Voice Activity Detection (VAD): it could help to discard the silence sections that contaminate the audio queries.

5. ACKNOWLEDGMENTS

This work was funded by the ContNomina project supported by the French National Research Agency (ANR) under contract ANR-12-BS02-0009.

6. REFERENCES

[1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The spoken web search task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE TASLP, 2009.
[3] J. Franco-Pedroso, I. Lopez-Moreno, and D. T. Toledano. ATVS-UAM system description for the audio segmentation and speaker diarization Albayzin 2010 evaluation. In SLTech Workshop, 2010.
[4] P. Kenny. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[5] D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka. Language recognition in i-vectors space. In Interspeech, 2011.