<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTTS-EHU Systems for QUESST at MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis J. Rodriguez-Fuentes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amparo Varona</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikel Penagarikano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germán Bordel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mireia Diez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Technologies Working Group (http://gtts.ehu.es, GTTS), University of the Basque Country (UPV/EHU)</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper briefly describes the systems submitted by the Software Technologies Working Group (http://gtts.ehu.es, GTTS) of the University of the Basque Country (UPV/EHU) to the Query-by-Example Search on Speech Task (QUESST) at MediaEval 2014. The GTTS-EHU systems consist of four modules: (1) feature extraction; (2) speech activity detection; (3) DTW-based query matching; and (4) score calibration and fusion. The submitted systems follow the same approach used in our SWS 2013 submissions, with two minor changes needed to address the new task: the search stops at the most likely query detection (no further detections are looked for), and a score is produced for each (query, document) pair. The two approximate matching types introduced in QUESST have not received special treatment. This year, we have only explored the use of reduced feature sets, obtaining worse results but at lower computational cost.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2014 Query-by-Example Search on Speech
Task (QUESST) consists of searching for a spoken query
within a set of spoken documents. For each (query, document)
pair, a score in the range (-∞, +∞) must be produced:
the higher (the more positive) the score, the more likely it is that
the query appears in the document. System performance is
primarily measured in terms of a normalized cross-entropy
cost, Cnxe. Term-Weighted Value metrics (ATWV/MTWV)
are used as secondary metrics, along with the processing
resources (real-time factor and peak memory usage) required
by the submitted systems. For more details on QUESST,
see [
        <xref ref-type="bibr" rid="ref5">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature extraction</title>
      <p>
        The Brno University of Technology (BUT) phone decoders
for Czech, Hungarian and Russian [
        <xref ref-type="bibr" rid="ref9">6</xref>
        ] are applied to
decode both the spoken queries and the audio documents.
BUT decoders are trained on 8 kHz SpeechDat(E) databases
recorded over xed telephone networks, featuring 45, 61 and
52 units for Czech, Hungarian and Russian, respectively
(three of them being non-phonetic units that stand for short
pauses and noises).
      </p>
      <p>Given an input signal of length T, the decoder outputs
the posterior probability p_{i,s}(t) of each state s (1 ≤ s ≤ S) of each
unit i (1 ≤ i ≤ M) at each frame t (1 ≤ t ≤ T),
where M is the number of units and S the number of states
per unit. The posterior probability of each unit i at each
frame t is computed by adding the posteriors of its states:

  p_i(t) = \sum_{s} p_{i,s}(t)    (1)

Finally, the posteriors of the three non-phonetic units are
added and stored as a single non-speech posterior. Thus, the
size of the frame-level feature vectors is 43, 59 and 50 for the
Czech, Hungarian and Russian BUT decoders, respectively.</p>
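      <p>As an illustration, the following Python sketch implements Eq. (1) and the merging of the non-phonetic units. The array shapes, the placement of the non-phonetic units at the end of the inventory, and the function name are assumptions made here for illustration, not the actual BUT decoder interface.</p>
      <preformat>
# A minimal sketch, assuming state posteriors come as a (T, M, S) array and
# that the three non-phonetic units occupy known indices. Not the BUT API.
import numpy as np

def unit_posteriors(state_post, nonphonetic):
    """state_post: (T, M, S) state posteriors p_{i,s}(t); nonphonetic: indices
    of the non-phonetic units. Returns (T, M - len(nonphonetic) + 1) features:
    phonetic unit posteriors followed by one merged non-speech posterior."""
    unit_post = state_post.sum(axis=2)          # Eq. (1): p_i(t) = sum_s p_{i,s}(t)
    phonetic = [i for i in range(unit_post.shape[1]) if i not in nonphonetic]
    nonspeech = unit_post[:, nonphonetic].sum(axis=1, keepdims=True)
    return np.hstack([unit_post[:, phonetic], nonspeech])

# Example: Czech decoder, 45 units, assuming 3 states per unit and the three
# non-phonetic units last -> 43-dimensional frame-level feature vectors
states = np.random.dirichlet(np.ones(45 * 3), size=100).reshape(100, 45, 3)
feats = unit_posteriors(states, nonphonetic=[42, 43, 44])
assert feats.shape == (100, 43)
      </preformat>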
      <p>2.1.1 Reduced feature sets</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">4</xref>
        ], several dimensionality reduction techniques were
successfully applied to phone posterior features to reduce
the computational cost while keeping performance on spoken
language recognition tasks. Following one of the approaches
proposed in [
        <xref ref-type="bibr" rid="ref7">4</xref>
        ], here we define a reduced set of features
by adding the posteriors of phones with the same manner
and place of articulation. This leads to feature sets of size
25, 23 and 21 for the Czech, Hungarian and Russian BUT
decoders, respectively.
      </p>
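      <p>A hypothetical sketch of this reduction is given below. The phone-to-class grouping shown is invented for illustration; the actual manner/place groupings for the Czech, Hungarian and Russian inventories are not detailed in this paper.</p>
      <preformat>
# A sketch under assumed groupings: posteriors of phones sharing the same
# manner and place of articulation are summed into one broad-class posterior.
import numpy as np

def reduce_features(post, groups):
    """post: (T, D) phone posteriors; groups: a partition of the D phone
    indices into articulatory classes. Returns (T, len(groups)) features."""
    return np.stack([post[:, g].sum(axis=1) for g in groups], axis=1)

# Toy example: 6 phones collapsed into 3 broad classes (invented grouping)
groups = [[0, 1], [2, 3, 4], [5]]
reduced = reduce_features(np.random.dirichlet(np.ones(6), size=50), groups)
assert reduced.shape == (50, 3)
      </preformat>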
    </sec>
    <sec id="sec-4">
      <title>2.2 Speech Activity Detection</title>
      <p>Given an audio signal, Speech Activity Detection (SAD) is
performed by discarding those phone posterior feature
vectors for which the non-speech posterior is the highest. The
remaining vectors, along with their corresponding time
offsets, are stored for further use, but the component
corresponding to the non-speech unit is deleted. If the number
of speech vectors is too low (in this evaluation, fewer than 10,
meaning less than 0.1 seconds), the whole signal is discarded
(thus saving time and possibly avoiding many false alarms)
and a floor score (10^{-5}) is output.</p>
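      <p>The SAD rule just described amounts to a few lines of code. The sketch below assumes the merged non-speech posterior is the last component of each feature vector (as in the sketch of Section 2.1); the function name is ours.</p>
      <preformat>
# A sketch of the SAD rule, assuming non-speech is the last feature component.
import numpy as np

MIN_FRAMES = 10       # 10 frames, i.e. 0.1 seconds at a 10 ms frame shift
FLOOR_SCORE = 1e-5    # score output when a signal is discarded

def sad_filter(feats):
    """feats: (T, D) posteriors with non-speech in column D-1. Returns
    (speech_feats, frame_offsets), or None if the signal must be discarded."""
    speech = feats.argmax(axis=1) != feats.shape[1] - 1   # non-speech not highest
    offsets = np.flatnonzero(speech)                      # time offsets kept
    if offsets.size >= MIN_FRAMES:
        return feats[speech, :-1], offsets                # drop non-speech column
    return None   # too little speech: skip matching, emit FLOOR_SCORE
      </preformat>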
    </sec>
    <sec id="sec-5">
      <title>2.3 DTW-based query matching</title>
      <p>Given two SAD-filtered sequences of feature vectors
corresponding to a spoken query q and a spoken document x, the
cosine distance is computed between each pair of vectors
q[i] and x[j] as follows:

  d(q[i], x[j]) = -\log \left( \frac{q[i] \cdot x[j]}{\lVert q[i] \rVert \, \lVert x[j] \rVert} \right)    (2)

Note that d(v, w) ≥ 0, with d(v, w) = 0 if and only if v
and w are perfectly aligned, and d(v, w) = +∞ if and only
if v and w are orthogonal. The distance matrix computed
according to Eq. 2 is further normalized with regard to the
spoken document x, as follows:

  d_{norm}(q[i], x[j]) = \frac{d(q[i], x[j]) - d_{min}(i)}{d_{max}(i) - d_{min}(i)}    (3)

with d_{min}(i) = \min_j d(q[i], x[j]) and d_{max}(i) = \max_j d(q[i], x[j]).
In this way, all matrix values lie between 0 and 1, so that a
perfect match would produce a quasi-diagonal sequence of zeroes.</p>
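      <p>Eqs. (2) and (3) translate directly into vectorized code, as in the following sketch (the small constants guarding log(0) and division by zero are implementation details added here, not taken from the paper).</p>
      <preformat>
# A sketch of Eqs. (2)-(3): -log cosine distance between all frame pairs,
# then per-query-frame min-max normalization so every row lies in [0, 1].
import numpy as np

def norm_distance_matrix(q, x):
    """q: (m, D) query features; x: (n, D) document features -> (m, n) d_norm."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = np.clip(qn @ xn.T, 1e-12, 1.0)        # posteriors are non-negative
    d = -np.log(cos)                            # Eq. (2): 0 aligned, +inf orthogonal
    dmin = d.min(axis=1, keepdims=True)         # d_min(i)
    dmax = d.max(axis=1, keepdims=True)         # d_max(i)
    return (d - dmin) / (dmax - dmin + 1e-12)   # Eq. (3)
      </preformat>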
      <p>The best match of a query q of length m in a spoken
document x of length n is defined as the one minimizing the
average distance along a crossing path of the matrix d_{norm}. A
crossing path starts at any given frame of x, k_1 ∈ [1, n],
then traverses a region of x which is optimally aligned to
q (involving L vector alignments), and ends at a frame k_2 ∈
[k_1, n]. The average distance along this crossing path is:

  d_{avg}(q, x) = \frac{1}{L} \sum_{l=1}^{L} d_{norm}(q[i_l], x[j_l])    (4)

where i_l and j_l are the indices of the vectors of q and x
in alignment l, for l = 1, 2, ..., L. Note that i_1 = 1,
i_L = m, j_1 = k_1 and j_L = k_2. The minimization is
accomplished by means of a dynamic programming
procedure, which is Θ(n · m · d) in time (d: size of the feature
vectors) and Θ(n · m) in space. The detection score is
computed as 1 - d_{avg}(q, x). Once the best match is obtained,
the search procedure stops. As noted above, if either q or x
does not contain enough speech frames, no alignment is performed
and a floor score (10^{-5}) is output. Note that a detection
score must be produced for every pair (q, x).</p>
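      <p>Minimizing the average distance of Eq. (4) exactly is awkward in plain dynamic programming, since the path length L is itself path-dependent. The sketch below uses a common greedy approximation (at each cell, pick the predecessor minimizing the running average); it illustrates the crossing-path idea with free start and end points on the document axis, and is not necessarily the exact procedure used in the submitted systems.</p>
      <preformat>
# A simplified sketch of the crossing-path search over d_norm, with steps
# (1,1), (1,0) and (0,1), free start (row 0) and free end (last row).
import numpy as np

def best_match_score(dnorm):
    """dnorm: (m, n) normalized distances -> detection score 1 - d_avg(q, x)."""
    m, n = dnorm.shape
    cost = np.zeros((m, n))                  # accumulated path cost
    length = np.zeros((m, n), dtype=int)     # accumulated path length L
    cost[0], length[0] = dnorm[0], 1         # path may start at any frame of x
    for i in range(1, m):
        cost[i, 0] = cost[i - 1, 0] + dnorm[i, 0]
        length[i, 0] = length[i - 1, 0] + 1
        for j in range(1, n):
            cands = ((cost[i - 1, j - 1], length[i - 1, j - 1]),   # diagonal
                     (cost[i - 1, j], length[i - 1, j]),           # vertical
                     (cost[i, j - 1], length[i, j - 1]))           # horizontal
            c, l = min(cands, key=lambda cl: (cl[0] + dnorm[i, j]) / (cl[1] + 1))
            cost[i, j], length[i, j] = c + dnorm[i, j], l + 1
    d_avg = (cost[-1] / length[-1]).min()    # path may end at any frame of x
    return 1.0 - d_avg                       # detection score, as in the text
      </preformat>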
    </sec>
    <sec id="sec-6">
      <title>2.4 Score calibration and fusion</title>
      <p>
        First, the so-called q-norm (query normalization) is
applied, so that zero-mean, unit-variance scores are
obtained per query [
        <xref ref-type="bibr" rid="ref4">1</xref>
        ]. Then, if n different systems are fused, since all of them
contain a complete set of scores, each trial yields a set of n
scores which, together with the ground truth (target/non-target
labels), can be used to discriminatively estimate a linear
transformation producing well-calibrated scores; these are then
linearly combined to obtain fused scores. Under this approach,
the Bayes optimal threshold (given by the effective prior:
0.0741 for this evaluation) is applied. The BOSARIS toolkit [
        <xref ref-type="bibr" rid="ref6">3</xref>
        ] is used to estimate and apply the calibration/fusion models.
      </p>
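      <p>The q-norm step is straightforward; a sketch is shown below. The discriminative calibration and fusion themselves rely on the BOSARIS toolkit and are not reproduced here.</p>
      <preformat>
# A sketch of q-norm: per-query zero-mean, unit-variance score normalization.
import numpy as np

def q_norm(scores):
    """scores: (n_queries, n_documents) raw detection scores, one row per query."""
    mu = scores.mean(axis=1, keepdims=True)
    sigma = scores.std(axis=1, keepdims=True)
    return (scores - mu) / (sigma + 1e-12)    # guard against constant rows
      </preformat>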
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS</title>
      <p>Table 1 shows the performance and processing costs of
the GTTS-EHU systems on QUESST 2014. To speed up
computations, experiments with the full and reduced sets of
features were carried out on different machines (see Table 1),
so the reported processing times are not directly comparable.</p>
      <p>Indexing involves just applying the BUT decoders to extract
phone posterior features. The Indexing Speed Factor (ISF), the
Searching Speed Factor (SSF) and the Peak Memory Usage (PMU)
values have been computed as if all the computation were performed
sequentially on a single processor (see [
        <xref ref-type="bibr" rid="ref8">5</xref>
        ]). Calibration and fusion costs have been neglected. The
contrastive systems 1 and 2 (c1 and c2) use the concatenation of
the phone posteriors from the three decoders as features, for the
full and reduced feature sets, respectively.</p>
      <p>The system c3 is the fusion of four subsystems, using the full
sets of phone posteriors for Czech, Hungarian and Russian, and the
concatenation of them, respectively. The system c4 is equivalent to
c3 but uses the reduced sets of features. Finally, the primary
system is the fusion of the eight available subsystems. In all
cases, calibration and fusion parameters have been estimated on the
development set. Note that the primary system yields only slightly
better performance than system c3, meaning that the reduced sets of
features add little information to the full sets. In fact, the full
sets outperform the reduced sets in all cases. As shown in Table 2,
performance degrades strongly from T1 to T2 and (not as much) from
T2 to T3; on the other hand, the non-native English and (to a lesser
extent) the Basque subsets seem problematic. Future work may involve
some kind of language detection and adaptation, plus specific
strategies for matching types T2 and T3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref4">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. Rodriguez</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>On the calibration and fusion of heterogeneous spoken term detection systems</article-title>
          .
          <source>In Interspeech</source>
          <year>2013</year>
          , Lyon, France,
          <year>August</year>
          25-29
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Szoke, A</article-title>
          . Buzo, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          .
          <source>In Working Notes Proceedings of the Mediaeval 2014 Workshop</source>
          , Barcelona, Spain, October
          <volume>16</volume>
          -17
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bru</surname>
          </string-name>
          <article-title>mmer</article-title>
          and E. de Villiers.
          <article-title>The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classi er Score Processing</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2011</year>
          . https://sites.google.com/site/bosaristoolkit/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. Rodriguez</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          .
          <article-title>Dimensionality reduction of phone log-likelihood ratio features for spoken language recognition</article-title>
          .
          <source>In Interspeech</source>
          <year>2013</year>
          , Lyon, France,
          <year>August</year>
          25-29
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano. MediaEval 2013 Spoken Web</surname>
          </string-name>
          <article-title>Search Task: System Performance Measures</article-title>
          .
          <source>Technical report</source>
          , GTTS, UPV/EHU, May
          <year>2013</year>
          . http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Phoneme recognition based on long temporal context</article-title>
          .
          <source>PhD thesis</source>
          , FIT,
          <string-name>
            <surname>BUT</surname>
          </string-name>
          , Brno, Czech Republic,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>