<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jozef Vavrek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Viszlay</string-name>
          <email>Peter.Viszlay@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Lojka</string-name>
          <email>Martin.Lojka@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matúš Pleva</string-name>
          <email>Matus.Pleva@tuke.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jozef Juhár</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Rusko</string-name>
          <email>Milan.Rusko@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Informatics, Slovak Academy of Sciences</institution>
          ,
          <addr-line>Dúbravská cesta 9, 845 07 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory of Speech Technologies in Telecommunications, Technical University of Košice</institution>
          ,
          <addr-line>Park Komenského 13, 041 20 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present our retrieval system for the QUery by Example Search on Speech Task (QUESST), comprising a posteriorgram-based modeling approach along with the weighted fast sequential dynamic time warping algorithm (WFS-DTW). This year, our main effort was directed toward developing a language-dependent keyword matching system that utilizes all available information about the spoken languages, considering all queries and utterance files. Although the retrieval algorithm is the same as the one we used in the previous year, the main novelty lies in the way the information about all languages spoken in the search database is utilized. Two low-resource systems using language-dependent acoustic unit modeling (AUM) approaches were submitted. The first one, called supervised, employs four well-trained phonetic decoders using acoustic models trained on time-aligned and annotated speech. The second one, defined as unsupervised, uses blind phonetic segmentation for the specific language, where the information about the spoken language is extracted from the MediaEval 2013 and MediaEval 2014 databases. Considering the influence on the overall retrieval performance, the acoustic model adaptation to the specific language through a retraining procedure was investigated for both approaches as well.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>Challenging acoustic conditions and different types of
queries led us to explore the area of language adaptation in
query-by-example (QbE) retrieval. Therefore, our
intention was to build a QbE retrieval system using all the
available acoustic models trained solely on the languages present
in the provided database.</p>
    </sec>
    <sec id="sec-2">
      <title>SUPERVISED AUM APPROACH</title>
      <p>The low-resource approach allowed us to use external
resources (not related to the QUESST task) for AUM and for
building acoustic models (AMs) for the target languages in the
provided database. We developed four language-dependent
(LD) speech recognition systems, each represented by a
specific LD phonetic decoder and by an external well-trained
LD phoneme-based GMM (Gaussian mixture model) trained
with the corresponding phone-level transcription.</p>
      <p>
        Four monolingual annotated datasets were used for
acoustic model training: Slovak Speechdat (66 hours of read speech,
54 phonemes) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Czech Speechdat (89 hours of read speech,
42 phonemes) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Romanian anonymous speech corpus<sup>1</sup> (4.6
hours of read speech, 28 phonemes) and Portuguese (3 hours
of BN recordings from COST278 DB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], 1 hour of Laps
Benchmark corpus from the Fala Brasil project<sup>2</sup>, 34 phonemes).
      </p>
      <p>The phonetic decoders and the LD AMs were intended to
perform phonetic transcription and time alignment of the search
data utilizing the Viterbi algorithm. Each decoder employed
a phone-level vocabulary and a phone network. The
time-aligned utterances were used in the supervised training of the
final multilingual GMM. In the presented work, we exploited
two different ways of building multilingual GMMs.</p>
      <p>
        The first one is oriented toward training a new GMM from
scratch using the utterances and the time alignments needed
to initialize the GMM. The initialized GMMs were then
simultaneously updated and expanded to higher numbers of
mixtures (up to 1024) using the Baum-Welch estimation
procedure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this case, the external AM operates as an initial
AM needed to bootstrap the recognition system, which is
then supposed to proceed without any external input.
      </p>
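      <p>As a concrete illustration of this mixture-growing scheme, the following sketch doubles the number of GMM components up to a given maximum and re-estimates the model after each split. It is a minimal sketch assuming Python with scikit-learn, whose EM-based GaussianMixture stands in for the HTK-style Baum-Welch re-estimation; the function name and the mean-perturbation split are illustrative assumptions, not the authors' exact tooling.</p>
      <preformat>
# Minimal sketch: grow a GMM from 1 to max_mixtures components by
# splitting means and re-estimating with EM (a stand-in for Baum-Welch).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_growing_gmm(features, max_mixtures=1024):
    """features: (n_frames, 39) MFCC matrix from time-aligned utterances."""
    n_mix = 1
    gmm = GaussianMixture(n_components=n_mix, covariance_type='diag')
    gmm.fit(features)
    while max_mixtures >= n_mix * 2:
        n_mix *= 2
        # Split every component by perturbing its mean slightly,
        # then run EM again with the doubled component set as init.
        means = np.concatenate([gmm.means_ * 0.999, gmm.means_ * 1.001])
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              means_init=means)
        gmm.fit(features)
    return gmm
      </preformat>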
      <p>The second, improved training scheme is related to AM
retraining<sup>3</sup>. The main idea is to iteratively re-estimate the
acoustic likelihoods of the well-trained AM, using the
utterances and the time alignments described above. We always
performed three re-estimation cycles in the retraining to
achieve convergence of the estimation. We found that
retraining yields higher precision than standard
training from scratch. The newly prepared language-dependent
GMMs were used to generate posteriorgrams, which were
finally fed to the DTW-based search. The low-resource
acoustic unit modeling is conceptually illustrated in Fig. 1. In
the whole experimental setup, we used the standard 39-dim.
MFCCs (Mel-Frequency Cepstral Coefficients).</p>
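      <p>For clarity, the posteriorgram generation step can be sketched as follows: each 39-dimensional MFCC frame is mapped to a vector of posterior probabilities over the GMM components, and the frame-wise vectors are stacked into a matrix that is fed to the DTW-based search. The snippet assumes Python with a scikit-learn GaussianMixture such as the one trained above; all variable names are illustrative.</p>
      <preformat>
# Minimal sketch: turn a sequence of MFCC frames into a posteriorgram.
def posteriorgram(gmm, mfcc_frames):
    """Return an (n_frames, n_components) matrix of component posteriors."""
    # predict_proba yields P(component | frame) under the trained GMM.
    return gmm.predict_proba(mfcc_frames)

# query_post = posteriorgram(gmm, query_mfcc)  # compared against
# utt_post   = posteriorgram(gmm, utt_mfcc)    # the utterance via WFS-DTW
      </preformat>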
      <sec id="sec-2-1">
        <title>1http://rasc.racai.ro 2http://www.laps.ufpa.br/falabrasil/ 3characteristic for late submission</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>UNSUPERVISED AUM APPROACH</title>
      <p>The multilinguality problem and the missing knowledge about
acoustic units led us to employ two different unsupervised
acoustic modeling approaches. For both types, we extracted
additional acoustic information about the six spoken
languages from the MediaEval 2013 and MediaEval 2014 databases
according to the available language tags.</p>
      <p>
        The first type is focused on unsupervised building of an
acoustic model from unlabelled speech data. We re-employed our
well-established procedures from the previous year and
built an acoustic model with up to 1024 mixtures for each
language. This process included PCA-based voice activity
detection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], feature extraction and selection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], K-means
clustering (K = 50), Euclidean segmentation and GMM
training [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This concept is depicted in Fig. 2
with a dashed line. Each LD AM was intended to generate
LD posteriorgrams, while the scores from all
subsystems were finally fused together.
      </p>
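      <p>The unsupervised pipeline can be summarized in a short sketch: K-means (K = 50) assigns pseudo acoustic-unit labels to frames, contiguous runs of one label form segments, and a language-dependent GMM is trained on the resulting data. This is a simplified sketch in Python with scikit-learn; the VAD and feature-selection stages are omitted, and the run-length grouping below is an illustrative stand-in for the exact Euclidean segmentation, not the authors' implementation.</p>
      <preformat>
# Minimal sketch of the unsupervised AUM pipeline (per language).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def unsupervised_am(features, n_units=50, n_mixtures=64):
    # 1) Cluster frames into K pseudo acoustic units.
    labels = KMeans(n_clusters=n_units, n_init=10).fit_predict(features)
    # 2) Merge consecutive frames sharing a label into segments.
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    segments = np.split(features, boundaries)
    # 3) Train the language-dependent GMM used to emit posteriorgrams.
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
    gmm.fit(features)
    return labels, segments, gmm
      </preformat>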
      <p>The second, advanced acoustic modeling technique used
language adaptation<sup>4</sup> in the acoustic (phonetic) sense, performed
through a retraining procedure similar to that used in
low-resource modeling. The main idea is to use the already
prepared LD AMs and feed them to phonetic decoding of the search
data in order to obtain LD time alignments. It was therefore
necessary to build an initial multilingual AM intended
for adaptation, utilizing the same, already mentioned
unsupervised segmentation and GMM training. The multilingual
AM is then iteratively retrained on the LD time-aligned
utterances using the search data (Fig. 2). Compared to the
low-resource retraining, here we retrained the multilingual
GMM instead of the language-specific GMM. The resulting
six language-adapted GMMs match the probability
distributions of the acoustic units of the specific language
as closely as possible.</p>
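      <p>The retraining idea admits a compact sketch: the multilingual GMM is re-estimated on one language's time-aligned search data, warm-starting each cycle from the previous estimate. The snippet below assumes Python with scikit-learn, where warm_start approximates the iterative re-estimation; three cycles follow the convergence observation made for the low-resource retraining, and all names are illustrative.</p>
      <preformat>
# Minimal sketch: adapt a multilingual GMM to one language by iterative
# re-estimation on that language's time-aligned frames.
from sklearn.mixture import GaussianMixture

def retrain(initial_gmm, ld_frames, cycles=3):
    gmm = GaussianMixture(n_components=initial_gmm.n_components,
                          covariance_type='diag', warm_start=True,
                          max_iter=20, means_init=initial_gmm.means_)
    for _ in range(cycles):
        # warm_start=True makes each fit() resume from the previous
        # parameters instead of re-initializing them.
        gmm.fit(ld_frames)
    return gmm
      </preformat>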
    </sec>
    <sec id="sec-4">
      <title>4. POST-PROCESSING: SCORE NORMAL</title>
    </sec>
    <sec id="sec-5">
      <title>IZATION AND FUSION</title>
      <p>
        The average cumulative distance (ACD) parameter,
represented by the mean value of the cumulative-distance matrix elements
within each warping region and multiplied by a factor of 0.1, was
used as the score parameter. Scaling the ACD parameter into
the range 0-1 helped us to unify the score ranges for the first
500 detection candidates per query. Then, score
fusion over the different subsystems was carried out, employing
a max-score merging strategy and z-normalization, similarly
to last year [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The final set was obtained by keeping
all fused detections for each query.
      </p>
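      <p>The post-processing chain can be sketched in a few lines: per-query min-max scaling of the ACD scores into the 0-1 range, z-normalization per subsystem, and max-score merging across subsystems. This is a minimal sketch in Python with NumPy, assuming each subsystem delivers a score vector over the same candidate list; the data layout is an illustrative assumption.</p>
      <preformat>
# Minimal sketch of score normalization and fusion for one query.
import numpy as np

def minmax_scale(scores):
    """Scale one query's candidate scores into the 0-1 range."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def znorm(scores):
    return (scores - scores.mean()) / scores.std()

def fuse_max(per_system_scores):
    """Max-score merging over subsystems for one query's candidates."""
    return np.max(np.vstack(per_system_scores), axis=0)

# For each query: take the first 500 detection candidates per subsystem,
# scale and z-normalize, then fuse:
# fused = fuse_max([znorm(minmax_scale(s)) for s in subsystem_scores])
      </preformat>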
      <sec id="sec-5-1">
        <title>4characteristic for late submission</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND CONCLUSION</title>
      <p>
        We submitted four runs obtained from the supervised
(primary) and unsupervised (general) systems, including late
submissions, for the QUESST 2015 task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We did not perform
an evaluation for each individual query type T1/T2/T3,
but concentrated on the overall detection performance. The
maximum number of Gaussian mixtures (GMs) we
employed in both primary and general subsystems was 256 for
the supervised and 64 for the unsupervised AUM. A higher number of
GMs did not bring any improvement.
      </p>
      <p>
        The results obtained from all examined systems fell far
short of our expectations (Tab. 1). This can be explained by the
quality of the audio data, which were recorded in degraded
acoustic conditions and influenced by background noise. The
overall detection accuracy did not increase even though we
examined various speech enhancement techniques
(DC offset removal, spectral subtraction, minimum mean
squared error and Wiener filtering). Even the bottle-neck
features developed at Brno University of Technology (BUT)
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] did not work well for our system.
      </p>
      <p>The supervised AUM approach shows slightly better values of
C<sub>nxe</sub> and TWV in comparison with the unsupervised AUM. The
reason for the relatively high performance of the general approach is
the AM adaptation employed in unsupervised AUM, where
all spoken languages are covered. No significant
improvement can be observed for the late-submission systems, which
represent the retraining procedure employed in AUM. However, the
retraining process did not perform well in the case of
supervised AUM for the eval query set.</p>
      <p>
        The robust statistical model-based speech enhancement
methods embedded in the AUM, as well as HMM-based speech
segmentation, will be investigated in the future. The
processing load (PL) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for all systems on the development
query set, is shown in Tab. 2. Considering the same
search set, the processing load is nearly identical for the dev
and eval queries. The unsupervised AUM
has the advantage of fast processing, mainly due to the separate
segmentation and acoustic modeling for each language.
      </p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This publication is the result of the Project
implementation: University Science Park TECHNICOM for Innovation
Applications Supported by Knowledge Technology, ITMS:
26220220182, supported by the Research &amp; Development
Operational Programme funded by the ERDF (100%).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezl</surname>
          </string-name>
          , M. Karafiat, S. Kontar, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          .
          <article-title>Probabilistic and bottle-neck features for LVCSR of meetings</article-title>
          .
          <source>In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</source>
          <year>2007</year>
          ), pages
          <fpage>757</fpage>
          -
          <lpage>760</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juhar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Viszlay</surname>
          </string-name>
          .
          <article-title>Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition</article-title>
          .
          <source>In Modern Speech Recognition Approaches with Case Studies</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>154</lpage>
          . InTech Open Access,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 spoken web search task: system performance measures</article-title>
          .
          <source>Technical report</source>
          , Software Technologies Working Group (GTTS, http://gtts.ehu.es),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2015</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany, 14-15
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>van den Heuvel</surname>
          </string-name>
          et al.
          <article-title>SpeechDat-E: five Eastern European speech databases for voice-operated teleservices completed</article-title>
          .
          <source>In Proc. of INTERSPEECH</source>
          , pages
          <fpage>2059</fpage>
          -
          <lpage>2062</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vandecatseye</surname>
          </string-name>
          et al.
          <article-title>The COST278 pan-european broadcast news database</article-title>
          .
          <source>In Proc. of the 4th Intl. Conf. on Language Resources And Evaluation, LREC'04</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>Query-by-Example Retrieval via Fast Sequential Dynamic Time Warping Algorithm</article-title>
          .
          <source>In TSP 2014</source>
          , Berlin, DE, pages
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          . IEEE,
          <year>July 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vavrek</surname>
          </string-name>
          et al.
          <article-title>TUKE System for MediaEval 2014 QUESST</article-title>
          .
          <source>In Working Notes Proc. of the MediaEval 2014 Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          et al.
          <source>The HTK Book (for HTK Version 3.4)</source>
          . Cambridge University,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>