SpeeD @ MediaEval 2015: Multilingual Phone Recognition Approach to Query By Example STD

Alexandru Caranica, Andi Buzo, Horia Cucu, Corneliu Burileanu
SpeeD Research Laboratory, University Politehnica of Bucharest
Bucharest, Romania
alexandru.caranica@speed.pub.ro
{andi.buzo, horia.cucu, corneliu.burileanu}@upb.ro

ABSTRACT
In this paper, we attempt to solve the Spoken Term Detection (STD) problem for under-resourced languages by a phone recognition approach within the Automatic Speech Recognition (ASR) paradigm, with multilingual acoustic models from six languages (Albanian, Czech, English, Hungarian, Romanian and Russian). The Power Normalized Cepstral Coefficients (PNCC) features are used for improved robustness to noise, along with Phone Posteriorgrams, in order to obtain content-aware acoustic features that are as independent as possible from speaker and acoustic environment.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

1. INTRODUCTION & APPROACH
We approach the Query by Example Search on Speech Task (QUESST [1]) @ MediaEval 2015 by using multilingual acoustic models (AM) trained with six languages (Albanian, Czech, English, Hungarian, Romanian and Russian). The task involves searching for audio content within audio content using an audio query.

The approach consists of two stages:
1. The indexing, i.e. the phone recognition of the content data.
2. The searching, i.e. finding a string of phones in the indexed content that is similar to the one of the query, by using a DTW based searching algorithm.

Unlike previous years, in 2015 the audio database features a more challenging acoustic environment, by introducing noise and reverberation. We expected PNCC features to perform better in this scenario.

As increasing the training database in comparison with last year would go beyond the context of the competition (which aims at low-resourced languages), we tried introducing new languages in the training phase to see how they perform, along with a neural network based phoneme recognizer from BUT [2].

1.1 Acoustic models
In our approach, one thing we want to compare is the effect of using PNCC features vs MFCC, along with the improvements a robust phone recognizer based on neural network classifiers can bring to our STD task. For the first comparison, we have built four acoustic models, using internal audio resources for training, as described in Table 1. The AM training and the phoneme recognition are made in a conventional way, using Hidden Markov Models (HMMs), in CMU Sphinx [3].

We have built an AM for each language (AM1 - AM3). AM1 is trained with 8.7 hours of Romanian read speech. We could have trained the Romanian AM with more data, but as we stated in the introduction, we wanted to have balanced training data among different languages, for an under-resourced task. AM2 is trained with 4.1 hours of Albanian read speech and broadcast news. AM3 is trained with 3.9 hours of native English read speech from the standard TIMIT database [4]. AM4 is trained with all the data from the three languages above. Phonemes that are common in different languages were trained together, thus reducing the number of phonemes to 98. This was necessary to try and keep uncertainty as low as possible during the recognition phase. The identification of the common phonemes was made based on the International Phonetic Alphabet (IPA) classification [5]. Two speech feature types are used in this first approach: the common Mel Frequency Cepstral Coefficients (MFCC) and the Power Normalized Cepstral Coefficients (PNCC) [6].

Table 1. Training data for HMM approach
ID     Language                       No. phones   Training data [h]
AM1    Romanian                       34           8.7
AM2    Albanian                       36           4.1
AM3    English                        75           3.9
AM4    Multilingual common phones     98           16.7

For our second approach, we used phone posteriorgrams that are output by the robust phoneme recognizer from BUT. This phone recognizer uses a split temporal context (STC) based feature extraction, with neural network classifiers [7] to output phone posteriorgrams, while the Viterbi algorithm is used for phoneme string decoding. We can use the output of this tool as input features in our DTW search algorithm, to do the matching. In order to use additional languages to build features for the phoneme recognizer, we used the pre-trained systems available at [8] and described in Table 2.

Table 2. Trained systems used for STC approach
ID     Language     No. phonemes   WER [%]
AM5    Czech        45             24.24
AM6    Hungarian    61             33.32
AM7    Russian      52             39.27

The languages used for training the systems described in Table 2 are from the SpeechDat-E Eastern European Speech Database [9]. Another incentive to use these systems is the existence of trained non-speech events mapped to the following tokens, which should prove useful with this year's challenging acoustic environment:
• "int" for intermittent noise
• "spk" for speaker noise
• "pau" for silent pause

The STC approach is based on the theoretical study that significant information about a phoneme is spread over a few hundred milliseconds and that an STC system can process two parts of the phoneme independently. The trajectory representing a phoneme feature can then be decorrelated by splitting it into two parts, to limit the size of the model, in particular the number of weights in the neural net (NN). The system uses two blocks of features, for left and right contexts (the blocks have one frame overlap). Before splitting, the speech signal is filtered by applying the Hamming window on the whole block, so that the original central frame is emphasized. Dimensions of vectors are then reduced by DCT and results are sent to two neural networks. The posteriors from both contexts are, in the final stage, merged, after the front-end neural networks are able to generate a three-state per phoneme posterior model [10]. The above described features were used in this work as input to our search algorithm, which is described in the following section.

1.2 Searching algorithm
In ideal conditions, if an ASR system output utterances with 100% accuracy and precision, then the STD would be reduced to a simple character string search of a query within a textual content. As the experimental results show, we are far from the ideal case, hence we have to find within a content a string which is similar to the query.

Our proposed DTW String Search (DTWSS) uses Dynamic Time Warping to align a string (a query) within a content. The search is not performed on the entire content, but only on a part of it, by means of a sliding window proportional to the length of the query. The term is considered detected if the DTW score is above a threshold. This method is refined by introducing a penalization for the short queries and for the spread of the DTW match. The formula for the score s is given by equation (1):

$$ s = (1 - PhER)\left(1 + \alpha \frac{L_q - L_{Qm}}{L_{QM} - L_{Qm}}\right)\left(1 + \beta \frac{L_W - L_S}{L_Q}\right) \tag{1} $$

where $L_q$ is the length of the query, $L_{QM} = 18$ and $L_{Qm} = 4$ are the maximum and the minimum query lengths found in the development data set, $L_W$ is the length of the sliding window, $L_S$ is the length of the matched term in the content, while α and β are the tuning parameters. For this task, α and β are set to 0.6, from previous evaluations [12]. The penalizations in formula (1) are motivated by the assumption that for two queries of different length that match their respective contents with the same phone error rate (PhER), the match of the longer query is more likely to be the right one. Similarly, the more compact DTW matches are assumed to be more probable than the longer ones. This algorithm is suitable for queries of type 1 and 2, because the DTW inherently handles small variations from the query, but it is not suitable for queries of type 3, where word order may be inverted.

2. EXPERIMENTAL RESULTS

2.1 STD results
For our first comparison, the results obtained on the development database with the first two types of features (MFCC and PNCC) are shown in Table 3.

Table 3. MFCC vs PNCC performance comparison
ID     MFCC                 PNCC
       ACnxe     MinCnxe    ACnxe     MinCnxe
AM1    1.0061    0.9944     1.0061    0.9943
AM2    1.0059    0.9947     1.0058    0.9947
AM3    1.0055    0.9944     1.0047    0.9933
AM4    1.0047    0.9933     1.0047    0.9933

For the third type of features, posteriorgrams with the STC approach, results are shown in Table 4.

Table 4. Posteriorgram performance results
ID     ACnxe     MinCnxe
AM5    1.0055    0.9945
AM6    1.0048    0.9935
AM7    1.0056    0.9941

All our runs used our proposed DTW algorithm, described in section 1.2. The metric used is the normalized cross entropy cost (Cnxe) [11]. The results show almost no difference between the three types of features. Although PNCC should offer better accuracy in noisy conditions and the BUT systems had tokens for noise, this was not reflected in the results, with no big improvements obtained.

2.2 Official run results
The results obtained by the official runs on the evaluation database are shown in Table 5. We selected the five best models, sorted by the Actual Cnxe metric. Because no tuning is made based on the development data set, the results on the evaluation data set are quite similar and the same conclusions can be drawn. Table 5 also shows the results per query type.

Table 5. Official 2015 runs
ID      Overall ACnxe   Type 1 ACnxe   Type 2 ACnxe   Type 3 ACnxe
AM6     1.0375          1.0384         1.0370         1.0376
AM7     1.0379          1.0391         1.0384         1.0365
AM1     1.0379          1.0383         1.0372         1.0385
AM4*    1.0380          1.0386         1.0378         1.0379
AM4     1.0381          1.0388         1.0378         1.0379
*PNCC model

It can be noticed that slightly better results are obtained for query type 2 in four of the five runs (AM7 being the exception); these queries are longer, which may have affected the results. Posterior models (AM6/AM7) seem to offer minimal performance improvements, so we cannot draw a conclusion that posteriors are better suited for the STD task in difficult acoustical environments. Going further, we think improvements can be obtained by preprocessing the audio first, before extracting features, to remove noise or possible reverberation.

3. CONCLUSION
We have approached STD with a two-step process. A multilingual ASR is used as a phone recognizer for indexing the database, while a DTW based algorithm is used for searching a given query in the content database. We tested three types of features (MFCC, PNCC and Posteriorgrams) with two approaches to the phoneme recognizer (statistical HMMs and a neural STC approach). The results show no big improvement between the approaches and feature types, in part because this year's database features very challenging acoustic environments, and our phonetizers return a lot of "noise" tokens or repeated phonemes, which reflected further upon our DTW algorithm.

4. ACKNOWLEDGEMENTS
The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreements POSDRU/159/1.5/S/132395 and in part by the PN II Programme "Partnerships in priority areas" of MEN - UEFISCDI, through project no. 332/2014.

5. REFERENCES
[1] I. Szöke, L.-J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proença, M. Lojka, and X. Xiong, "Query by example search on speech at MediaEval 2015," Working Notes Proceedings of the MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany, 2015.
[2] P. Schwarz, P. Matejka, J. Cernocky, Towards Lower Error Rates in Phoneme Recognition, in Proc. TSD2004, Brno, Czech Republic, 2004.
[3] CMU Sphinx, An open source toolkit for speech recognition, Carnegie Mellon University, http://cmusphinx.sourceforge.net/, accessed August 2015.
[4] J.S. Garofolo, et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia, 1993.
[5] The International Phonetic Alphabet and the IPA Chart, https://www.internationalphoneticassociation.org/, accessed August 2015.
[6] F. Kelly, N. Harte, A comparison of auditory features for robust speech recognition, in Proc. of 18th European Signal Processing Conference (EUSIPCO-2010).
[7] P. Schwarz, Phoneme Recognition based on Long Temporal Context, PhD Thesis, Brno University of Technology, 2009.
[8] Phoneme recognizer based on long temporal context, http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context, accessed August 2015.
[9] Eastern European Speech Databases for Creation of Voice Driven Teleservices, http://www.fee.vutbr.cz/SPEECHDAT-E/, accessed August 2015.
[10] P. Schwarz, P. Matejka, J. Cernocky, Hierarchical Structures of Neural Networks for Phoneme Recognition, in Proc. ICASSP 2006, pp. 325-328, Toulouse, France, 2006.
[11] L.-J. Rodriguez-Fuentes and M. Penagarikano, MediaEval 2013 Spoken Web Search Task: System Performance Measures, Technical report, GTTS, UPV/EHU, May 2013.
[12] A. Buzo, H. Cucu, I. Molnar, B. Ionescu and C. Burileanu, SpeeD@MediaEval 2013: A Phone Recognition Approach to Spoken Term Detection, in Proc. MediaEval 2013 Workshop, Barcelona, Spain, 2013.
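
As an illustration of the scoring rule in equation (1) and of the sliding-window search described in section 1.2, a minimal Python sketch is given below. It is not the implementation used for the official runs, and it rests on several assumptions: the DTW alignment of phone strings is approximated by an edit-distance alignment with free start and end inside the window, the denominator $L_Q$ of equation (1) is taken to be the query length, and the names `dtwss_score`, `best_match`, `dtwss_search`, the `window_factor` of 1.5 and the default threshold are hypothetical choices made only for this example.

```python
from typing import List, Sequence, Tuple

L_QM, L_Qm = 18, 4   # max / min query lengths reported for the development set
ALPHA = BETA = 0.6   # tuning parameters reported in section 1.2


def dtwss_score(pher: float, l_q: int, l_w: int, l_s: int,
                alpha: float = ALPHA, beta: float = BETA) -> float:
    """Equation (1). Assumption: the denominator L_Q is the query length."""
    length_term = 1.0 + alpha * (l_q - L_Qm) / (L_QM - L_Qm)  # favours long queries
    spread_term = 1.0 + beta * (l_w - l_s) / l_q              # favours compact matches
    return (1.0 - pher) * length_term * spread_term


def best_match(query: Sequence[str], window: Sequence[str]) -> Tuple[int, int]:
    """Return (errors, matched_length) for the best approximate occurrence
    of `query` inside `window` (edit distance with free start and end)."""
    n = len(window)
    prev_cost = [0] * (n + 1)        # an empty query matches anywhere at no cost
    prev_start = list(range(n + 1))  # start index of the match ending at column j
    for i, phone in enumerate(query, 1):
        cur_cost = [i] + [0] * n     # column 0: the first i query phones are unmatched
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev_cost[j - 1] + (phone != window[j - 1])  # match / substitution
            dele = prev_cost[j] + 1                            # query phone missing
            ins = cur_cost[j - 1] + 1                          # extra phone in content
            best = min(sub, dele, ins)
            cur_cost[j] = best
            if best == sub:
                cur_start[j] = prev_start[j - 1]
            elif best == dele:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_cost, prev_start = cur_cost, cur_start
    end = min(range(n + 1), key=lambda j: prev_cost[j])
    return prev_cost[end], end - prev_start[end]


def dtwss_search(query: Sequence[str], content: Sequence[str],
                 threshold: float = 1.0,
                 window_factor: float = 1.5) -> List[Tuple[int, float]]:
    """Slide a window proportional to the query length over the content and
    return (window position, score) for every window scoring above threshold."""
    l_q = len(query)
    l_w = max(l_q, round(window_factor * l_q))
    hits = []
    for pos in range(max(1, len(content) - l_w + 1)):
        window = content[pos:pos + l_w]
        errors, l_s = best_match(query, window)
        pher = min(1.0, errors / l_q)  # phone error rate of the aligned match
        score = dtwss_score(pher, l_q, len(window), l_s)
        if score > threshold:
            hits.append((pos, score))
    return hits


if __name__ == "__main__":
    query = "a b c d".split()
    content = "x a b c d y z w".split()
    for pos, score in dtwss_search(query, content, threshold=1.2):
        print(pos, round(score, 4))
```

In this sketch an error-free match of a minimum-length query does not score 1 but slightly above it, because the spread term rewards a match that is shorter than the window. Any detection threshold therefore has to be chosen with the values of α, β and the window length in mind.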