<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SpeeD @ MediaEval 2014: Spoken Term Detection with Robust Multilingual Phone Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andi</forename><surname>Buzo</surname></persName>
							<email>andi.buzo@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Horia</forename><surname>Cucu</surname></persName>
							<email>horia.cucu@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Corneliu</forename><surname>Burileanu</surname></persName>
							<email>corneliu.burileanu@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">SpeeD @ MediaEval 2014: Spoken Term Detection with Robust Multilingual Phone Recognition</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B1B97FF2046A371F5CCDF674A440C239</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we address the Spoken Term Detection (STD) problem for under-resourced languages by phone recognition with a multilingual acoustic model covering three languages (Albanian, English and Romanian). Power Normalized Cepstral Coefficients (PNCC) are used as features for improved robustness to noise.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION AND APPROACH</head><p>We approach the Query by Example Search on Speech Task (QUESST) @ MediaEval 2014 <ref type="bibr" target="#b1">[1]</ref> using a multilingual acoustic model (AM) trained on three languages (Albanian, English and Romanian). The task involves searching for audio content within audio content using an audio query. The approach consists of two stages: (1) indexing, i.e. phone recognition of the content data, and (2) searching, i.e. finding, with a DTW-based search algorithm, a string of phones in the indexed content that is similar to that of the query.</p></div>
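The two-stage pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual implementation: the function names are hypothetical, and the toy demo replaces the phone recognizer and the DTW search with trivial stand-ins just to exercise the two stages.

```python
def index_content(content_audio, recognizer):
    """Stage 1 (indexing): run phone recognition on every content item,
    turning audio into a phone string keyed by utterance id."""
    return {uid: recognizer(audio) for uid, audio in content_audio.items()}

def search_query(query_phones, indexed, dtw_score, threshold):
    """Stage 2 (searching): score the query's phone string against each
    indexed item with a DTW-based search; keep items above the threshold."""
    hits = []
    for uid, phones in indexed.items():
        s = dtw_score(query_phones, phones)
        if s >= threshold:
            hits.append((uid, s))
    return sorted(hits, key=lambda h: -h[1])

# Toy demo: "recognition" is the identity function and the score is a
# crude substring check -- placeholders, not the real components.
indexed = index_content({"utt1": "kasa", "utt2": "mare"}, lambda a: a)
hits = search_query("asa", indexed, lambda q, c: 1.0 if q in c else 0.0, 0.5)
```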
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">The acoustic model</head><p>In our approach, we want to compare the effect of using a multilingual AM against monolingual AMs. To this end we built five acoustic models, described in Table <ref type="table" target="#tab_0">1</ref>. AM training and phoneme recognition are performed using Hidden Markov Models (HMMs). We built an AM for each language (AM1-AM3). AM1 is trained with 8.7 hours of Romanian read speech. More training data was available for Romanian (in the MediaEval 2013 evaluation campaign we used 64 hours <ref type="bibr" target="#b2">[2]</ref>), but this year we chose to train with less Romanian data in order to have a balanced training set across languages. AM2 is trained with 4.1 hours of Albanian read speech and broadcast news. AM3 is trained with 3.9 hours of native English read speech from the standard TIMIT database <ref type="bibr" target="#b3">[3]</ref>. All three languages are among those used in the MediaEval 2014 evaluation campaign <ref type="bibr" target="#b1">[1]</ref> (although the campaign's English, unlike TIMIT, is non-native). Hence, using more training data would go against the spirit of the competition, which targets low-resourced languages. AM4 is trained with all the data from the three languages; the phonemes of different languages, however, are trained separately, which leads to a large number of phonemes (145). AM5 is trained with the same data as AM4, but phonemes that are common to several languages are trained together, reducing the number of phonemes to 98, which is still high. Common phonemes were identified using the International Phonetic Alphabet (IPA) classification <ref type="bibr" target="#b0">[4]</ref>. It is interesting to note that Romanian and Albanian share more than 80% of their phonemes.</p><p>As for English, it shares many consonants with the other two languages, but its vowels are very different. AM6 is the model used by the SpeeD team in MediaEval 2013 and is included here for comparison <ref type="bibr" target="#b2">[2]</ref>. Two types of speech features are used in this work: the common Mel Frequency Cepstral Coefficients (MFCC) and the Power Normalized Cepstral Coefficients (PNCC).</p></div>
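The effect of pooling common phonemes (AM4's 145 separate units vs. AM5's 98 shared units) can be illustrated with a toy inventory. The mapping below is hypothetical and only demonstrates the counting; the real per-language phone sets are far larger:

```python
# Toy (language, phone) -> IPA label mapping; illustrative only.
ipa_map = {
    ("ro", "a"):  "a",  ("sq", "a"):  "a",   # vowel shared by Romanian and Albanian
    ("ro", "sh"): "ʃ",  ("sq", "sh"): "ʃ",   # shared postalveolar fricative
    ("en", "ae"): "æ",                        # English-specific vowel
}
separate_units = len(ipa_map)               # AM4-style: each (language, phone) is its own unit
shared_units = len(set(ipa_map.values()))   # AM5-style: phones with the same IPA label pooled
```

In the toy inventory the pooled model needs 3 units instead of 5; the same counting over the full inventories yields the 145 vs. 98 figures in the text.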
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Searching algorithm</head><p>If ASR accuracy were 100%, STD would reduce to a simple character-string search of the query within the textual content. As the experimental results show, we are far from this ideal case; hence we must find, within the content, a string that is merely similar to the query.</p><p>The DTW String Search (DTWSS) uses Dynamic Time Warping to align a string (the query) within the content. The search is not performed on the entire content, but only on a part of it, by means of a sliding window proportional to the length of the query. The term is considered detected if the DTW score is above a threshold. The method is refined by introducing penalizations for short queries and for the spread of the DTW match. The score s is given by equation (1):</p><formula xml:id="formula_0">s = (1 - PhER)\left(1 - \alpha\,\frac{L_{QM} - L_Q}{L_{QM} - L_{Qm}}\right)\left(1 - \beta\,\frac{L_S - L_Q}{L_W}\right)<label>1</label></formula><p>where LQ is the length of the query, LQM=18 and LQm=4 are the maximum and minimum query lengths found in the development data set, LW is the length of the sliding window, LS is the length of the matched term in the content, and α and β are tuning parameters, both set to 0.6 in this work. The penalizations in equation (1) are motivated by the assumption that, for two queries of different lengths that match their respective contents with the same phone error rate (PhER), the match of the longer query is more likely to be the right one. Similarly, more compact DTW matches are assumed to be more probable than longer ones. The algorithm is suitable for queries of types 1 and 2, because DTW inherently handles small variations from the query, but it is not suitable for queries of type 3, where word order may be inverted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">STD results</head><p>The results obtained with the different acoustic models on the development data set are shown in Figure <ref type="figure" target="#fig_0">1</ref>. The comparison uses the Maximum Term-Weighted Value (MTWV) and Detection Error Tradeoff (DET) curves. The speech features used are the PNCCs. Among the acoustic models trained on a single language, the Romanian AM outperforms the other two, most probably because it is trained on more data (8.7h vs. ~4h). AM4 performs slightly better than the monolingual acoustic models: on the one hand, it is trained with multiple languages, which should increase phoneme recognition accuracy; on the other hand, its number of phonemes is significantly larger, which increases the uncertainty during recognition. AM5 addresses this latter aspect by no longer training common phonemes separately across languages, and the results show a corresponding improvement. However, the best results are obtained with AM6. Even though it is trained with only one language (Romanian), it is trained with a large amount of data (64h) and its phoneme set is relatively small (34). This suggests that larger phoneme sets require more training data. Regarding the STD task, it appears that training with multiple languages increases performance, but more data is needed to consolidate the acoustic models. The results obtained on the development database with different speech features (PNCC and MFCC) are shown in Table <ref type="table">2</ref>. The metric used is the normalized cross-entropy cost (Cnxe). The results show almost no difference between the two types of features; the same conclusion holds when comparing by the TWV metric. In general, PNCCs achieve better accuracy in noisy conditions, but, most probably, the noise in the MediaEval 2014 database is not significant. Therefore, the use of PNCC did not bring any improvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Official runs results</head><p>The results obtained by the official runs on the evaluation database are shown in Table <ref type="table" target="#tab_2">3</ref>; the metrics used are the actual and the minimum Cnxe. Because no tuning was performed on the development data set, the results on the evaluation data set are quite similar and support the same conclusions. Table <ref type="table" target="#tab_2">3</ref> also shows the results per query type. Better results are obtained for query type 2: these queries are longer than those of type 1, which may account for the difference. Query type 3 yields slightly worse performance, most probably because of the reordering of words in such queries. The results were obtained on a Xeon E5-2430, 6 cores, 2.20GHz, 48GB, under Linux Ubuntu 12.04.2 LTS. The Indexing Speed Factor (ISF), Searching Speed Factor (SSF) and Peak Memory Usage for indexing and searching (PMUi and PMUs), as described in <ref type="bibr" target="#b4">[5]</ref>, are almost the same for all runs (the runs differ only in the AM used). Their average values are ISF=0.81, SSF=1.2×10⁻⁵ s⁻¹, PMUi=2203MB, PMUs=197MB.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">CONCLUSIONS</head><p>We have approached STD as a two-step process: monolingual or multilingual ASR is used as a phone recognizer to index the database, while a DTW-based algorithm is used to search for a given query in the content database. The results show that training with multiple languages increases detection accuracy; however, the quantity of training data used is insufficient for such a large phoneme set. The searching algorithm works better for query types 1 and 2 and slightly worse for query type 3, where word order may be inverted. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. The results for the development data set</figDesc><graphic coords="2,52.56,88.40,233.76,205.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 . Training data</head><label>1</label><figDesc></figDesc><table><row><cell>ID</cell><cell>Language</cell><cell>No.</cell><cell>Training</cell></row><row><cell></cell><cell></cell><cell>phonemes</cell><cell>data [h]</cell></row><row><cell>AM1</cell><cell>Romanian</cell><cell>34</cell><cell>8.7</cell></row><row><cell>AM2</cell><cell>Albanian</cell><cell>36</cell><cell>4.1</cell></row><row><cell>AM3</cell><cell>English</cell><cell>75</cell><cell>3.9</cell></row><row><cell>AM4</cell><cell>Multilingual separate phones</cell><cell>145</cell><cell>16.7</cell></row><row><cell>AM5</cell><cell>Multilingual common phones</cell><cell>98</cell><cell>16.7</cell></row><row><cell>AM6</cell><cell>Romanian MediaEval 2013</cell><cell>34</cell><cell>64</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 . PNCC vs. MFCC performance comparison</head><label>2</label><figDesc></figDesc><table><row><cell>ID</cell><cell></cell><cell>PNCC</cell><cell></cell><cell>MFCC</cell></row><row><cell></cell><cell>ACnxe</cell><cell>MinCnxe</cell><cell>ACnxe</cell><cell>MinCnxe</cell></row><row><cell>AM1</cell><cell>1.032</cell><cell>0.986</cell><cell>1.032</cell><cell>0.986</cell></row><row><cell>AM2</cell><cell>1.055</cell><cell>0.997</cell><cell>1.055</cell><cell>0.997</cell></row><row><cell>AM3</cell><cell>1.03</cell><cell>0.994</cell><cell>1.03</cell><cell>0.994</cell></row><row><cell>AM4</cell><cell>1.015</cell><cell>0.972</cell><cell>1.016</cell><cell>0.971</cell></row><row><cell>AM5</cell><cell>1.016</cell><cell>0.969</cell><cell>1.016</cell><cell>0.969</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 . Official runs</head><label>3</label><figDesc></figDesc><table><row><cell></cell><cell>Overall</cell><cell>Type 1</cell><cell>Type 2</cell><cell>Type 3</cell></row><row><cell></cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell></row><row><cell>AM1</cell><cell>1.032/0.990</cell><cell>1.035/0.990</cell><cell>1.027/0.982</cell><cell>1.039/0.992</cell></row><row><cell>AM2</cell><cell>1.053/0.997</cell><cell>1.057/0.999</cell><cell>1.046/0.994</cell><cell>1.052/0.995</cell></row><row><cell>AM3</cell><cell>1.027/0.990</cell><cell>1.029/0.991</cell><cell>1.024/0.983</cell><cell>1.032/0.994</cell></row><row><cell>AM4</cell><cell>1.017/0.977</cell><cell>1.019/0.976</cell><cell>1.012/0.973</cell><cell>1.018/0.974</cell></row><row><cell>AM5</cell><cell>1.017/0.972</cell><cell>1.019/0.972</cell><cell>1.016/0.970</cell><cell>1.017/0.963</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m">International Phonetic Alphabet (IPA)</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Query by Example Search on Speech at Mediaeval 2014</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szöke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the Mediaeval 2014 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date when="2014-10">October 16-17, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SpeeD@MediaEval 2013: A Phone Recognition Approach to Spoken Term Detection</title>
		<author>
			<persName><forename type="first">Andi</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Horia</forename><surname>Cucu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iris</forename><surname>Molnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Corneliu</forename><surname>Burileanu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Mediaeval 2013 Workshop</title>
				<meeting>Mediaeval 2013 Workshop<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">TIMIT Acoustic-Phonetic Continuous Speech Corpus</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Garofolo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistic Data Consortium</title>
				<meeting><address><addrLine>Philadelphia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">MediaEval 2013 Spoken Web Search Task: System Performance Measures</title>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Penagarikano</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013-05">May 2013</date>
		</imprint>
		<respStmt>
			<orgName>GTTS, UPV/EHU</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
