       SpeeD @ MediaEval 2014: Spoken Term Detection with
             Robust Multilingual Phone Recognition
                                Andi Buzo, Horia Cucu, Corneliu Burileanu
                       SpeeD Research Laboratory, University Politehnica of Bucharest
                                         {andi.buzo, horia.cucu, corneliu.burileanu}@upb.ro


ABSTRACT

In this paper, we attempt to resolve the Spoken Term Detection (STD) problem for under-resourced languages by phone recognition with a multilingual acoustic model covering three languages (Albanian, English and Romanian). Power Normalized Cepstral Coefficient (PNCC) features are used for improved robustness to noise.

1. INTRODUCTION AND APPROACH

We approach the Query by Example Search on Speech Task (QUESST) @ MediaEval 2014 [1] by using a multilingual acoustic model (AM) trained on three languages (Albanian, English and Romanian). The task involves searching for audio content within audio content using an audio query. The approach consists of two stages: (1) indexing, i.e. phone recognition of the content data, and (2) searching, i.e. finding a string of phones in the indexed content that is similar to that of the query, using a DTW-based search algorithm.

1.1 The acoustic model

In our approach, we want to compare the effect of using a multilingual AM against monolingual AMs. In order to achieve this, we have built five acoustic models, described in Table 1. AM training and phoneme recognition are performed using Hidden Markov Models (HMMs).

Table 1. Training data
  ID     Language                         No. phonemes    Training data [h]
  AM1    Romanian                         34              8.7
  AM2    Albanian                         36              4.1
  AM3    English                          75              3.9
  AM4    Multilingual, separate phones    145             16.7
  AM5    Multilingual, common phones      98              16.7
  AM6    Romanian MediaEval 2013          34              64

We have built an AM for each language (AM1 - AM3). AM1 is trained with 8.7 hours of read speech. We had more training data available for Romanian (in the MediaEval 2013 evaluation campaign we used 64 hours [2]), but this year we chose to train with less Romanian data in order to have a training data set balanced among the languages. AM2 is trained with 4.1 hours of Albanian read speech and broadcast news. AM3 is trained with 3.9 hours of native English read speech from the standard TIMIT database [3]. All three languages are among those used in the MediaEval 2014 evaluation campaign [1] (except that the campaign's English is non-native). Hence, using more training data would go beyond the context of the competition, which aims at low-resourced languages.

AM4 is trained with all the data from the three languages; phonemes from different languages, however, are trained separately. This leads to a large number of phonemes (145). AM5 was trained with the same data as AM4, but, in contrast, phonemes that are common to different languages were trained together, reducing the number of phonemes to 98, which is still high. The common phonemes were identified based on the International Phonetic Alphabet (IPA) classification [4]. It is interesting to note that Romanian and Albanian have more than 80% of their phonemes in common. English shares many consonants with the other two languages, but has very different vowels. AM6 is the model used by the SpeeD team in MediaEval 2013 and is tracked here for comparison [2].

Two types of speech features are used in this work: the common Mel Frequency Cepstral Coefficients (MFCC) and the Power Normalized Cepstral Coefficients (PNCC).

1.2 Searching algorithm

If the ASR accuracy were 100%, STD would reduce to a simple character string search of a query within a textual content. As the experimental results show, we are far from this ideal case; hence we have to find within the content a string which is similar to the query.

The DTW String Search (DTWSS) uses Dynamic Time Warping to align a string (a query) within a content. The search is not performed on the entire content, but only on a part of it, by means of a sliding window proportional to the length of the query. The term is considered detected if the DTW score is above a threshold. This method is refined by introducing a penalization for the short queries and for the spread of the DTW match. The score s is given by equation (1):

  s = (1 - PhER) * (1 - α * (LQ - LQm) / (LQM - LQm)) * (1 - β * (LW - LS) / LQ)    (1)

where LQ is the length of the query, LQM = 18 and LQm = 4 are the maximum and minimum query lengths found in the development data set, LW is the length of the sliding window, LS is the length of the matched term in the content, and α and β are tuning parameters. In this work, α and β are set to 0.6. The penalizations in formula (1) are motivated by the assumption that, for two queries of different lengths that match their respective contents with the same phone error rate (PhER), the match of the longer query is more likely to be the right one. Similarly, more compact DTW matches are assumed to be more probable than longer ones. This algorithm is suitable for queries of type 1 and 2, because DTW inherently handles small variations from the query, but it is not suitable for type 3 queries, where word order may be inverted.
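As an illustration, the sliding-window search and the scoring of formula (1) can be sketched as follows. This is a minimal sketch, not the SpeeD implementation: phone sequences are represented as character strings (one character per phone), a plain Levenshtein distance stands in for the DTW alignment cost, and the exhaustive sub-span matching, the window factor `win_factor` and the detection threshold are all assumptions of ours.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (DTW-cost stand-in)."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + cost)
    return d[n]

def score(query, segment, lw, lqm=4, lqM=18, alpha=0.6, beta=0.6):
    """Formula (1): (1 - PhER) with penalties for query length and match spread."""
    pher = edit_distance(query, segment) / len(query)
    lq, ls = len(query), len(segment)
    return ((1 - pher)
            * (1 - alpha * (lq - lqm) / (lqM - lqm))
            * (1 - beta * (lw - ls) / lq))

def dtw_string_search(query, content, threshold=0.3, win_factor=2.0):
    """Slide a window proportional to the query length over the indexed
    content and report every sub-span whose score exceeds the threshold."""
    lw = max(len(query), int(win_factor * len(query)))
    hits = []
    for start in range(max(1, len(content) - lw + 1)):
        window = content[start:start + lw]
        # Exhaustive sub-span matching stands in for the DTW alignment here.
        for i in range(len(window)):
            for j in range(i + 1, len(window) + 1):
                s = score(query, window[i:j], lw)
                if s >= threshold:
                    hits.append((start + i, start + j, s))
    return hits
```

The exhaustive inner loop is quadratic in the window length; a real system would recover the best-matching sub-span directly from the DTW alignment path instead.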

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
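For illustration, the IPA-based pooling of phonemes used for AM5 in Section 1.1 can be sketched as below. The toy inventories and the ASCII stand-ins for IPA symbols are hypothetical (the paper does not list the actual 34/36/75-phone sets), and the helper names are our own; the real mapping follows the IPA classification [4].

```python
def merge_inventories(lang_tables):
    """Pool phones that map to the same IPA symbol into one shared unit.

    lang_tables: dict of language -> dict(phone label -> IPA symbol).
    Returns dict of IPA symbol -> list of (language, phone) pairs."""
    merged = {}
    for lang, table in lang_tables.items():
        for phone, ipa in table.items():
            merged.setdefault(ipa, []).append((lang, phone))
    return merged

def shared_fraction(table_a, table_b):
    """Fraction of the smaller inventory whose IPA symbols occur in the other."""
    ipa_a, ipa_b = set(table_a.values()), set(table_b.values())
    return len(ipa_a & ipa_b) / min(len(ipa_a), len(ipa_b))

# Toy 3-phone inventories per language (hypothetical labels and symbols;
# "S" stands in for the IPA esh, "D"/"T" for the dental fricatives).
romanian = {"a": "a", "sh": "S", "ts": "ts"}
albanian = {"a": "a", "sh": "S", "dh": "D"}
english  = {"ae": "ae", "sh": "S", "th": "T"}

merged = merge_inventories({"ro": romanian, "sq": albanian, "en": english})
# Here 9 language-specific phones collapse to 6 shared units, mirroring
# how AM5 reduces AM4's 145 separate phonemes to 98 pooled ones.
```

In AM5, each pooled unit is then trained on data from every language that contains it, while language-unique phones keep their own models.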
Figure 1. The results for the development data set

2. EXPERIMENTAL RESULTS

2.1 STD results

The results obtained with different acoustic models on the development data set are shown in Figure 1. The comparison is made using the Maximum Term-Weighted Value (MTWV) and Detection Error Tradeoff (DET) curves. The speech features used are the PNCCs. Among the acoustic models trained on a single language, the Romanian AM outperformed the other two, most probably because it is trained on more data (8.7 h vs. ~4 h). AM4 performed slightly better than the monolingual acoustic models: on the one hand, it is trained with multiple languages, which should increase phoneme recognition accuracy; on the other hand, its number of phonemes is significantly larger, which increases the uncertainty during recognition. AM5 mitigates the latter aspect by not training common phonemes separately across languages, and the results show an improvement in performance. However, the best results are obtained with AM6. Even though it is trained with only one language (Romanian), it is trained on a large amount of data (64 h) and its phoneme set is relatively small (34). This means that larger phoneme sets need more training data. Regarding the STD task, it seems that training with multiple languages increases performance, but more data are needed to consolidate the acoustic models.

Table 2. PNCC vs. MFCC performance comparison
  ID     PNCC ACnxe   PNCC MinCnxe   MFCC ACnxe   MFCC MinCnxe
  AM1    1.032        0.986          1.032        0.986
  AM2    1.055        0.997          1.055        0.997
  AM3    1.03         0.994          1.03         0.994
  AM4    1.015        0.972          1.016        0.971
  AM5    1.016        0.969          1.016        0.969

The results obtained on the development database with different speech features (PNCC and MFCC) are shown in Table 2. The metric used is the normalized cross entropy cost (Cnxe). The results show almost no difference between the two types of features; the same conclusion is drawn when comparing by the TWV metric. In general speech recognition, PNCCs obtain better accuracy in noisy conditions, but, most probably, the noise in the MediaEval 2014 database is not significant. Therefore, the use of PNCC did not bring any improvement.

2.2 Official runs results

The results obtained by the official runs on the evaluation database are shown in Table 3; the metrics used are the actual and the minimum Cnxe. Because no tuning was made on the development data set, the results on the evaluation data set are quite similar and the same conclusions can be drawn. Table 3 also shows the results per query type. Better results are obtained for query type 2: in contrast to type 1 queries, these queries are longer, which may have affected the results. Query type 3 obtained slightly worse performance, most probably because of the reordering of the words in such queries.

Table 3. Official runs (Actual/Min Cnxe)
  ID     Overall        Type 1         Type 2         Type 3
  AM1    1.032/0.990    1.035/0.990    1.027/0.982    1.039/0.992
  AM2    1.053/0.997    1.057/0.999    1.046/0.994    1.052/0.995
  AM3    1.027/0.990    1.029/0.991    1.024/0.983    1.032/0.994
  AM4    1.017/0.977    1.019/0.976    1.012/0.973    1.018/0.974
  AM5    1.017/0.972    1.019/0.972    1.016/0.970    1.017/0.963

The results are obtained on a Xeon E5-2430 (6 cores, 2.20 GHz, 48 GB RAM) under Linux Ubuntu 12.04.2 LTS. The Indexing Speed Factor (ISF), Searching Speed Factor (SSF) and Peak Memory Usage for indexing and searching (PMUi and PMUs), as described in [5], are almost the same for all runs (the runs differ only in the AM used). Their average values are ISF = 0.81, SSF = 1.2*10^-5 s^-1, PMUi = 2203 MB, PMUs = 197 MB.

3. CONCLUSIONS

We have approached STD as a two-step process. A monolingual or multilingual ASR system is used as a phone recognizer for indexing the database, while a DTW-based algorithm is used for searching a given query in the content database. The results show that training with multiple languages increases detection accuracy; however, the quantity of data used for training is insufficient for such a large phoneme set. The searching algorithm works better for query types 1 and 2 and slightly worse for query type 3, where word order may be inverted.

4. REFERENCES
[1] X. Anguera, L.-J. Rodriguez-Fuentes, I. Szöke, A. Buzo and F. Metze, "Query by Example Search on Speech at MediaEval 2014", in Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] A. Buzo, H. Cucu, I. Molnar, B. Ionescu and C. Burileanu, "SpeeD@MediaEval 2013: A Phone Recognition Approach to Spoken Term Detection", in Proc. MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[3] J. S. Garofolo, et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus", Linguistic Data Consortium, Philadelphia, 1993.
[4] International Phonetic Association, http://www.langsci.ucl.ac.uk/ipa/
[5] L.-J. Rodriguez-Fuentes and M. Penagarikano, "MediaEval 2013 Spoken Web Search Task: System Performance Measures", Technical report, GTTS, UPV/EHU, May 2013.