SpeeD @ MediaEval 2015: Multilingual Phone Recognition Approach to Query By Example STD
Alexandru Caranica, Andi Buzo, Horia Cucu, Corneliu Burileanu
SpeeD Research Laboratory, University Politehnica of Bucharest
Bucharest, Romania
alexandru.caranica@speed.pub.ro, {andi.buzo, horia.cucu, corneliu.burileanu}@upb.ro


Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we attempt to solve the Spoken Term Detection (STD) problem
for under-resourced languages by a phone recognition approach within the
Automatic Speech Recognition (ASR) paradigm, with multilingual acoustic
models from six languages (Albanian, Czech, English, Hungarian, Romanian and
Russian). Power Normalized Cepstral Coefficients (PNCC) features are used
for improved robustness to noise, along with phone posteriorgrams, in order
to obtain content-aware acoustic features that are as independent as
possible from the speaker and the acoustic environment.

1. INTRODUCTION & APPROACH
We approach the Query by Example Search on Speech Task (QUESST [1]) @
MediaEval 2015 by using multilingual acoustic models (AM) trained with six
languages (Albanian, Czech, English, Hungarian, Romanian and Russian). The
task involves searching for an audio query within a collection of audio
content.

The approach consists of two stages:
1. The indexing, i.e. the phone recognition of the content data.
2. The searching, i.e. finding, in the indexed content, a string of phones
   similar to that of the query, using a Dynamic Time Warping (DTW) based
   search algorithm.

Unlike previous years, in 2015 the audio database features a more
challenging acoustic environment, with added noise and reverberation. We
expected PNCC features to perform better in this scenario.

Since enlarging the training database relative to last year would go beyond
the scope of the competition (which targets low-resourced languages), we
instead introduced new languages in the training phase to see how they
perform, along with a neural network based phoneme recognizer from Brno
University of Technology (BUT) [2].

1.1 Acoustic models
In our approach, one thing we want to compare is the effect of using PNCC
features versus MFCC features, along with the improvements that a robust
phone recognizer based on neural network classifiers can bring to our STD
task. For the first comparison, we built four acoustic models, using
internal audio resources for training, as described in Table 1. The AM
training and the phoneme recognition are done in a conventional way, using
Hidden Markov Models (HMMs), in CMU Sphinx [3].

We built one AM for each language (AM1 - AM3). AM1 is trained with 8.7
hours of Romanian read speech. We could have trained the Romanian AM with
more data but, as stated in the introduction, we wanted balanced training
data across languages, in keeping with an under-resourced task. AM2 is
trained with 4.1 hours of Albanian read speech and broadcast news. AM3 is
trained with 3.9 hours of native English read speech from the standard TIMIT
database [5]. AM4 is trained with all the data from the three languages
above; phonemes that are common to different languages were trained
together, thus reducing the number of phonemes to 98. This was necessary to
keep uncertainty as low as possible during the recognition phase. The common
phonemes were identified based on the International Phonetic Alphabet (IPA)
classification [4]. Two types of speech features are used in this first
approach: the common Mel Frequency Cepstral Coefficients (MFCC) and the
Power Normalized Cepstral Coefficients (PNCC) [6].

Table 1. Training data for HMM approach
ID     Language                      No. phones   Training data [h]
AM1    Romanian                      34           8.7
AM2    Albanian                      36           4.1
AM3    English                       75           3.9
AM4    Multilingual common phones    98           16.7

For our second approach, we used the phone posteriorgrams output by the
robust phoneme recognizer from BUT. This phone recognizer uses split
temporal context (STC) based feature extraction, with neural network
classifiers [7], to output phone posteriorgrams, while the Viterbi algorithm
is used for phoneme string decoding. We use the output of this tool as input
features to our DTW search algorithm, to do the matching.

In order to use additional languages to build features for the phoneme
recognizer, we used the pre-trained systems available at [8] and described
in Table 2.

Table 2. Trained systems used for STC approach
ID     Language     No. phonemes   WER [%]
AM5    Czech        45             24.24
AM6    Hungarian    61             33.32
AM7    Russian      52             39.27

The languages used for training the systems described in Table 2 are from
the SpeechDat-E Eastern European Speech Database [9]. Another incentive to
use these systems is the existence of trained non-speech events mapped to
the following tokens, which should prove useful with this year's challenging
acoustic environment:
     “int” for intermittent noise
     “spk” for speaker noise
     “pau” for silent pause

The STC approach is based on the theoretical finding that significant
information about a phoneme is spread over a few hundred milliseconds, and
that an STC system can process two
parts of the phoneme independently. The trajectory representing a phoneme
feature can then be decorrelated by splitting it into two parts, to limit
the size of the model, in particular the number of weights in the neural
network (NN). The system uses two blocks of features, for the left and right
contexts (the blocks have a one-frame overlap). Before splitting, the speech
signal is filtered by applying a Hamming window on the whole block, so that
the original central frame is emphasized. The dimensions of the vectors are
then reduced by DCT and the results are sent to two neural networks. In the
final stage, the posteriors from both contexts are merged, after the
front-end neural networks have generated a three-state per phoneme posterior
model [10]. The features described above were used in this work as input to
our search algorithm, which is described in the following section.

1.2 Searching algorithm
In ideal conditions, if an ASR system produced transcriptions with 100%
accuracy, STD would reduce to a simple character string search of a query
within a textual content. As the experimental results show, we are far from
the ideal case, hence we have to find within the content a string which is
similar to the query.

Our proposed DTW String Search (DTWSS) uses Dynamic Time Warping to align a
string (the query) within the content. The search is not performed on the
entire content at once, but only on a part of it, by means of a sliding
window proportional to the length of the query. The term is considered
detected if the DTW score is above a threshold. This method is refined by
introducing a penalty for short queries and for the spread of the DTW match.
The formula for the score s is given by equation (1):

$$ s = (1 - \mathrm{PhER}) \left(1 + \alpha \frac{L_q - L_{Qm}}{L_{QM} - L_{Qm}}\right) \left(1 + \beta \frac{L_W - L_S}{L_Q}\right) \qquad (1) $$

where $L_q$ is the length of the query, $L_{QM} = 18$ and $L_{Qm} = 4$ are
the maximum and the minimum query lengths found in the development data set,
$L_W$ is the length of the sliding window, $L_S$ is the length of the
matched term in the content, while α and β are the tuning parameters. For
this task, α and β are set to 0.6, from previous evaluations [12]. The
penalties in formula (1) are motivated by the assumption that, for two
queries of different lengths that match their respective contents with the
same phone error rate (PhER), the match of the longer query is more likely
to be the right one. Similarly, more compact DTW matches are assumed to be
more probable than longer ones. This algorithm is suitable for queries of
type 1 and 2, because DTW inherently handles small variations from the
query, but it is not suitable for queries of type 3, where word order may be
inverted.

2. EXPERIMENTAL RESULTS

2.1 STD results
For our first comparison, the results obtained on the development database
with the first two types of features (MFCC and PNCC) are shown in Table 3.

Table 3. MFCC vs PNCC performance comparison
ID     MFCC                  PNCC
       ACnxe     MinCnxe     ACnxe     MinCnxe
AM1    1.0061    0.9944      1.0061    0.9943
AM2    1.0059    0.9947      1.0058    0.9947
AM3    1.0055    0.9944      1.0047    0.9933
AM4    1.0047    0.9933      1.0047    0.9933

For the third type of features, posteriorgrams with the STC approach, the
results are shown in Table 4.

Table 4. Posteriorgram performance results
ID     ACnxe     MinCnxe
AM5    1.0055    0.9945
AM6    1.0048    0.9935
AM7    1.0056    0.9941

All our runs used our proposed DTW algorithm, described in Section 1.2. The
metric used is the normalized cross entropy cost (Cnxe) [11], reported as
actual (ACnxe) and minimum (MinCnxe) values. The results show almost no
difference between the three types of features. Although PNCC should offer
better accuracy in noisy conditions and the BUT systems had tokens for
noise, this was not reflected in the results, and no substantial
improvements were obtained.

2.2 Official run results
The results obtained by the official runs on the evaluation database are
shown in Table 5. We selected the five best models, ranked by the actual
Cnxe metric. Because no tuning was done on the development data set, the
results on the evaluation data set are quite similar and the same
conclusions can be drawn. Table 5 also shows the results per query type.

Table 5. Official 2015 runs
ID      Overall    Type 1     Type 2     Type 3
        ACnxe      ACnxe      ACnxe      ACnxe
AM6     1.0375     1.0384     1.0370     1.0376
AM7     1.0379     1.0391     1.0384     1.0365
AM1     1.0379     1.0383     1.0372     1.0385
AM4*    1.0380     1.0386     1.0378     1.0379
AM4     1.0381     1.0388     1.0378     1.0379
*PNCC model

For all models except AM7, slightly better results are obtained for type 2
queries, possibly because these queries are longer. Posterior models
(AM6/AM7) seem to offer only minimal performance improvements, so we cannot
conclude that posteriors are better suited for the STD task in difficult
acoustic environments. Going further, we think improvements can be obtained
by preprocessing the audio before extracting features, to remove noise or
possible reverberation.

3. CONCLUSION
We have approached STD with a two-step process. A multilingual ASR is used
as a phone recognizer for indexing the database, while a DTW based algorithm
is used for searching a given query in the content database. We tested three
types of features (MFCC, PNCC and posteriorgrams) with two approaches to the
phoneme recognizer (statistical HMMs and a neural STC approach). The results
show no substantial difference between the approaches and feature types, in
part because this year's database features very challenging acoustic
environments, and our phonetizers return a lot of “noise” tokens or repeated
phonemes, which in turn affected our DTW algorithm.

4. ACKNOWLEDGEMENTS
The work has been funded by the Sectoral Operational Programme Human
Resources Development 2007-2013 of the Ministry of European Funds through
the Financial Agreement POSDRU/159/1.5/S/132395 and in part by the PN II
Programme "Partnerships in priority areas" of MEN - UEFISCDI, through
project no. 332/2014.
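
As an illustration of the DTWSS scoring in Section 1.2, the following is a minimal Python sketch of equation (1) applied over a sliding window. It is not the authors' implementation, and several points are assumptions: the alignment uses a unit-cost edit distance with free start and end inside the window as a stand-in for the DTW local cost, PhER is taken as that distance divided by the query length, the denominator L_Q of the second factor is read as the query length, and the function names and the `window_factor` value of 1.5 are hypothetical.

```python
from typing import Sequence, Tuple

ALPHA, BETA = 0.6, 0.6    # tuning parameters reported in the paper
LQ_MIN, LQ_MAX = 4, 18    # min/max query lengths on the development set


def align(query: Sequence[str], window: Sequence[str]) -> Tuple[float, int]:
    """Align the whole query to the best-matching substring of the window.

    Returns (PhER, L_S). Unit edit costs are an assumption of this sketch.
    """
    m, n = len(query), len(window)
    dist = [0] * (n + 1)          # an empty query matches anywhere at no cost
    start = list(range(n + 1))    # where the matched substring begins
    for i in range(1, m + 1):
        new_dist, new_start = [i] + [0] * n, [0] * (n + 1)
        for j in range(1, n + 1):
            candidates = [
                # match or substitution
                (dist[j - 1] + (query[i - 1] != window[j - 1]), start[j - 1]),
                # query phone missing from the content
                (dist[j] + 1, start[j]),
                # extra phone in the content
                (new_dist[j - 1] + 1, new_start[j - 1]),
            ]
            # lowest cost first; on ties prefer the later start (shorter match)
            new_dist[j], new_start[j] = min(candidates, key=lambda c: (c[0], -c[1]))
        dist, start = new_dist, new_start
    end = min(range(n + 1), key=lambda j: (dist[j], j - start[j]))
    return dist[end] / m, end - start[end]


def score(pher: float, lq: int, lw: int, ls: int) -> float:
    """Equation (1), assuming L_Q in the second factor is the query length."""
    length_factor = 1 + ALPHA * (lq - LQ_MIN) / (LQ_MAX - LQ_MIN)
    spread_factor = 1 + BETA * (lw - ls) / lq
    return (1 - pher) * length_factor * spread_factor


def search(query: Sequence[str], content: Sequence[str],
           window_factor: float = 1.5) -> Tuple[float, int]:
    """Slide a window proportional to the query length; return (best score, position)."""
    lq = len(query)
    lw = min(len(content), max(lq, round(window_factor * lq)))
    best = (float("-inf"), -1)
    for pos in range(len(content) - lw + 1):
        pher, ls = align(query, content[pos:pos + lw])
        best = max(best, (score(pher, lq, lw, ls), pos))
    return best
```

A query would then be reported as detected when the returned score exceeds a threshold; the threshold value is not given in the paper and would have to be chosen on development data.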
5. REFERENCES
[1] I. Szöke, L.-J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze,
    J. Proença, M. Lojka, and X. Xiong, "Query by example search on speech
    at MediaEval 2015," Working Notes Proceedings of the MediaEval 2015
    Workshop, Sept. 14-15, 2015, Wurzen, Germany, 2015.
[2] P. Schwarz, P. Matejka, J. Cernocky, "Towards Lower Error Rates in
    Phoneme Recognition," in Proc. TSD2004, Brno, Czech Republic, 2004.
[3] CMU Sphinx, An open source toolkit for speech recognition, Carnegie
    Mellon University, http://cmusphinx.sourceforge.net/, accessed August
    2015.
[4] The International Phonetic Alphabet and the IPA Chart,
    https://www.internationalphoneticassociation.org/, accessed August 2015.
[5] J.S. Garofolo, et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus,
    Linguistic Data Consortium, Philadelphia, 1993.
[6] F. Kelly, N. Harte, "A comparison of auditory features for robust speech
    recognition," in Proc. of 18th European Signal Processing Conference
    (EUSIPCO-2010).
[7] P. Schwarz, Phoneme Recognition based on Long Temporal Context, PhD
    Thesis, Brno University of Technology, 2009.
[8] Phoneme recognizer based on long temporal context,
    http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context,
    accessed August 2015.
[9] Eastern European Speech Databases for Creation of Voice Driven
    Teleservices, http://www.fee.vutbr.cz/SPEECHDAT-E/, accessed August
    2015.
[10] P. Schwarz, P. Matejka, J. Cernocky, "Hierarchical Structures of Neural
     Networks for Phoneme Recognition," in Proc. ICASSP 2006, pp. 325-328,
     Toulouse, France, 2006.
[11] L.-J. Rodriguez-Fuentes and M. Penagarikano, MediaEval 2013 Spoken Web
     Search Task: System Performance Measures, Technical report, GTTS,
     UPV/EHU, May 2013.
[12] A. Buzo, H. Cucu, I. Molnar, B. Ionescu, C. Burileanu, "SpeeD@MediaEval
     2013: A Phone Recognition Approach to Spoken Term Detection," in Proc.
     MediaEval 2013 Workshop, Barcelona, Spain, 2013.