NTU System at MediaEval 2015: Zero Resource Query by Example Spoken Term Detection Using Deep and Recurrent Neural Networks

Cheng-Tao Chung
Graduate Institute of Electrical Engineering, National Taiwan University
b97901182@gmail.com

Yang-de Chen
Graduate Institute of Communication Engineering, National Taiwan University
yongde0108@gmail.com

ABSTRACT
This note documents the methods the authors implemented for the Query by Example Search on Speech Task (QUESST) at MediaEval 2015. In this work, we combined DTW, DNN and RNN in one framework to perform query by example spoken term detection in a zero resource setting.

1. INTRODUCTION
Participants of the task were asked to implement a query by example spoken term detection system on a corpus provided by the organizers. The queries were divided into a development set and an evaluation set, and the list of correct documents is given for the development queries. A soft score and a hard decision for every query-document pair in the evaluation set had to be provided. Note that in this task, only whether or not the query appears in the document is considered; when the query appears in the document is not important. For more information, please refer to the overview paper [1].

In this work, we approached the task in a zero resource setting [2] using neural networks. This means we did not use any information other than the corpus itself. For the task considered here, we need to formulate an objective that compares the two feature sequences of a query and a document, both of varying length, and returns a score. The Deep Neural Network (DNN) [3] is a state-of-the-art architecture that has been widely applied in speech recognition. However, it is limited to framewise objectives, where the length of the input feature has to be fixed. Hence we need to address two issues in this work: dealing with the varying sequence lengths and the formulation of a sequence objective.

Dynamic Time Warping (DTW) [4] is one of the earliest techniques applied in the field; it finds the alignment of two sequences, hence transforming both sequences into feature representations of the same length. DTW thus solves the problem of varying sequence length. On the other hand, we use a Recurrent Neural Network (RNN) [5] to generate a sequence objective for the query-document pair. In this work, we combine DTW, DNN and RNN in one framework.

2. OBJECTIVE AND APPROACH
Let the feature sequence of an utterance be denoted as X in R^{S x F}, where S denotes the length of the sequence and F denotes the number of feature dimensions. We extract 39-dimensional MFCCs with energy, deltas and double deltas using HTK [6] as our features in this work. By performing sub-sequence Dynamic Time Warping on the feature sequence X_d of a document and the feature sequence X_q of a query, we can find the aligned warping sequences W_d in R^{T x F} in the document and W_q in R^{T x F} in the query, where T denotes the length of the warping sequences:

    W_d, W_q = DTW(X_d, X_q).                                    (1)

We forward the feature frames of both W_d and W_q through the same deep neural network. The numbers of neurons in the DNN are 39, 100, 100 and 39 on each layer from input to output, and we use tanh as the activation function of the network:

    H_d = DNN(W_d),                                              (2)
    H_q = DNN(W_q).                                              (3)

The output features of the final layer of the DNN are concatenated and then forwarded to a recurrent neural network. The number of neurons in the hidden layer of the RNN is 50, and the sigmoid function is used as the activation function. The output of the RNN at time frame t is a single score s_t(d,q); s(d,q) is the average of the scores along the entire sequence, where T is the length of the sequence:

    RNN([H_d, H_q]) = [s_0(d,q), s_1(d,q), ..., s_{T-1}(d,q)],   (4)

    s(d,q) = (1/T) * sum_t s_t(d,q).                             (5)

For every query q, the score of a positive document d_p containing q should be high, and the score of a negative document d_n not containing q should be low. Therefore, the following is the objective we wish to minimize:

    L_q = sum_{p,n} ( s(d_n,q) - s(d_p,q) ).                     (6)

The final objective we wish to minimize is the sum of the objectives over all queries:

    L = sum_q sum_{p,n} ( s(d_n,q) - s(d_p,q) ).                 (7)

We train the entire network, including both the DNN and the RNN, with the back-propagation algorithm. We take s(d,q) as the score for the document-query pair (d,q), and a query is considered to be in a document if s(d,q) > 0.5.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

3. EXPERIMENTS AND RESULTS
The entire network was trained using only the QUESST 2015 corpus. We derived two sets of scores from the method above. The first set consists of the DTW scores generated when we initially align the features of query and document. The second set consists of the scores generated by the RNN in Equation (4). We did not perform any pretraining on the network and used random initialization for all the weights. The neural network was implemented using the Theano library [7]. Positive query-document pairs were selected from all query types (T1, T2, T3) in the development set; negative examples were randomly generated query-document pairs.

The results of our experiments are shown in Table 1. We report only the actual normalized cross entropy (Cnxe).

Table 1: Actual Cnxe results
set   method   ALL      T1       T2       T3
dev   dtw      2.0066   2.0064   2.0077   2.0055
dev   rnn      2.0066   2.0064   2.0077   2.0055
eval  dtw      2.0067   2.0070   2.0093   2.0029
eval  rnn      2.0067   2.0070   2.0093   2.0029

From the results, we see that the RNN did not perform better than the DTW, and neither system seemed to have performed well. This could be due to an error in the implementation or an insufficient number of training epochs for the RNN.
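The DTW-DNN-RNN scoring pipeline of Equations (1)-(5) can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: it uses plain DTW with Euclidean frame distance rather than the sub-sequence variant, randomly initialized (untrained) weights, and hypothetical helper names (`dtw_align`, `dnn_forward`, `rnn_score`).

```python
import numpy as np

rng = np.random.default_rng(0)

def dtw_align(Xq, Xd):
    # Plain DTW with Euclidean frame distance (the paper uses the
    # sub-sequence variant; plain DTW is used here for brevity).
    Sq, Sd = len(Xq), len(Xd)
    D = np.full((Sq + 1, Sd + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Sq + 1):
        for j in range(1, Sd + 1):
            cost = np.linalg.norm(Xq[i - 1] - Xd[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the warping path, then read off the
    # equal-length warped sequences W_q and W_d of Eq. (1).
    i, j, path = Sq, Sd, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    Wq = np.stack([Xq[a] for a, _ in path])  # T x F
    Wd = np.stack([Xd[b] for _, b in path])  # T x F
    return Wq, Wd

def dnn_forward(W, layers):
    # Shared 39-100-100-39 tanh network applied frame by frame (Eqs. 2-3).
    H = W
    for A, b in layers:
        H = np.tanh(H @ A + b)
    return H

def rnn_score(Hq, Hd, Wx, Wh, w_out):
    # Simple RNN over the concatenated DNN outputs: sigmoid activations,
    # 50 hidden units, one scalar score per frame, averaged over time
    # (Eqs. 4-5).
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.concatenate([Hq, Hd], axis=1)  # T x 78
    h = np.zeros(Wh.shape[0])
    scores = []
    for t in range(len(x)):
        h = sigmoid(x[t] @ Wx + h @ Wh)
        scores.append(sigmoid(h @ w_out))
    return float(np.mean(scores))  # s(d, q)

# Randomly initialized parameters, as in the paper (no pretraining).
dims = [39, 100, 100, 39]
dnn = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
       for a, b in zip(dims[:-1], dims[1:])]
Wx = 0.1 * rng.standard_normal((78, 50))
Wh = 0.1 * rng.standard_normal((50, 50))
w_out = 0.1 * rng.standard_normal(50)

Xq = rng.standard_normal((20, 39))  # stand-in query MFCC sequence
Xd = rng.standard_normal((60, 39))  # stand-in document MFCC sequence
Wq, Wd = dtw_align(Xq, Xd)
s = rnn_score(dnn_forward(Wq, dnn), dnn_forward(Wd, dnn), Wx, Wh, w_out)
print("s(d, q) =", round(s, 3))
```

With trained weights, the decision rule from Section 2 would then flag a match whenever s(d, q) > 0.5.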
Since the results of the RNN were based on the results of the DTW, it is unclear what caused the problem.

4. CONCLUSION
The authors attempted to combine DNN, RNN and DTW in a single zero resource neural network framework for query by example spoken term detection.

5. SUPPLEMENTARY MATERIAL
Since we were encouraged to discuss other systems that we tried, we include several versions of our system from different development iterations. All of these systems except the correspondence autoencoder were implemented, though not all were evaluated on the corpus. The philosophy behind all of these designs was to map query-document pairs to a single trainable objective so that the error back-propagates through the entire network.

Figure 1: Other systems that we've tried.

In the first attempt, we learned a feature transformation on the acoustic features (MFCCs) using a DNN. The objective of the DNN was the warping distance after DTW, and the error of the DTW was back-propagated into the network. This design was abandoned due to unreasonably slow computation: the DNN is GPU intensive while DTW is CPU intensive, making this hybrid system hard to implement using Theano.

In the second attempt, we replaced DTW with Convolutional Neural Networks (CNN) [8]. CNNs are GPU-friendly architectures, which fixes the problem of the previous hybrid system. The 2D feature representation of every utterance was treated as an image. The acoustic features of different queries/documents were padded with zero vectors to the same length. This feature transformation CNN has only convolutional layers. The end result of the first CNN was another 2D feature representation with one axis being time. We treated the features at its output as if they were acoustic features and computed the warping matrix (pairwise cosine similarity). However, instead of applying DTW, we treated the warping matrix itself as another image and forwarded it through another CNN. The target of this second CNN was whether or not the document contains the query. This design did not work because the error on the testing set did not converge, perhaps due to serious over-fitting.

In the third attempt, we removed the second CNN to reduce the number of parameters. The first CNN has a fully connected layer in this design, and we took the inner product of the fully connected layers of the document and the query as the error. Although the number of parameters was reduced, the error on the testing set still did not converge. Perhaps the number of correct training pairs was simply not enough to train such systems.

Finally, we decided to use correspondence autoencoders [9] for this task. The plan was to use the correspondence autoencoder as a zero-resource feature extractor, and then perform another DTW on the extracted features. Although this system contains both DTWs and DNNs, they are not jointly trained, so the performance problem of the first design does not occur. However, DTW is an extremely time-consuming operation compared to an RNN. We ran out of time to perform the second DTW, so we decided to replace it with an RNN, which is the system that we submitted in the end.

6. REFERENCES
[1] Igor Szoke, Luis J. Rodriguez-Fuentes, Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca, Martin Lojka, and Xiao Xiong. Query by example search on speech at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[2] Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, et al. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. 2013.
[3] Li Deng, Geoffrey Hinton, and Brian Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8599-8603. IEEE, 2013.
[4] Meinard Müller. Dynamic time warping. In Information Retrieval for Music and Motion, pages 69-84, 2007.
[5] Tony Robinson, Mike Hochberg, and Steve Renals. The use of recurrent neural networks in continuous speech recognition. In Automatic Speech and Speaker Recognition, pages 233-258. Springer, 1996.
[6] Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. The HTK Book, volume 2. Entropic Cambridge Research Laboratory, Cambridge, 1997.
[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3, Austin, TX, 2010.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[9] Herman Kamper, Micha Elsner, Aren Jansen, and Sharon Goldwater. Unsupervised neural network based feature extraction using weak top-down constraints.