NTU System at MediaEval 2015: Zero Resource Query by Example Spoken Term Detection Using Deep and Recurrent Neural Networks

Cheng-Tao Chung
Graduate Institute of Electrical Engineering, National Taiwan University
b97901182@gmail.com

Yang-de Chen
Graduate Institute of Communication Engineering, National Taiwan University
yongde0108@gmail.com

ABSTRACT
This note documents the methods the authors implemented for the Query by Example Search on Speech Task (QUESST) at MediaEval 2015. In this work, we combined DTW, DNN and RNN in one framework to perform query by example spoken term detection in a zero resource setting.

1. INTRODUCTION
Participants of the task were asked to implement a query by example spoken term detection system on a corpus provided by the organizers. The queries were divided into a development set and an evaluation set, and the list of correct documents is given for the development queries. A soft score and a hard decision for every query-document pair in the evaluation set had to be provided. Note that in this task, only whether or not the query appears in the document is considered; when the query appears in the document is not important. For more information, please refer to the overview paper [1].

In this work, we approached the task in a zero resource setting [2] using neural networks. This means we did not use any information other than the corpus itself. For the task considered here, we need to formulate an objective that compares the two feature sequences of a query and a document, both of varying length, and returns a score. The Deep Neural Network (DNN) [3] is a state-of-the-art architecture that has been widely applied in speech recognition. However, it is limited to framewise objectives, where the length of the input feature has to be fixed. Hence we need to address two issues in this work: dealing with the varying sequence lengths and the formulation of a sequence objective.

Dynamic Time Warping (DTW) [4] is one of the earliest techniques applied in the field; it finds the alignment of two sequences, hence transforming both sequences into feature representations of the same length. DTW thus solves the problem of varying sequence length. On the other hand, we use a Recurrent Neural Network (RNN) [5] to generate a sequence objective for the query-document pair. In this work, we combine DTW, DNN and RNN in one framework.

2. OBJECTIVE AND APPROACH
Let the feature sequence of an utterance be denoted as X in R^{S x F}, where S denotes the length of the sequence and F denotes the number of feature dimensions. We extract 39-dimensional MFCCs with energy, deltas and double deltas using HTK [6] as our features in this work. By performing sub-sequence Dynamic Time Warping on the feature sequence X_d of a document and the feature sequence X_q of a query, we can find the aligned warping sequences W_d in R^{T x F} in the document and W_q in R^{T x F} in the query, where T denotes the length of the warping sequences:

    W_d, W_q = DTW(X_d, X_q).                                    (1)

We forward the feature frames of both W_d and W_q through the same deep neural network. The numbers of neurons in the DNN are 39, 100, 100 and 39 on each layer from input to output, and we use tanh as the activation function of the network:

    H_d = DNN(W_d),                                              (2)
    H_q = DNN(W_q).                                              (3)

The output features of the final layer of the DNN are concatenated and then forwarded to a recurrent neural network. The number of neurons in the hidden layer of the RNN is 50, and the sigmoid function is used as the activation function. The output of the RNN at time frame t is a single score s_t(d,q); s(d,q) is the average of the scores along the entire sequence, where T is the length of the sequence:

    RNN([H_d, H_q]) = [s_0(d,q), s_1(d,q), ..., s_{T-1}(d,q)],   (4)

    s(d,q) = (1/T) * sum_t s_t(d,q).                             (5)

For every query q, the score of a positive document d_p containing q should be high, and the score of a negative document d_n not containing q should be low. Therefore, the following is the objective we wish to minimize:

    L_q = sum_{p,n} ( s(d_n,q) - s(d_p,q) ).                     (6)

The final objective we wish to minimize is the sum of the objectives over all queries:

    L = sum_q sum_{p,n} ( s(d_n,q) - s(d_p,q) ).                 (7)

We train the entire network, including both the DNN and the RNN, with the back-propagation algorithm. We take s(d,q) as the score for the document-query pair (d,q), and a query is considered to be in a document if s(d,q) > 0.5.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

3. EXPERIMENTS AND RESULTS
The entire network was trained using only the QUESST 2015 corpus. We derived two sets of scores from the method above. The first set consists of the DTW scores generated when we initially align the features of query and document. The second set consists of the scores generated by the RNN in Equation (4). We did not perform any pretraining on the network and used random initialization for all the weights. The neural network was implemented using the Theano library [7]. Positive query-document pairs were selected from all query types (T1, T2, T3) in the development set; negative examples were randomly generated query-document pairs.

The results of our experiments are shown in Table 1. We report only the actual normalized cross entropy (Cnxe).

Table 1: Actual Cnxe results
set   method   ALL      T1       T2       T3
dev   dtw      2.0066   2.0064   2.0077   2.0055
dev   rnn      2.0066   2.0064   2.0077   2.0055
eval  dtw      2.0067   2.0070   2.0093   2.0029
eval  rnn      2.0067   2.0070   2.0093   2.0029

From the results, we see that the RNN did not perform better than the DTW, and neither system seemed to have performed well. This could be due to an error in the implementation or an insufficient number of training epochs for the RNN.
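The DTW-DNN-RNN scoring pipeline of Equations (1)-(5) can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: it uses plain DTW with Euclidean frame distance rather than the sub-sequence variant, randomly initialized (untrained) weights, and hypothetical helper names (`dtw_align`, `dnn_forward`, `rnn_score`).

```python
import numpy as np

rng = np.random.default_rng(0)

def dtw_align(Xq, Xd):
    # Plain DTW with Euclidean frame distance (the paper uses the
    # sub-sequence variant; plain DTW is used here for brevity).
    Sq, Sd = len(Xq), len(Xd)
    D = np.full((Sq + 1, Sd + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Sq + 1):
        for j in range(1, Sd + 1):
            cost = np.linalg.norm(Xq[i - 1] - Xd[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the warping path, then read off the
    # equal-length warped sequences W_q and W_d of Eq. (1).
    i, j, path = Sq, Sd, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    Wq = np.stack([Xq[a] for a, _ in path])  # T x F
    Wd = np.stack([Xd[b] for _, b in path])  # T x F
    return Wq, Wd

def dnn_forward(W, layers):
    # Shared 39-100-100-39 tanh network applied frame by frame (Eqs. 2-3).
    H = W
    for A, b in layers:
        H = np.tanh(H @ A + b)
    return H

def rnn_score(Hq, Hd, Wx, Wh, w_out):
    # Simple RNN over the concatenated DNN outputs: sigmoid activations,
    # 50 hidden units, one scalar score per frame, averaged over time
    # (Eqs. 4-5).
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.concatenate([Hq, Hd], axis=1)  # T x 78
    h = np.zeros(Wh.shape[0])
    scores = []
    for t in range(len(x)):
        h = sigmoid(x[t] @ Wx + h @ Wh)
        scores.append(sigmoid(h @ w_out))
    return float(np.mean(scores))  # s(d, q)

# Randomly initialized parameters, as in the paper (no pretraining).
dims = [39, 100, 100, 39]
dnn = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
       for a, b in zip(dims[:-1], dims[1:])]
Wx = 0.1 * rng.standard_normal((78, 50))
Wh = 0.1 * rng.standard_normal((50, 50))
w_out = 0.1 * rng.standard_normal(50)

Xq = rng.standard_normal((20, 39))  # stand-in query MFCC sequence
Xd = rng.standard_normal((60, 39))  # stand-in document MFCC sequence
Wq, Wd = dtw_align(Xq, Xd)
s = rnn_score(dnn_forward(Wq, dnn), dnn_forward(Wd, dnn), Wx, Wh, w_out)
print("s(d, q) =", round(s, 3))
```

With trained weights, the decision rule from Section 2 would then flag a match whenever s(d, q) > 0.5.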
Since the results of the RNN were based on the results of the DTW, it is unclear what caused the problem.

4. CONCLUSION
The authors attempted to combine DNN, RNN and DTW in a single zero resource neural network framework for query by example spoken term detection.

5. SUPPLEMENTARY MATERIAL
Since we were encouraged to discuss other systems that we tried, we include several versions of our system from different development iterations. All of these systems except the correspondence autoencoder were implemented, though not all were evaluated on the corpus. The philosophy behind all of these designs was to map query-document pairs to a single trainable objective so that the error back-propagates through the entire network.

Figure 1: Other systems that we've tried.

In the first attempt, we learned a feature transformation on the acoustic features (MFCCs) using a DNN. The objective of the DNN was the warping distance after DTW, and the error of the DTW was back-propagated into the network. This design was abandoned due to unreasonably slow computation: the DNN is GPU intensive while DTW is CPU intensive, making this hybrid system hard to implement using Theano.

In the second attempt, we replaced DTW with Convolutional Neural Networks (CNN) [8]. CNNs are GPU-friendly architectures, which fixes the problem of the previous hybrid system. The 2D feature representation of every utterance was treated as an image. The acoustic features of different queries/documents were padded with zero vectors to the same length. This feature transformation CNN has only convolutional layers. The end result of the first CNN was another 2D feature representation with one axis being time. We treated the features at its output as if they were acoustic features and computed the warping matrix (pairwise cosine similarity). However, instead of applying DTW, we treated the warping matrix itself as another image and forwarded it through another CNN. The target of this second CNN was whether or not the document contains the query. This design did not work because the error on the testing set did not converge, perhaps due to serious over-fitting.

In the third attempt, we removed the second CNN to reduce the number of parameters. The first CNN has a fully connected layer in this design, and we took the inner product of the fully connected layers of the document and the query as the error. Although the number of parameters was reduced, the error on the testing set still did not converge. Perhaps the number of correct training pairs was simply not enough to train such systems.

Finally, we decided to use correspondence autoencoders [9] for this task. The plan was to use the correspondence autoencoder as a zero-resource feature extractor, and then perform another DTW on the extracted features. Although this system contains both DTWs and DNNs, they are not jointly trained, so the performance problem of the first design does not occur. However, DTW is an extremely time-consuming operation compared to an RNN. We ran out of time to perform the second DTW, so we decided to replace it with an RNN, which is the system that we submitted in the end.

6. REFERENCES
[1] Igor Szoke, Luis J. Rodriguez-Fuentes, Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca, Martin Lojka, and Xiao Xiong. Query by example search on speech at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[2] Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, et al. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. 2013.
[3] Li Deng, Geoffrey Hinton, and Brian Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8599-8603. IEEE, 2013.
[4] Meinard Müller. Dynamic time warping. In Information Retrieval for Music and Motion, pages 69-84, 2007.
[5] Tony Robinson, Mike Hochberg, and Steve Renals. The use of recurrent neural networks in continuous speech recognition. In Automatic Speech and Speaker Recognition, pages 233-258. Springer, 1996.
[6] Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. The HTK Book, volume 2. Entropic Cambridge Research Laboratory, Cambridge, 1997.
[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3, Austin, TX, 2010.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[9] Herman Kamper, Micha Elsner, Aren Jansen, and Sharon Goldwater. Unsupervised neural network based feature extraction using weak top-down constraints.