<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUNY Systems for the Query-by-Example Search on Speech Task at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Min Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Rosenberg</string-name>
          <email>andrew@cs.qc.cuny.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, CUNY Graduate Center</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Queens College (CUNY)</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes two query-by-example systems developed by the Speech Lab at Queens College (CUNY). Our systems aimed to return quick search results from the selected reference files. Three phonetic recognizers (Czech, Hungarian and Russian) were utilized to obtain phoneme sequences of both query and reference speech files. Each query sequence was compared with all the reference sequences using both global and local aligners. In the first system, we predicted the most probable reference files based on the sequence alignment results; in the second system, we extracted the subsequences of the reference sequences that yielded the best local symbolic alignments, then computed 39-dimension MFCC features for both the query and the subsequences. Both systems employed an optimized DTW, and obtained Cnxe of 0.9989 and 1.0674 on the test data, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The primary goal of query-by-example search is to detect the
occurrence of query audio in an unlabeled reference speech
corpus. This year, the task emphasized probabilistic
decisions about query occurrence, ignoring the exact start
and end times of hit regions in the reference audio.
Moreover, three different types of query searches were used in this
year's evaluation in addition to exact matches of query
audio: 1) word-level re-orderings, 2) small filler content and 3)
conversational queries in context. A detailed description can
be found in [1]. We therefore targeted comprehensive search
systems that are compatible with queries of different
types and capable of outputting both strict and approximate
matches. Our systems took advantage of phonetic symbolic
sequences as well as MFCC features, and employed a DTW
algorithm with novel optimizations. Note that all the queries
were manually recorded with their acoustic context, in order
to avoid any problems when cutting the queries from a longer
sentence. However, this context might interfere with retrieval
performance. Thus, we filtered out the exact spoken queries
to be searched in our systems, using the timing
boundaries of the relevant words provided.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PHONETIC TOKENIZER</title>
      <p>Unlike resource-rich languages, low-resource languages
remain challenging for automatic speech recognition due
to the lack of adequate training data. Without sufficient
data for language modeling, or reliable pronunciation
modeling, we instead decode the audio into a phoneme sequence.
We utilized three phonetic tokenizers (Czech, Hungarian and
Russian) released by Brno University of Technology (BUT)
to generate the phoneme transcripts [2]. All BUT
recognizers use the split temporal context network structure. All
query and reference audio files were decoded by the three
tokenizers, and non-phonetic symbols (pau, int, spk) were
removed from the final sequences.</p>
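      <p>As an illustration, the post-processing of a decoded sequence can be sketched as follows (a minimal Python sketch; the non-phonetic symbol names come from the list above, everything else is our own illustration):</p>

```python
# Hypothetical post-processing of a BUT tokenizer output: drop the
# non-phonetic symbols (pau, int, spk) before alignment.
NON_PHONETIC = {"pau", "int", "spk"}

def clean_sequence(phonemes):
    """Remove non-phonetic symbols from a decoded phoneme sequence."""
    return [p for p in phonemes if p not in NON_PHONETIC]

decoded = ["pau", "u", "i", "spk", "k", "pau"]
print(clean_sequence(decoded))  # ['u', 'i', 'k']
```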
    </sec>
    <sec id="sec-3">
      <title>3. SEQUENCE ALIGNMENT</title>
      <p>Given the varying acoustic and voice conditions of the audio
and the error rate of the phonetic tokenizers, the phonetic sequences
from the query and the actual hit region in the reference
audio may not always be exactly the same, but they should be
similar. Thus we employed three aligners to match phoneme
sequences: 1) a strict global aligner, 2) a (fuzzy) global aligner and
3) a local aligner, where 1) was designed for exact matches
and 2) and 3) for inexact matches. Every aligner used
Levenshtein edit distance and backtracking to find the optimal
matching parts of each sequence pair, yielding a matching
score and a similarity percentage. We performed alignment
for all the sequences decoded by the different tokenizers, thus
obtaining 9 matching scores and 9 similarity percentages for
each sequence pair.</p>
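      <p>A minimal sketch of the symbolic matching step, assuming a plain Levenshtein distance and a similarity percentage normalized by the longer sequence (the exact scoring and backtracking details of our aligners may differ):</p>

```python
def levenshtein(a, b):
    """Edit distance between two symbol sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Similarity percentage: 1 - normalized edit distance."""
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b), 1))

q = ["u", "i", "k"]
ref = ["u", "i", "g", "k"]
print(levenshtein(q, ref), round(similarity(q, ref), 1))  # 1 75.0
```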
    </sec>
    <sec id="sec-4">
      <title>4. SYSTEM 1: SMO+ISAX</title>
      <p>In our first system, we investigated how to use global and
local symbolic similarities to identify the best possible
reference files for DTW to search.</p>
      <p>Binary Classifier: We aggregated all the scores and
percentages computed in Section 3 for each query-reference
(q-ref) pair, then assigned "CORR" ("correct", if the query
occurred in the reference audio) or "FA" ("false alarm") labels for
training queries. The data set showed strong between-class
imbalance: "FA" samples were 1,365 times more numerous than
"CORR" instances. Since most classification algorithms do
not perform optimally when class examples are extremely
imbalanced, we applied clustering centroid undersampling
[3], a technique proven effective for this issue, to our
training data. We then trained a support vector machine
using sequential minimal optimization (SMO) [4] and
evaluated it using four-fold cross validation on the training data.
Another SMO model was trained over the entire training set and
tested on test queries to make "CORR"/"FA" predictions.
Limited by time, for each query we narrowed the DTW search
scope to the top 50 reference files that obtained high
classification scores.</p>
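      <p>The undersampling idea can be sketched as follows (an illustrative numpy re-implementation of clustering centroid undersampling, not the exact setup of [3]; the data, cluster count and iteration budget are hypothetical):</p>

```python
import numpy as np

def centroid_undersample(majority, k, iters=20, seed=0):
    """Replace majority-class points with the centroids of k clusters
    found by a few rounds of naive k-means."""
    rng = np.random.default_rng(seed)
    centroids = majority[rng.choice(len(majority), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(majority[:, None, :] - centroids[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            members = majority[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# hypothetical imbalanced data: many "FA" points vs. 10 "CORR" points
rng = np.random.default_rng(1)
fa = rng.normal(size=(1000, 2))
corr = np.zeros((10, 2))
balanced_fa = centroid_undersample(fa, k=len(corr))
print(balanced_fa.shape)  # (10, 2)
```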
      <p>Sequence Mapping: DTW is an excellent search
algorithm, especially on small data sets [5]. Therefore, after
using the edit distance to get rough decisions (i.e., to shrink
the scope of candidates), we applied DTW to make more
accurate decisions as to whether a spoken query occurs in the
candidate audio segment. We utilized indexable iSAX [6],
a method that uses symbolic words to internally
organize and index the data, and supports DTW by building
bounding envelopes around the query time series. Following
[6], we mapped the phoneme sequences using the approach
shown in Table 1. The phoneme vocabulary was constructed
from all the phonemes in the query and reference sequences. To
simplify the mapping algorithm, shown in Table 1, a "null"
symbol was appended to the end of the alphabet to ensure the
total number of symbols is even. In this way, we converted phoneme
sequences to integer time series, in order to apply the DTW
described in Section 6.</p>
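      <p>The sequence-mapping step can be sketched as follows (a minimal Python illustration of the idea; the actual mapping algorithm is the one in Table 1):</p>

```python
# Build the phoneme vocabulary, pad it with a "null" symbol to an even
# size, and map each phoneme to its integer index so that a phoneme
# sequence becomes an integer time series.
def build_vocab(sequences):
    vocab = sorted({p for seq in sequences for p in seq})
    if len(vocab) % 2:            # pad to an even alphabet size
        vocab.append("null")
    return vocab

def to_time_series(seq, vocab):
    index = {p: i for i, p in enumerate(vocab)}
    return [index[p] for p in seq]

seqs = [["u", "i", "k"], ["u", "i", "g", "k"]]
vocab = build_vocab(seqs)                      # ['g', 'i', 'k', 'u']
print(to_time_series(["u", "i", "k"], vocab))  # [3, 1, 2]
```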
    </sec>
    <sec id="sec-5">
      <title>5. SYSTEM 2: SUBSEQ+MFCC</title>
      <p>In our second system, phoneme sequence alignments were
used to determine the subsequences of reference files that
best matched the query sequence. We then applied DTW
over each subsequence-query pair (q-subref) by comparing
their MFCC features, acoustic features that have
been widely used for this task.</p>
      <p>Subsequence Detection: We approximately
estimated the start and end time points of the subsequence
that was likely to be an occurrence of the query, only for the
reference files that achieved a high average similarity from the
local aligners. More specifically, we selected the top 100
q-subref pairs for each query and extracted the hypothesized
matched regions from the reference audio segments. For
example, given a query sequence "u i k", only the
regions transcribed as "u i k" would be searched.
Therefore, using this symbolic matching as a lookup, we only need to
make DTW comparisons between the query sequence and a
smaller local region of the reference segment.</p>
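      <p>The lookup can be sketched as follows (a simplified Python illustration that assumes exact symbol matches; in our system the candidate regions came from the fuzzy local aligner):</p>

```python
def find_subsequence_regions(query, reference):
    """Return (start, end) index pairs where query occurs in reference."""
    n, m = len(reference), len(query)
    return [(i, i + m) for i in range(n - m + 1)
            if reference[i:i + m] == query]

# only these two regions would then be compared with the query via DTW
ref = ["a", "u", "i", "k", "o", "u", "i", "k"]
print(find_subsequence_regions(["u", "i", "k"], ref))  # [(1, 4), (5, 8)]
```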
      <p>MFCC Feature Extraction: We extracted 39-dimension
MFCC features for both query and subsequence speech files
using OpenSMILE [7]. The features were produced from
25 ms audio frames (sampled at a rate of 10 ms) by
computing 13 MFCCs (0-12) from 26 Mel-frequency bands and
applying a cepstral liftering filter with a weight parameter
of 22, with 13 delta and 13 acceleration coefficients appended
to the MFCCs. Finally, we mean-normalized all
the features with respect to the full input sequence.</p>
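      <p>Two of the post-processing steps above, liftering with weight parameter 22 and appending delta/acceleration coefficients, can be sketched in numpy as follows (our own re-implementation, not the OpenSMILE code; the delta scheme here is a simplified gradient variant):</p>

```python
import numpy as np

def lifter_weights(n_cep=13, L=22):
    """Cepstral liftering weights w[n] = 1 + (L/2) * sin(pi * n / L)."""
    n = np.arange(n_cep)
    return 1 + (L / 2) * np.sin(np.pi * n / L)

def add_deltas(mfcc):
    """Append first- and second-order frame differences (simplified)."""
    delta = np.gradient(mfcc, axis=0)
    accel = np.gradient(delta, axis=0)
    return np.hstack([mfcc, delta, accel])   # 13 -> 39 dimensions

frames = np.random.default_rng(0).normal(size=(100, 13))  # fake MFCCs
feats = add_deltas(frames * lifter_weights())
feats -= feats.mean(axis=0)   # mean normalization over the full file
print(feats.shape)  # (100, 39)
```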
    </sec>
    <sec id="sec-6">
      <title>6. OPTIMIZED DTW</title>
      <p>Four optimization techniques proposed in [5] can greatly
accelerate the DTW algorithm: 1) early abandoning
z-normalization, 2) reordering early abandoning, 3) reversing the
query/data role in LB_Keogh and 4) cascading lower bounds.
The UCR suite [8] implements the optimized DTW algorithm. In
the first system, optimized DTW was applied to the
iSAX-converted time series; in the second system, it was employed
to compare each dimension of the MFCC features, and the 39
scores were averaged. Per-query
normalization was performed using the formula z = (x − μ)/σ,
where μ is the mean of the raw matching scores for the query
and σ is their standard deviation.</p>
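      <p>The per-query normalization can be sketched as follows (a minimal numpy version of the z-score formula above, applied to hypothetical raw matching scores):</p>

```python
import numpy as np

def per_query_znorm(scores):
    """z = (x - mu) / sigma over one query's raw matching scores."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

raw = [0.2, 0.5, 0.8, 0.5]   # hypothetical scores of one query
z = per_query_znorm(raw)
print(z.round(3))
```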
    </sec>
    <sec id="sec-7">
      <title>7. EXPERIMENT RESULTS</title>
      <p>Results obtained by our systems on both dev queries and
eval queries are shown in Table 2. System 1 performed
better than System 2 on both the Cnxe and ATWV metrics.
The smaller Cnxe indicates that System 1 is the more informative
system of the two, which might be explained by the fact that
System 1 chose candidate files based on both (strict)
global and local similarities, while System 2 selected the
subsequences that best matched either the entire query (e.g., "white
horse") or part of the query (e.g., "horse"),
which might fail to capture some inexact matches such as
Type 2 and Type 3 queries, and might misrecognize files that only
partially matched the query. Therefore, System 2 performed
worse, even though it introduced acoustic information from
MFCCs and generated a response faster. In addition, basing
decisions on the phoneme sequences made the accuracy of the
phoneme recognizers impact the performance of both
systems. To run the experiments, we used a computer with
60 CPUs (2.8 GHz, 120 cores), 64 GB RAM and a 1 TB hard
drive. Most of the computational cost of our systems was
incurred by the alignment and DTW phases. Approximately,
the Indexing Speed Factor was 0.607, the Searching Speed
Factor was 0.307 per second per language, and Peak Memory
was 1.903 GB for the primary system.</p>
    </sec>
    <sec id="sec-8">
      <title>8. CONCLUSIONS</title>
      <p>We briefly summarized our two systems submitted to the
MediaEval QUESST 2015 shared task, along with their
evaluation results. Our first system used (strict) global/local
similarities of symbolic phoneme sequences, and was therefore
compatible with both the exact query search and
approximate search tasks. Additionally, a classifier-based selection
method for candidate reference files was explored. Our
second system used local similarity as a measure to select
the subsequences most similar to the query sequence, thus
reducing the computational burden of the DTW comparison. In
both systems, optimized DTW algorithms were employed to
compare either iSAX-converted phoneme sequences or 39-dimension
MFCC features to determine if a query appears within a
reference sample. In the future, more acoustic features and
selection approaches should be investigated.</p>
    </sec>
    <sec id="sec-9">
      <title>9. REFERENCES</title>
      <p>[1] I. Szoke, L.-J. Rodriguez-Fuentes, A. Buzo, X. Anguera,
F. Metze, J. Proenca, M. Lojka, and X. Xiong, \Query
by example search on speech at MediaEval 2015," in
Working Notes Proceedings of the MediaEval 2015
Workshop, Sept. 14-15, 2015, Wurzen, Germany, 2015.
[2] P. Schwarz, \Phoneme Recognizer Based on Long
Temporal Context," in http:
// www. t.vutbr.cz/ ~schwarzp/ publi/ thesis.pdf ,PhD
Thesis, Brno University of Technology, 2008.
[3] M. M. Rahman and D. Davis, \Cluster Based
Under-Sampling for Unbalanced Cardiovascular Data,"
in Proceedings of the World Congress on Engineering,
vol. 3, 2013.
[4] S. Keerthi, S. Shevade, C. Bhattacharyya, and
K. Murthy, \Improvements to Platt's SMO Algorithm
for SVM Classi er Design," in Neural Computation,
vol. 13, no. 3, 2001, pp. 637{649.
[5] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista,
B. Westover, Q. Zhu, J. Zakaria, and E. Keogh,
\Searching and Mining Trillions of Time Series
Subsequences under Dynamic Time Warping," in
Proceedings of the 18th ACM SIGKDD international
conference on Knowledge discovery and data mining.</p>
      <p>ACM, 2012, pp. 262{270.
[6] J. Shieh and E. Keogh, \iSAX: Indexing and Mining
Terabyte Sized Time Series," in Proceedings of the 14th
ACM SIGKDD international conference on Knowledge
discovery and data mining. ACM, 2008, pp. 623{631.
[7] F. Eyben, F. Weninger, F. Gro , and B. Schuller,
\Recent Developments in openSMILE, the Munich
Open-Source Multimedia Feature Extractor," in
Proceedings of the 21st ACM international conference
on Multimedia. ACM, 2013, pp. 835{838.
[8] \The UCR Suite," in
www.cs.ucr.edu/ ~eamonn/ UCRsuite.html .</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>