       CUNY Systems for the Query-by-Example Search on
               Speech Task at MediaEval 2015

                                             Min Ma¹, Andrew Rosenberg²
                               ¹Department of Computer Science, CUNY Graduate Center, USA
                              ²Department of Computer Science, Queens College (CUNY), USA
                               mma@gradcenter.cuny.edu, andrew@cs.qc.cuny.edu


ABSTRACT
This paper describes two query-by-example systems developed by the Speech Lab at Queens College (CUNY). Our systems aimed to respond with quick search results from selected reference files. Three phonetic recognizers (Czech, Hungarian and Russian) were utilized to obtain phoneme sequences of both query and reference speech files. Each query sequence was compared with all the reference sequences using both global and local aligners. In the first system, we predicted the most probable reference files based on the sequence alignment results; in the second system, we pruned out the subsequences of the reference sequences that yielded the best local symbolic alignments, then extracted 39-dimension MFCC features for both the query and the subsequences. Both systems employed an optimized DTW, and obtained Cnxe of 0.9989 and 1.0674 on the test data, respectively.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.
1.   INTRODUCTION
The primary goal of query-by-example search is to detect occurrences of a query audio in an unlabeled reference speech corpus. This year, the task emphasized probabilistic decisions about query occurrence, ignoring the exact start and end times of hit regions in the reference audio. Moreover, three different types of query searches were used in this year's evaluation in addition to exact matches of the query audio: 1) word-level re-orderings, 2) small filler content and 3) conversational queries in context. A detailed description can be found in [1]. We therefore targeted comprehensive search systems that are compatible with queries of different types and capable of outputting both strict and approximate matches. Our systems took advantage of phonetic symbolic sequences as well as MFCC features, and employed a DTW algorithm with novel optimizations. Note that all the queries were manually recorded with their acoustic context, in order to avoid any problems when cutting the queries from a longer sentence. However, this context might interfere with retrieval performance. Thus, using the provided timing boundaries of the relevant words, we filtered out the exact spoken queries as the ones searched by our systems.

2.   PHONETIC TOKENIZER
Unlike resource-rich languages, automatic speech recognition for low-resource languages remains challenging due to the lack of adequate training data. Without sufficient data for language modeling, or reliable pronunciation modeling, we instead decode the audio to a phoneme sequence. We utilized three phonetic tokenizers (Czech, Hungarian and Russian) released by Brno University of Technology (BUT) to generate the phoneme transcripts [2]. All BUT recognizers used the split temporal context network structure. All queries and reference audio were decoded by the three tokenizers, and non-phonetic symbols (pau, int, spk) were removed from the final sequences.
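As a minimal illustration of this cleanup step (the function and variable names below are ours, not part of the BUT tools), the filtering amounts to dropping the three non-phonetic symbols from each decoded transcript:

    # Sketch of the transcript cleanup: drop the non-phonetic symbols
    # (pau, int, spk) from a space-separated decoded phoneme string.
    NON_PHONETIC = {"pau", "int", "spk"}

    def clean_transcript(decoded):
        """Return the phoneme sequence with non-phonetic symbols removed."""
        return [tok for tok in decoded.split() if tok not in NON_PHONETIC]

    # clean_transcript("pau u i k spk") -> ["u", "i", "k"]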
3.   SEQUENCE ALIGNMENT
Considering the varied acoustic and voice conditions of the audio and the error rate of the phonetic tokenizers, the phonetic sequences from the query and from the actual hit region in the reference audio may not always be exactly the same, but they should be similar. Thus we employed three aligners to match phoneme sequences: 1) a strict global aligner, 2) a (fuzzy) global aligner and 3) a local aligner, where 1) was designed for exact matches, while 2) and 3) were for inexact matches. Every aligner used Levenshtein edit distance and backtracking to find the optimal matching parts of each sequence pair, yielding a matching score and a similarity percentage. We performed alignment for all the sequences decoded by the different tokenizers, thus obtaining 9 matching scores and 9 similarity percentages for each sequence pair.
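As a sketch of the global alignment step (the exact scoring and the backtracking of our aligners are omitted; the names here are ours), the token-level Levenshtein distance and a similarity percentage can be computed as follows:

    # Minimal global aligner sketch: Levenshtein edit distance over
    # phoneme tokens, plus a similarity percentage derived from it.
    def global_align(query, ref):
        m, n = len(query), len(ref)
        # dp[i][j] = edit distance between query[:i] and ref[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if query[i - 1] == ref[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # match/substitution
        dist = dp[m][n]
        similarity = 100.0 * (1.0 - dist / max(m, n))    # percentage
        return dist, similarity

A local aligner differs by allowing the match to start and end anywhere in the reference sequence, so only a region of the reference contributes to the score.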
4.   SYSTEM 1: SMO+ISAX
In our first system, we investigated how to use global and local symbolic similarities to identify the best possible reference files for DTW to search.

Binary Classifier: We aggregated all the scores and percentages computed in Section 3 for each query-reference (q-ref) pair, then assigned the label "CORR" ("correct", if the query occurred in the reference audio) or "FA" ("false alarm") to each training query. The data set showed strong between-class imbalance, with 1,365 times as many "FA" samples as "CORR" instances. Since most classification algorithms do not perform optimally when class examples are extremely imbalanced, we applied clustering centroid undersampling [3], a technique proven effective for this issue, to our training data. We then trained a support vector machine using sequential minimal optimization (SMO) [4] and evaluated it with four-fold cross validation on the training data. Another SMO model was trained on the whole training set and applied to the test queries to predict "CORR"/"FA". Limited by time, for each query we narrowed the DTW search scope to the top 50 reference files with the highest prediction confidences.
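The sketch below illustrates this pipeline with scikit-learn and imbalanced-learn as stand-ins for the tools actually used; the feature file names and hyperparameters shown are hypothetical:

    import numpy as np
    from imblearn.under_sampling import ClusterCentroids
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Hypothetical inputs: one row per q-ref pair holding the 9 matching
    # scores and 9 similarity percentages from Section 3 (18 features),
    # and binary labels (1 = "CORR", 0 = "FA").
    X = np.load("qref_features.npy")
    y = np.load("qref_labels.npy")

    # Cluster-centroid undersampling [3] to counter the ~1365:1 imbalance.
    X_bal, y_bal = ClusterCentroids(random_state=0).fit_resample(X, y)

    # An SVM stands in for the SMO-trained classifier [4]; SMO is itself
    # an SVM training algorithm, so the decision function is comparable.
    clf = SVC(kernel="linear", probability=True)
    print(cross_val_score(clf, X_bal, y_bal, cv=4))  # four-fold CV
    clf.fit(X_bal, y_bal)

    # Rank reference files by predicted P(CORR) and keep the top 50 as
    # the DTW search scope (grouping by query is omitted here).
    confidence = clf.predict_proba(X)[:, 1]
    top50 = np.argsort(-confidence)[:50]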
Table 1: An algorithm converting a phoneme sequence T to a time series S (A stands for the phoneme alphabet)

   Always set s_1 = 0;
   For T = t_1, t_2, ..., t_N, where t_i ∈ A = {l_1, l_2, ..., l_M}:
       s_{i+1} = s_i + (M/2 - j + 1)   if t_i = l_j and j ≤ M/2;
       s_{i+1} = s_i - (j - M/2)       if t_i = l_j and j > M/2.

Sequence Mapping: DTW is an excellent search algorithm, especially on small data sets [5]. Therefore, after using the edit distance to get rough decisions (i.e., to shrink the scope of candidates), we applied DTW to make more accurate decisions as to whether a spoken query occurs in a candidate audio segment. We utilized indexable iSAX [6], a method that uses symbolic words to internally organize and index the data, and that supports DTW by building bounding envelopes around the query time series. Following [6], we mapped the phoneme sequences using the approach shown in Table 1. The phoneme vocabulary was constructed from all the phonemes in the query and reference sequences. To simplify the mapping algorithm shown in Table 1, a "null" symbol was inserted at the end of the alphabet to ensure the total number of symbols is even. In this way, we converted phoneme sequences to integer time series, in order to apply the DTW described in Section 6.
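A direct reading of Table 1 in code (the function name and the "<null>" padding symbol are our own labels): symbols in the first half of the alphabet step the series upward, those in the second half step it downward, so each phoneme contributes a distinct integer increment.

    def to_time_series(T, alphabet):
        if len(alphabet) % 2 == 1:
            alphabet = alphabet + ["<null>"]  # pad so M is even (see text)
        M = len(alphabet)
        index = {l: j for j, l in enumerate(alphabet, start=1)}  # 1-based j
        s = [0]                               # always set s_1 = 0
        for t in T:
            j = index[t]
            if j <= M // 2:
                s.append(s[-1] + (M // 2 - j + 1))
            else:
                s.append(s[-1] - (j - M // 2))
        return s

    # to_time_series(["u", "i", "k"], ["i", "k", "u", "<null>"]) -> [0, -1, 1, 2]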
                                                                       System 1 chose the candidate files based on both (strict)
5.   SYSTEM 2: SUBSEQ+MFCC                                             global and local similarities, while System 2 selected the sub-
In our second system, phoneme sequence alignments were                 sequences that either matched the entire query(e.g. ”white
used to determine the subsequences of reference files that             horse”) most, or matched part of the query(e.g. ”horse”)
best matched the query sequence. We then applied DTW                   most, which might fail to capture some inexact matches as
over the subsequence and query pair (q-subref) by comparing            Type 2 and Type 3 queries and misrecognized files that just
their MFCC features, a type of acoustic features that has              partially matched the query. Therefore, System 2 performed
been widely used for the task.                                         worse, even though it introduced acoustic information from
                                                                       MFCC and generated a response faster. Besides, basing
Subsequence Detection: We made an approximate es-                      decisions on the phoneme sequences made the accuracy of
timation of start and end time points of the subsequence               phoneme recognizers impact the performance of both sys-
that was likely to be an occurrence of query, only for the             tems. To run the experiments, we used a computer with
reference files that achieved a high average similarity from           60CPUs (2.8 GHz, 120 cores), 64GB RAM and 1TB hard
local aligners. More specifically, we selected the top 100 q-          drive. Most of the computation cost of our systems was
subref pairs for each query and extracted the hypothesized             caused by the alignment and DTW phases. Approximately,
matched regions from reference audio segments. For exam-               the Indexing Speed Factor was 0.607, Searching Speed Fac-
ple, suppose we have a query sequence ”u i k”, then only the           tor was 0.307 per sec and per language, and Peak Memory
region that were transcribed as ”u i k” would be searched.             was 1.903 GB for the primary system.
Therefore, using the symbol matching looking, we only need
make DTW comparisons between the query sequence and a                  8.   CONCLUSIONS
smaller local region of the reference segment.                         We briefly summarize our two systems as part of the Medi-
                                                                       aEval QUESST 2015 Shared Tasks along with their evalua-
MFCC Feature Extraction: We extracted 39-dimension                     tion results. Our first system involved (strict) global/local
MFCC features for both query and subsequence speech files              similarities of symbolic phoneme sequences, therefore was
using OpenSMILE [7]. The features were produced from                   compatible to both the exact query search and approxi-
25 ms audio frames (sampled at a rate of 10 ms), by com-               mate search tasks. Additionally, a classifier-based selection
puting 13 MFCC (0-12) from 26 Mel-frequency bands and                  method of candidate reference files were explored. Our sec-
applying a cepstral liftering filter with a weight parameter           ond system used local similarity as measurement to filter
of 22, with 13 delta and 13 acceleration coecients appended            out the most similar subsequences to query sequence, thus
to the MFCC. Finally, we made mean normalization of all                reduced the computational burden of DTW comparison. In
the features with respect to the full input sequence.                  our systems, optimized DTW algorithms were employed to
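openSMILE is driven by a configuration file rather than code; the following librosa-based sketch is an illustrative stand-in that reproduces the same recipe (librosa is our substitution, not the tool used):

    import librosa
    import numpy as np

    def extract_mfcc39(path):
        y, sr = librosa.load(path, sr=None)
        # 13 MFCCs (0-12) from 26 mel bands, 25 ms windows, 10 ms hop,
        # cepstral liftering weight 22 -- mirroring the openSMILE setup.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26,
                                    lifter=22, n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),            # 13 deltas
                           librosa.feature.delta(mfcc, order=2)])  # 13 accelerations
        # Mean normalization with respect to the full input sequence.
        return feats - feats.mean(axis=1, keepdims=True)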
6.   OPTIMIZED DTW
Four optimization techniques proposed in [5] can greatly accelerate the DTW algorithm: 1) early abandoning of z-normalization, 2) reordering early abandoning, 3) reversing the query/data role in LB_Keogh and 4) cascading lower bounds. The UCR suite [8] implements this optimized DTW algorithm. In the first system, the optimized DTW was applied to the iSAX-converted time series; in the second system, it was employed to compare each dimension of the MFCC features, and the 39 scores were then averaged. Per-query normalization was performed using the formula z = (x - µ)/σ, where µ is the mean of the raw matching scores for the query and σ is their standard deviation.
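To make the lower-bounding idea concrete, here is a minimal Python sketch of LB_Keogh and of the z = (x - µ)/σ form, which is shared by the series normalization in the UCR suite and our per-query score normalization (the window parameter r and the function names are ours):

    import numpy as np

    def znorm(x):
        # z = (x - mu) / sigma, applied to a series or to a query's
        # raw matching scores.
        return (x - x.mean()) / x.std()

    def lb_keogh(query, candidate, r):
        """Lower bound on DTW(query, candidate); assumes equal lengths
        and a Sakoe-Chiba warping window of width r."""
        lb = 0.0
        for i, q in enumerate(query):
            window = candidate[max(0, i - r): i + r + 1]
            lo, hi = window.min(), window.max()  # envelope around candidate
            if q > hi:
                lb += (q - hi) ** 2
            elif q < lo:
                lb += (lo - q) ** 2
        return lb

    # In a cascade, candidates whose lb_keogh(...) already exceeds the
    # best DTW distance found so far are discarded without running the
    # full O(nm) DTW computation.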
Table 2: System performances on all the queries (System 1 is the primary system)

   System             actCnxe   minCnxe   ATWV      MTWV
   SMO+iSAX-dev       0.9988    0.9872     0.0011   0.0067
   SMO+iSAX-eval      0.9989    0.9870     0.0006   0.0010
   Subseq+MFCC-dev    1.0658    0.9823    -3.9820   0.0123
   Subseq+MFCC-eval   1.0674    0.9853    -4.0205   0.0006

7.   EXPERIMENT RESULTS
Results obtained by our systems on both dev and eval queries are shown in Table 2. System 1 performed better than System 2 on both the Cnxe and ATWV metrics. The smaller Cnxe indicates that System 1 is the more informative of the two systems, which might be explained by the fact that System 1 chose candidate files based on both (strict) global and local similarities, while System 2 selected the subsequences that best matched either the entire query (e.g., "white horse") or part of the query (e.g., "horse"), and thus might fail to capture some inexact matches such as Type 2 and Type 3 queries, or misrecognize files that only partially matched the query. Therefore, System 2 performed worse, even though it introduced acoustic information from the MFCCs and generated responses faster. Moreover, since both systems base decisions on the phoneme sequences, the accuracy of the phoneme recognizers impacts the performance of both. To run the experiments, we used a computer with 60 CPUs (2.8 GHz, 120 cores), 64 GB RAM and a 1 TB hard drive. Most of the computational cost of our systems was incurred by the alignment and DTW phases. Approximately, the Indexing Speed Factor was 0.607, the Searching Speed Factor was 0.307 per second and per language, and the Peak Memory was 1.903 GB for the primary system.

8.   CONCLUSIONS
We have briefly summarized our two systems submitted to the MediaEval QUESST 2015 shared task, along with their evaluation results. Our first system used (strict) global/local similarities of symbolic phoneme sequences, and was therefore compatible with both the exact query search and the approximate search tasks. Additionally, a classifier-based selection method for candidate reference files was explored. Our second system used local similarity as the measurement to filter out the subsequences most similar to the query sequence, thus reducing the computational burden of the DTW comparison. In both systems, optimized DTW algorithms were employed to compare either iSAX-converted phoneme sequences or 39-dimension MFCC features to determine whether a query appears within a reference sample. In the future, more acoustic features and selection approaches should be investigated.

9.   REFERENCES
[1] I. Szöke, L.-J. Rodriguez-Fuentes, A. Buzo, X. Anguera,
    F. Metze, J. Proenca, M. Lojka, and X. Xiong, “Query
    by example search on speech at MediaEval 2015,” in
    Working Notes Proceedings of the MediaEval 2015
    Workshop, Sept. 14-15, 2015, Wurzen, Germany, 2015.
[2] P. Schwarz, “Phoneme Recognizer Based on Long
    Temporal Context,” PhD thesis, Brno University of
    Technology, 2008.
    http://www.fit.vutbr.cz/~schwarzp/publi/thesis.pdf
[3] M. M. Rahman and D. Davis, “Cluster Based
    Under-Sampling for Unbalanced Cardiovascular Data,”
    in Proceedings of the World Congress on Engineering,
    vol. 3, 2013.
[4] S. Keerthi, S. Shevade, C. Bhattacharyya, and
    K. Murthy, “Improvements to Platt’s SMO Algorithm
    for SVM Classifier Design,” in Neural Computation,
    vol. 13, no. 3, 2001, pp. 637–649.
[5] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista,
    B. Westover, Q. Zhu, J. Zakaria, and E. Keogh,
    “Searching and Mining Trillions of Time Series
    Subsequences under Dynamic Time Warping,” in
    Proceedings of the 18th ACM SIGKDD international
    conference on Knowledge discovery and data mining.
    ACM, 2012, pp. 262–270.
[6] J. Shieh and E. Keogh, “iSAX: Indexing and Mining
    Terabyte Sized Time Series,” in Proceedings of the 14th
    ACM SIGKDD international conference on Knowledge
    discovery and data mining. ACM, 2008, pp. 623–631.
[7] F. Eyben, F. Weninger, F. Groß, and B. Schuller,
    “Recent Developments in openSMILE, the Munich
    Open-Source Multimedia Feature Extractor,” in
    Proceedings of the 21st ACM international conference
    on Multimedia. ACM, 2013, pp. 835–838.
[8] “The UCR Suite,”
    www.cs.ucr.edu/~eamonn/UCRsuite.html.