=Paper=
{{Paper
|id=Vol-1436/Paper83
|storemode=property
|title=CUNY Systems for the Query-by-Example Search on Speech Task at MediaEval 2015
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper83.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MaR15
}}
==CUNY Systems for the Query-by-Example Search on Speech Task at MediaEval 2015==
Min Ma (1), Andrew Rosenberg (2)
(1) Department of Computer Science, CUNY Graduate Center, USA
(2) Department of Computer Science, Queens College (CUNY), USA
mma@gradcenter.cuny.edu, andrew@cs.qc.cuny.edu

ABSTRACT

This paper describes two query-by-example systems developed by the Speech Lab, Queens College (CUNY). Our systems aimed to return quick search results from the selected reference files. Three phonetic recognizers (Czech, Hungarian and Russian) were utilized to obtain phoneme sequences of both query and reference speech files. Each query sequence was compared with all the reference sequences using both global and local aligners. In the first system, we predicted the most probable reference files based on the sequence alignment results; in the second system, we pruned out the subsequences of the reference sequences that yielded the best local symbolic alignments, then extracted 39-dimension MFCC features for both the query and the subsequences. Both systems employed an optimized DTW, and obtained Cnxe of 0.9989 and 1.0674 on the test data, respectively.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

1. INTRODUCTION

The primary goal of query-by-example search is to detect occurrences of a query audio in an unlabeled reference speech corpus. This year, the task emphasized probabilistic decisions about query occurrence, ignoring the exact start and end times of hit regions in the reference audio. Moreover, three different types of query searches were used in this year's evaluation in addition to exact matches of the query audio: 1) word-level re-orderings, 2) small filler content and 3) conversational queries in context. A detailed description can be found in [1]. We therefore targeted comprehensive search systems that are compatible with queries of different types and capable of outputting both strict and approximate matches.

Our systems took advantage of phonetic symbolic sequences as well as MFCC features, and employed a DTW algorithm with novel optimizations. Note that all the queries were manually recorded with their acoustic context, in order to avoid any problems when cutting the queries from a longer sentence. However, this context might interfere with retrieval performance. Thus, we trimmed each query to the exact spoken words before searching, using the provided timing boundaries of the relevant words.

2. PHONETIC TOKENIZER

Unlike resource-rich languages, low-resource languages remain challenging for automatic speech recognition due to the lack of adequate training data. Without sufficient data for language modeling, or reliable pronunciation modeling, we instead seek to decode the audio to a phoneme sequence. We utilized three phonetic tokenizers (Czech, Hungarian and Russian) released by Brno University of Technology (BUT) to generate the phoneme transcripts [2]. All BUT recognizers used the split temporal context network structure. All queries and reference audio files were decoded by the three tokenizers, and non-phonetic symbols (pau, int, spk) were removed from the final sequences.

3. SEQUENCE ALIGNMENT

Considering the varied acoustic and voice conditions of the audio and the error rate of the phonetic tokenizers, the phonetic sequences of a query and of the actual hit region in the reference audio may not always be exactly the same, but should be similar. Thus we employed three aligners to match phoneme sequences: 1) a strict global aligner, 2) a (fuzzy) global aligner and 3) a local aligner, where 1) was designed for exact matches, while 2) and 3) target inexact matches. Every aligner used Levenshtein edit distance and backtracking to find the optimal matching parts of each sequence pair, yielding a matching score and a similarity percentage. We performed alignment for all the sequences decoded by the different tokenizers, thus obtaining 9 matching scores and 9 similarity percentages for each sequence pair.
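For illustration, the following is a minimal sketch of the kind of global aligner described above: a standard Levenshtein dynamic program over phoneme symbols. The function name `global_align` and the similarity-percentage definition are illustrative assumptions, not the authors' exact implementation, which also performs backtracking and includes fuzzy and local variants.

```python
def global_align(query, ref):
    """Globally align two phoneme sequences (lists of phoneme strings)
    with Levenshtein edit distance. Returns (distance, similarity %).
    A sketch only; the similarity definition is an assumption."""
    n, m = len(query), len(ref)
    # dp[i][j] = edit distance between query[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if query[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    dist = dp[n][m]
    similarity = 100.0 * (1.0 - dist / max(n, m, 1))
    return dist, similarity

# Example: a query decoded by one tokenizer vs. one reference sequence
dist, sim = global_align(["u", "i", "k"], ["u", "i", "g", "k"])
```

Running the nine query-tokenizer/reference combinations through such aligners yields the 9 matching scores and 9 similarity percentages per sequence pair mentioned above.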
4. SYSTEM 1: SMO+ISAX

In our first system, we investigated how to use global and local symbolic similarities to identify the best possible reference files for DTW to search.

Binary Classifier: We aggregated all the scores and percentages computed in Section 3 for each query-reference (q-ref) pair, then assigned the label "CORR" ("correct", if the query occurred in the reference audio) or "FA" ("false alarm") for training queries. The data set showed strong between-class imbalance: "FA" samples were 1,365 times more numerous than "CORR" instances. Since most classification algorithms do not perform optimally when class examples are extremely imbalanced, we applied clustering centroid undersampling [3], a technique proven effective for this issue, to our training data. We then trained a support vector machine using sequential minimal optimization (SMO) [4] and evaluated it using four-fold cross validation on the training data. Another SMO model was trained over the whole training set and applied to the test queries to predict "CORR"/"FA". Due to time constraints, for each query we narrowed the search scope of DTW to the top 50 reference files with the highest prediction confidence.

Sequence Mapping: DTW is an excellent search algorithm, especially on small data sets [5]. Therefore, after using the edit distance to get rough decisions (i.e. to shrink the scope of candidates), we applied DTW to make more accurate decisions as to whether a spoken query occurs in a candidate audio segment. We utilized indexable iSAX [6], a method that uses symbolic words to internally organize and index the data, and that supports DTW by building bounding envelopes around the query time series. Following [6], we mapped each phoneme sequence to a time series using the algorithm shown in Table 1. The phoneme vocabulary was constructed from all the phonemes occurring in the query and reference sequences. To simplify the mapping algorithm, a "null" symbol was inserted at the end of the alphabet to ensure the total number of symbols is even. In this way, we converted phoneme sequences to integer time series, in order to apply the DTW described in Section 6.

Table 1: An algorithm converting a phoneme sequence T to a time series S (A stands for the phoneme alphabet)

  Always set s_1 = 0.
  For T = t_1, t_2, ..., t_N, where t_i ∈ A = {l_1, l_2, ..., l_M}:
    s_{i+1} = s_i + (M/2 − j + 1), if t_i = l_j and j ≤ M/2;
    s_{i+1} = s_i − (j − M/2),     if t_i = l_j and j > M/2.
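The mapping in Table 1 is straightforward to implement. Below is a minimal sketch that follows the table directly; the function name `phonemes_to_series` and the toy alphabet in the example are illustrative assumptions.

```python
def phonemes_to_series(seq, alphabet):
    """Convert a phoneme sequence to an integer time series,
    following the algorithm in Table 1. Returns s_1..s_{N+1}."""
    if len(alphabet) % 2 == 1:
        alphabet = alphabet + ["null"]  # pad so M is even (see text)
    M = len(alphabet)
    index = {p: j + 1 for j, p in enumerate(alphabet)}  # 1-based rank j
    s = [0]  # s_1 = 0
    for t in seq:
        j = index[t]
        if j <= M // 2:
            s.append(s[-1] + (M // 2 - j + 1))  # low-rank phoneme: step up
        else:
            s.append(s[-1] - (j - M // 2))      # high-rank phoneme: step down
    return s

# Example with a toy 4-phoneme alphabet:
# yields [0, 2, 1, -1, 0]
series = phonemes_to_series(["a", "k", "u", "i"], ["a", "i", "k", "u"])
```

Phonemes in the first half of the alphabet move the series upward and those in the second half move it downward, so similar phoneme sequences trace similar integer curves, which is what allows DTW to compare them.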
5. SYSTEM 2: SUBSEQ+MFCC

In our second system, phoneme sequence alignments were used to determine the subsequences of reference files that best matched the query sequence. We then applied DTW over each subsequence-query pair (q-subref) by comparing their MFCC features, a type of acoustic feature that has been widely used for this task.

Subsequence Detection: We estimated approximate start and end time points of the subsequence likely to be an occurrence of the query, only for the reference files that achieved a high average similarity from the local aligners. More specifically, we selected the top 100 q-subref pairs for each query and extracted the hypothesized matched regions from the reference audio segments. For example, given a query sequence "u i k", only regions transcribed as "u i k" would be searched. Therefore, using this symbolic matching, we only need to make DTW comparisons between the query sequence and a smaller local region of the reference segment.

MFCC Feature Extraction: We extracted 39-dimension MFCC features for both query and subsequence speech files using openSMILE [7]. The features were produced from 25 ms audio frames (sampled at a rate of 10 ms) by computing 13 MFCCs (0-12) from 26 Mel-frequency bands and applying a cepstral liftering filter with a weight parameter of 22, with 13 delta and 13 acceleration coefficients appended to the MFCCs. Finally, we applied mean normalization to all the features with respect to the full input sequence.

6. OPTIMIZED DTW

Four optimization techniques proposed in [5] can greatly accelerate the DTW algorithm: 1) early abandoning of z-normalization, 2) reordering early abandoning, 3) reversing the query/data role in LB_Keogh and 4) cascading lower bounds. UCRsuite [8] implements the optimized DTW algorithm. In the first system, optimized DTW was applied to the iSAX-converted time series; in the second system, it was employed to compare each dimension of the MFCC features, and the 39 per-dimension scores were then averaged. Per-query normalization was performed using the formula z = (x − µ)/σ, where µ is the mean of the raw matching scores for the query and σ is their standard deviation.
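To make the lower-bounding idea concrete, here is a compact sketch of the LB_Keogh bound (in the role-reversed form, with the envelope built around the candidate) together with the per-query score normalization from the formula above. This is a simplified illustration of the technique in [5], not the UCRsuite code itself; the function names are assumptions.

```python
import numpy as np

def lb_keogh(query, candidate, r):
    """LB_Keogh lower bound on band-constrained DTW distance.
    An upper/lower envelope of width r is built around the candidate;
    only query points falling outside the envelope add to the bound.
    If the bound already exceeds the best DTW distance so far, the
    full DTW computation for this candidate can be skipped."""
    candidate = np.asarray(candidate, dtype=float)
    n = len(candidate)
    lb = 0.0
    for i, q in enumerate(query):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        U, L = candidate[lo:hi].max(), candidate[lo:hi].min()
        if q > U:
            lb += (q - U) ** 2
        elif q < L:
            lb += (q - L) ** 2
    return np.sqrt(lb)

def per_query_znorm(scores):
    """Per-query normalization z = (x - mu) / sigma over the raw
    matching scores of one query (Section 6). Assumes sigma > 0."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()
```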
7. EXPERIMENT RESULTS

Results obtained by our systems on both dev and eval queries are shown in Table 2. System 1 performed better than System 2 on both the Cnxe and ATWV metrics. The smaller Cnxe indicates that System 1 is the more informative system of the two. This might be because System 1 chose candidate files based on both (strict) global and local similarities, while System 2 selected the subsequences that best matched either the entire query (e.g. "white horse") or part of the query (e.g. "horse"), which might fail to capture some inexact matches, such as Type 2 and Type 3 queries, and might mistakenly select files that only partially matched the query. Therefore, System 2 performed worse, even though it introduced acoustic information from MFCCs and generated responses faster. In addition, basing decisions on the phoneme sequences made the accuracy of the phoneme recognizers affect the performance of both systems.

Table 2: System performances on all the queries (System 1 is the primary system)

| System           | actCnxe | minCnxe | ATWV    | MTWV   |
| SMO+iSAX-dev     | 0.9988  | 0.9872  | 0.0011  | 0.0067 |
| SMO+iSAX-eval    | 0.9989  | 0.9870  | 0.0006  | 0.0010 |
| Subseq+MFCC-dev  | 1.0658  | 0.9823  | -3.9820 | 0.0123 |
| Subseq+MFCC-eval | 1.0674  | 0.9853  | -4.0205 | 0.0006 |

To run the experiments, we used a computer with 60 CPUs (2.8 GHz, 120 cores), 64 GB RAM and a 1 TB hard drive. Most of the computational cost of our systems came from the alignment and DTW phases. Approximately, the Indexing Speed Factor was 0.607, the Searching Speed Factor was 0.307 per second and per language, and the Peak Memory was 1.903 GB for the primary system.

8. CONCLUSIONS

We briefly summarized our two systems submitted to the MediaEval QUESST 2015 shared task, along with their evaluation results. Our first system relied on (strict) global/local similarities of symbolic phoneme sequences, and was therefore compatible with both the exact query search and the approximate search tasks. Additionally, a classifier-based method for selecting candidate reference files was explored. Our second system used local similarity to select the subsequences most similar to the query sequence, thus reducing the computational burden of the DTW comparison. In both systems, optimized DTW algorithms were employed to compare either iSAX-converted phoneme sequences or 39-dimension MFCC features to determine whether a query appears within a reference sample. In the future, more acoustic features and selection approaches should be investigated.

9. REFERENCES

[1] I. Szöke, L.-J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong, "Query by example search on speech at MediaEval 2015," in Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.
[2] P. Schwarz, "Phoneme Recognizer Based on Long Temporal Context," PhD thesis, Brno University of Technology, 2008. http://www.fit.vutbr.cz/~schwarzp/publi/thesis.pdf
[3] M. M. Rahman and D. Davis, "Cluster Based Under-Sampling for Unbalanced Cardiovascular Data," in Proceedings of the World Congress on Engineering, vol. 3, 2013.
[4] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, "Improvements to Platt's SMO Algorithm for SVM Classifier Design," Neural Computation, vol. 13, no. 3, pp. 637-649, 2001.
[5] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh, "Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 262-270.
[6] J. Shieh and E. Keogh, "iSAX: Indexing and Mining Terabyte Sized Time Series," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 623-631.
[7] F. Eyben, F. Weninger, F. Groß, and B. Schuller, "Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor," in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835-838.
[8] "The UCR Suite," http://www.cs.ucr.edu/~eamonn/UCRsuite.html