=Paper=
{{Paper
|id=Vol-1436/Paper8
|storemode=property
|title=Query by Example Search on Speech at Mediaeval 2015
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper8.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SzokeRBAMPLX15
}}
==Query by Example Search on Speech at Mediaeval 2015==
Igor Szoke, Brno University of Technology, Brno, Czech Republic (szoke@fit.vutbr.cz)
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Leioa, Spain (luisjavier.rodriguez@ehu.es)
Andi Buzo, University Politehnica of Bucharest, Bucharest, Romania (andi.buzo@upb.ro)
Xavier Anguera, Sinkronigo S.L., Barcelona, Spain (xanguera@gmail.com)
Florian Metze, Carnegie Mellon University, Pittsburgh, PA, U.S.A. (fmetze@cs.cmu.edu)
Jorge Proenca, University of Coimbra, Coimbra, Portugal (jproenca@co.it.pt)
Martin Lojka, Technical University of Kosice, Kosice, Slovakia (martin.lojka@tuke.sk)
Xiao Xiong, Temasek Laboratories, Nanyang Technological University, Singapore (xiaoxiong@ntu.edu.sg)

ABSTRACT

In this paper, we describe the "Query by Example Search on Speech Task" (QUESST), held as part of the MediaEval 2015 evaluation campaign. As in previous years, the proposed task requires performing language-independent audio search in a low-resource scenario. This year, the task has been designed to get as close as possible to a practical use case, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings. We also stressed the systems with a channel mismatch caused by noise and reverberation.

1. INTRODUCTION

This is the fifth year of query-by-example search on speech evaluations [9, 6, 1, 2]. The task of QUESST ("QUery by Example Search on Speech Task") is to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages and diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system.

Compared to the previous year, two main changes were introduced for this year's evaluation. First, we provide queries with longer context, so that participants can use the surrounding speech to adapt their systems. Second, we artificially add noise and reverberation to the data, in order to measure the robustness of particular feature extraction methods and search algorithms under heavy channel mismatch.

As in the previous year, the proposed task does not require the localization (time stamps) of query matches within audio files. However, systems must provide a score (a real number) for each query match: the higher (the more positive) the score, the more likely it is that the query appears in the audio file. The normalized cross entropy cost (Cnxe) [3, 10] is used as the primary metric, whereas the Actual Term Weighted Value (ATWV) is kept as a secondary metric for diagnostic purposes, which means that systems must provide not only scores, but also Yes/No decisions. Three types of query matching are considered: the first one (T1) involves "exact matches", whereas the second one (T2) allows for inflectional variations of words or word re-orderings (that is, "approximate matches"); the third one (T3) is similar to T2, but queries are drawn from conversational speech, thus containing strong coarticulations and some filler content between words.

2. BRIEF TASK DESCRIPTION

QUESST is part of the MediaEval 2015 evaluation campaign (http://www.multimediaeval.org/mediaeval2015/). As usual, two separate sets of queries were provided, for development and evaluation, along with a single set of audio files on which both sets of queries had to be searched. The set of development queries and the set of audio files were distributed early (April 1st), including the ground truth and the scoring scripts, so that participants could develop and evaluate their systems. The set of evaluation queries was distributed one month later (May 1st). System results (for both sets of queries) had to be returned by the evaluation deadline (July 22nd), including a likelihood score and a Yes/No decision for each (query, audio file) pair. Note that not every query necessarily appears in the set of audio files, and that several queries may appear in the same audio file.

Multiple system results could be submitted (up to 5), but one of them (presumably the best one) had to be identified as primary. Although participants were encouraged to train their systems using only the data released for this year's evaluation, they were allowed to use any additional resources they might have available, as long as their use was documented in their system description papers. System results were then scored and returned to participants (by July 29th), who had to prepare a two-page working-notes paper describing their systems and return it to the organizers (by August 28th). Finally, systems were presented and results were discussed at the MediaEval 2015 workshop (September 14-15, 2015, Wurzen, Germany), which serves to meet fellow participants, share ideas and bootstrap future collaborations.

3. THE QUESST 2015 DATASET

The QUESST 2015 dataset is the result of a joint effort by several institutions to put together a sizable amount of data to be used in this evaluation and in later research on query-by-example search on speech. The search corpus is composed of around 18 hours of audio (11662 files) in the following 7 languages: Albanian, Czech, English, Mandarin, Portuguese, Romanian and Slovak [8], with different amounts of audio per language. The search utterances, which are relatively short (5.8 seconds long on average), were automatically extracted from longer recordings and manually checked to avoid very short or very long utterances.

The dataset includes 445 development queries and 447 evaluation queries, the number of queries per language being more or less balanced with the amount of audio available in the search corpus. A big effort was made to manually record the queries, in order to avoid the problems observed in previous years due to the acoustic context left over when cutting queries out of longer sentences. Speakers recruited for recording the queries were asked to maintain a normal speaking speed and a clear speaking style. All audio files are PCM encoded at 8 kHz, 16 bits/sample, and stored in WAV format.

The data was then artificially noised and reverberated, yielding equal amounts of clean, noisy, reverberated and noisy+reverberated speech. We used both stationary and transient noises downloaded from https://www.freesound.org. Reverberation was obtained by passing the audio through a filter with an artificially generated room impulse response (RIR) [5].
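As a rough illustration of this degradation procedure (not the actual pipeline used to prepare the QUESST data), the sketch below mixes a noise recording into an utterance at a chosen signal-to-noise ratio and convolves it with a room impulse response. The file names, the 10 dB SNR and the use of NumPy/SciPy are assumptions made purely for the example.

```python
# Illustrative sketch (not the official QUESST 2015 pipeline): add noise at a
# chosen SNR and apply a room impulse response by convolution.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix 'noise' into 'speech' at the requested signal-to-noise ratio (dB)."""
    # Loop/crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def add_reverb(speech, rir):
    """Convolve the utterance with a room impulse response, keeping its level."""
    wet = fftconvolve(speech, rir)[:len(speech)]
    return wet * np.sqrt(np.mean(speech ** 2) / (np.mean(wet ** 2) + 1e-12))

# Hypothetical file names; 8 kHz mono 16-bit WAVs, as in the QUESST data.
fs, speech = wavfile.read("utterance.wav")
_, noise = wavfile.read("noise.wav")
_, rir = wavfile.read("rir.wav")
speech, noise, rir = (x.astype(np.float64) for x in (speech, noise, rir))

degraded = add_noise(add_reverb(speech, rir), noise, snr_db=10.0)
wavfile.write("utterance_noisy_reverb.wav",
              fs, np.clip(degraded, -32768, 32767).astype(np.int16))
```

Applying the same kind of degradation to held-out data is one way participants could simulate the channel mismatch during development.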
4. THE GROUND-TRUTH

Similarly to last year's evaluation, we have applied a relaxed concept of a query match, which strongly affects the ground-truth definition and thus the way systems are expected to work. Besides "exact matches" (Type 1), two types of "approximate matches" (Types 2 and 3) are considered, defined as follows:

Type 1 (Exact match): Occurrences of single or multiple word queries in utterances should exactly match the lexical representation of the query. An example of this case is the query "white horse", which should match the utterance "My white horse is beautiful" but should not match "The whiter horse is faster".

Type 2 (Re-ordering and small lexical variations): Here the search algorithm should cope with:

• Lexical variations. Occurrences of single/multiple word queries might differ slightly, with small portions of audio either at the beginning or the end of the segment that do not match the lexical form of the reference. An example of this type of search would be "researcher" matching an utterance containing "research" (note that the inverse would also be possible).

• Word re-orderings and small filler content between words. For example, when searching for the query "white horse", systems should be able to match both "My horse is white" and "I have two white and beautiful horses". Note that the matching words may also contain slight variations with regard to the lexical form of the query.

Type 3 (Conversational queries in context): This type of search is another step towards realistic use-case scenarios. In this case, the spoken query is just part of a sentence that may contain silent/filled pauses and irrelevant words. For example, "Google, let me find some red [uh] white [pau] horse to ride today" could be one of these complex queries. As it is extremely difficult to distinguish between the query words ("white [pau] horse") and the "fillers" ("Google, let me find some red [uh]" and "to ride today"), we provide timing meta-data for the relevant segment inside the spoken query.

As Type 3 queries required timing meta-data, all the queries were recorded within a context and timing information was provided for all of them. The context of Types 1 and 2 was "artificial": speakers were asked to say several words before and after the query, with significant pauses around the query to avoid coarticulation. Participants are free to use the "context" of the spoken query (e.g. for adaptation).

The ground truth was created either manually by native speakers or automatically by speech recognition engines tuned to each particular language, and it was provided by the task organizers following the format of NIST's Spoken Term Detection evaluations. The development package contains a general ground-truth folder (the one that must be used to score system results on the development set), which considers all types of matches, but also three ground-truth folders, one per type of match, to allow participants to evaluate their progress on each condition during system development.

5. PERFORMANCE METRICS

In QUESST 2015, Cnxe and ATWV are used as primary and secondary metrics, respectively. For the Cnxe scores to be meaningful, participants are requested either to return a score (which will be taken as a log-likelihood ratio) for every (query, audio file) pair, or, alternatively, to define a default (floor) score for all the pairs not included in the results file. Participants are also required to report their real-time factor, hardware characteristics and peak memory requirements, in order to profile the different approaches applied. See [10] for further information on how the metrics work and how they are computed.
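For orientation only, the sketch below shows one way to compute a normalized cross entropy cost of this kind from scores interpreted as log-likelihood ratios. It is not the official scoring script: the exact conventions, in particular the target prior, are those defined in [10], so the prior used here is just an assumed parameter.

```python
# Illustrative sketch of a normalized cross entropy cost computed from
# log-likelihood-ratio scores; the official QUESST scoring tool [10] fixes the
# exact conventions (e.g. the empirical target prior), so the default prior
# below is only an assumed parameter.
import numpy as np

def cnxe(target_llrs, nontarget_llrs, p_target=0.0008):
    """Normalized cross entropy of LLR scores for a given target prior."""
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    prior_log_odds = np.log(p_target / (1.0 - p_target))

    # Cross entropy (in bits) of the posteriors obtained by combining
    # the LLR scores with the prior.
    xe_target = np.mean(np.log2(1.0 + np.exp(-(target_llrs + prior_log_odds))))
    xe_nontarget = np.mean(np.log2(1.0 + np.exp(nontarget_llrs + prior_log_odds)))
    c_xe = p_target * xe_target + (1.0 - p_target) * xe_nontarget

    # Cross entropy of a trivial system that only knows the prior.
    c_xe_prior = -(p_target * np.log2(p_target)
                   + (1.0 - p_target) * np.log2(1.0 - p_target))
    return c_xe / c_xe_prior

# Toy usage: well-calibrated, informative scores give Cnxe well below 1.0.
print(cnxe(target_llrs=[8.0, 6.5, 9.1], nontarget_llrs=[-7.0, -5.2, -9.3, -6.1]))
```

A system whose scores carry no information beyond the prior yields Cnxe close to 1.0, and badly calibrated scores can push it above 1.0, which is why calibration matters for this metric.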
6. PROVIDED TOOLS

We offered some basic tools to make the participants' "first contact" with QUESST easier. We provided a Bottle-Neck feature extraction tool [4] trained on the Russian and Hungarian SpeechDat-E data. Next, a calibration and fusion script based on logistic regression [11] and a DTW search [12], both developed at BUT, were provided. Finally, the data and all the scripts were set up in a Virtual Machine, which was provided to the participants through the Speech Recognition Virtual Kitchen (http://speechkitchen.org/, [7]).
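The DTW search distributed to participants is the one described in [12] and is not reproduced here; the following is only a minimal, self-contained sketch of the underlying idea, a subsequence DTW over per-frame feature distances that yields one detection score per (query, audio file) pair. The cosine distance, the feature dimensionality and the toy data are illustrative assumptions.

```python
# Minimal subsequence-DTW sketch for query-by-example detection (illustrative;
# not the BUT tool from [12]).  Frames are rows of a feature matrix, e.g.
# bottleneck features; the cosine distance is an assumption.
import numpy as np

def frame_distances(query, utterance):
    """Pairwise cosine distances between query frames and utterance frames."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    u = utterance / (np.linalg.norm(utterance, axis=1, keepdims=True) + 1e-12)
    return 1.0 - q @ u.T          # shape: (n_query_frames, n_utt_frames)

def subsequence_dtw_score(query, utterance):
    """Best (negated, length-normalized) alignment cost of the query anywhere
    in the utterance; higher means a more likely match."""
    dist = frame_distances(query, utterance)
    n_q, n_u = dist.shape
    acc = np.full((n_q, n_u), np.inf)
    acc[0, :] = dist[0, :]        # the match may start at any utterance frame
    for i in range(1, n_q):
        for j in range(n_u):
            best_prev = acc[i - 1, j]                      # vertical step
            if j > 0:
                best_prev = min(best_prev, acc[i, j - 1],  # horizontal step
                                acc[i - 1, j - 1])         # diagonal step
            acc[i, j] = dist[i, j] + best_prev
    # The match may end at any utterance frame; normalize by query length.
    return -np.min(acc[-1, :]) / n_q

# Toy usage with random "features" (39-dimensional, an assumed dimensionality).
rng = np.random.default_rng(0)
query = rng.normal(size=(40, 39))
utterance = np.vstack([rng.normal(size=(100, 39)), query, rng.normal(size=(80, 39))])
print(subsequence_dtw_score(query, utterance))   # near 0 when an exact copy is embedded
```

In practice, raw scores of this kind are then calibrated and fused across systems, which is the role of the logistic-regression script [11] mentioned above.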
Occurrences of single/multiple word and all the scripts were setup in a Virtual Machine which was queries might differ slightly, with small portions of audio provided to the participants through the Speech Recognition either at the beginning or the end of the segment that do Virtual Kitchen (http://speechkitchen.org/, [7]). not match the lexical form of the reference. An example of this type of search would be “researcher” matching an 7. ACKNOWLEDGEMENTS utterance containing “research” (note that the inverse would also be possible). We would like to thank the Mediaeval organizers for their support and all the participants for their hard work. Por- • Word re-orderings and small filler content between words. tuguesse data were provided with thanks to Tecnovoz project For example, when searching for the query “white horse”, PMDT No. 03/165 systems should be able to match both “My horse is white” and “I have two white and beautiful horses”. Note that the matching words may also contain slight variations with 8. REFERENCES [1] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In Proc. Mediaeval 2013 Workshop. [2] X. Anguera, J. L. Rodriguez-Fuentes, I. Szoke, A. Buzo, and F. Metze. Query by Example Search on Speech at Mediaeval 2014. In Proc. Mediaeval 2014 Workshop. [3] X. Anguera, L.-J. Rodriguez-Fuentes, A. Buzo, F. Metze, I. Szoke, and M. Penagarikano. QUESST2014: Evaluating Query-by-Example Speech Search in a Zero-Resource Setting with Real-Life Queries. In Proc. ICASSP, pages 5833–5837, 2015. [4] F. Grézl, M. Karafiát, S. Kontár, and J. Černocký. Probabilistic and bottle-neck features for lvcsr of meetings. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), pages 757–760. IEEE Signal Processing Society, 2007. [5] S. G. McGovern. Fast image method for impulse response calculations of box-shaped rooms. Applied Acoustics, 70(1):182–189, 2009. [6] F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, and N. Rajput. The Spoken Web Search Task. In Proc. Mediaeval 2012 Workshop. [7] F. Metze, E. Riebling, E. Fosler-Lussier, A. Plummer, and R. Bates. The speech recognition virtual kitchen turns one. In to appear in Proc. INTERSPEECH. ISCA, 2015. [8] M. Pleva and J. Juhar. Tuke-bnews-sk: Slovak broadcast news corpus construction and evaluation. In N. C. C. Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, may 2014. European Language Resources Association (ELRA). [9] N. Rajput and F. Metze. Spoken Web Search. In Proc. Mediaeval 2011 Workshop. [10] L. J. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 Spoken Web Search Task: System Performance Measures. Technical Report TR-2013-1, DEE, University of the Basque Country, 2013. Online: http://gtts.ehu.es/gtts/ NT/fulltext/rodriguezmediaeval13.pdf. [11] I. Szoke, L. Burget, F. Grézl, J. Černocký, and L. Ondel. Calibration and fusion of query-by-example systems - but sws 2013. In Proceedings of ICASSP 2014, pages 7899–7903. IEEE Signal Processing Society, 2014. [12] I. Szoke, M. Skacel, L. Burget, and J. H. Cernocky. Coping with Channel Mismatch in Query-by-Example - BUT QUESST 2014. In Proc. ICASSP, pages 5838–5843, 2015.