Spoken Web Search

Nitendra Rajput
IBM Research
4, Block C, ISID Campus
Vasant Kunj, New Delhi 110070
India
rnitendra@in.ibm.com

Florian Metze
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
USA
fmetze@cs.cmu.edu

ABSTRACT
In this paper, we describe the "Spoken Web Search" Task, which was held as part of the 2011 MediaEval campaign. The purpose of this task was to perform audio search in several languages, with very few resources available in each language. The data was taken from audio content that was created in live settings and was submitted to the "spoken web" over a mobile connection.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing.

General Terms
Algorithms, Performance, Experimentation, Languages.

Keywords
Spoken Term Detection, Web Search, Spoken Web.

1. INTRODUCTION
The "spoken web search" task of MediaEval 2011 [5] involves searching for audio content within audio content, using an audio query. The task required researchers to build a language-independent audio search system that, given an audio query, can find the appropriate audio file(s) and the (approximate) location of the query term within them. Evaluation was performed using standard NIST metrics.

As a contrastive condition (i.e., a "general" run in MediaEval's terms), participants were asked to run systems not based on an audio query, as the organizers also provided the search term in lexical form.

Note that language labels and pronunciation dictionaries were not provided. The lexical form cannot be used to deduce the language in the audio-only condition. The goal of the task was primarily to compare the performance of different methods on this type of data, not so much a performance comparison geared towards different sites.

2. MOTIVATION
Imagine you want to build a simple speech recognition system, or at least a spoken term detection (STD) system, in a new dialect for which only very few audio examples are available. Maybe there is not even a written form for that dialect. Is it possible to do something useful (i.e., identify the topic of a query) using only the very limited resources available?

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

3. RELATED WORK
This task was suggested by IBM Research India and uses data provided by this group; see [2]. Previous attempts at spoken web search have mostly focused on searching through the metadata related to the audio content [3][4].

4. TASK DESCRIPTION
Participants received development and test utterances (audio data) as well as development and test (audio) queries, described in more detail below. Only the occurrence of development queries in development utterances was provided.

Participants were required to submit the following runs in the audio-only condition (i.e., without looking at the textual form of the queries):

• On the test utterances: identify which query (from the set of development queries) occurs in each utterance (0-n matches per term, i.e., not every term necessarily occurs, but multiple matches are possible)
• On the test utterances: identify which query (from the set of test queries) occurs in each utterance (0-n matches)
• On the development utterances: identify which test query occurs in each utterance (0-n matches)

The purpose of requiring these three conditions is to see how critical tuning is for the different approaches. We assume that participants already know their performance for "dev queries" on "dev utterances"; the evaluation therefore measures the performance of unseen "test queries" on previously known "dev utterances" (which could have been used for unsupervised adaptation, etc.), of known "dev queries" (for which good classifiers could have been developed) on unseen data, and of unseen queries on unseen utterances.

Optionally, participants were asked to also submit the same runs using the provided lexical form of the query, i.e., they could use existing speech recognition systems, etc., for comparison purposes.

Not every test query occurred in the data (that is why it is "0-n matches").

Participants could submit multiple systems, but had to designate one primary system. If more than one system was submitted as primary, the last one uploaded was considered "primary".

Participants were allowed to use any additional resources they might have available, as long as their use is documented in the working notes paper.

4.1 Development Data
Participants were provided with a data set that has been kindly made available by the Spoken Web team at IBM Research, India [6]. The audio content is spontaneous speech that was created over the phone in a live setting by low-literate users. While most of the audio content is related to farming practices, there are other domains as well. The data set comprises audio from four different Indian languages: English, Hindi, Gujarati and Telugu. Each data item is ca. 4-30 seconds in length. However, the language labels were intentionally not provided, either in the development or in the evaluation data set.

As already mentioned above, participants were allowed to use any additional resources they might have available, as long as their use is documented in the working notes paper.

The development set contains 400 utterances (100 per language) and 64 queries (16 per language), all as 8 kHz/16 bit WAV audio files. For each query and utterance, we also provided the lexical transcription in UTF-8 encoding. The transcriptions are in a Romanized, transliterated form. For each utterance, the organizers provided 0-n matching queries (but not the location of the match).

There are four directories in the Spoken Web data. Audio contains the 400 audio files in the four languages. Transcripts contains the corresponding word-level transcriptions of the audio files in Roman characters. QueryAudio contains the 64 query audio terms. QueryTranscripts contains the corresponding word-level Roman transcriptions of the QueryAudio files. The file Mapping.txt shows which query is present in which audio file.

4.2 Evaluation Data
The test set consists of 200 utterances (50 per language) and 36 queries (9 per language) as audio files, with the same characteristics. As with the development data, the lexical form of the query was provided, but not the matching utterances.

The evaluation data consists of two directories. The EvaluationAudio directory contains the 200 utterance audio files, 50 for each of the four languages. EvaluationQueryAudio contains the 36 query audio files, 9 for each of the four languages.

The written form of the search queries is also provided, in the directory EvaluationQueryTranscripts (EvaluationTranscripts may be made available later). This data was provided as a "termlist" XML file, in which the "termid" corresponds to the filename of the audio query. The file was packaged together with the scoring software (see below).
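For illustration, a single entry of such a termlist file, following the conventions of the NIST Spoken Term Detection term-list format, could look roughly as follows; the attribute values, the termid, and the term text shown here are hypothetical placeholders, and the authoritative example is the one shipped with the scoring package:

    <termlist ecf_filename="sws2011_eval.ecf.xml" language="" encoding="UTF-8" version="1">
      <term termid="SWS2011_eval_query_001">
        <termtext>years</termtext>
      </term>
    </termlist>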
5. EVALUATION OF RESULTS
The ground truth was created manually by native speakers and provided by the task organizers, following the principles of NIST's Spoken Term Detection (STD) evaluations.

The primary evaluation metric was ATWV (Average Term Weighted Value), as used in the NIST 2006 Spoken Term Detection (STD) evaluation [1].

Systems can be scored using the software provided in me2011-scoring-beta2.tar.bz2, available on the FTP server. This software allowed participants to verify for themselves that the organizers could process their output for scoring, and it reports the respective figure of merit plus graphs. It also contains the reference ECF files (these were generated automatically from the text-only files described above).

The NIST-compatible files were generated automatically from simple files that do not contain timing information for the occurrences. Generous time windows for matches should allow for correct detections.
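For reference, the term-weighted value underlying this metric is defined in the NIST 2006 STD evaluation plan roughly as follows; this is a sketch of the standard definition, and the constants quoted are the usual NIST 2006 settings rather than values confirmed for the MediaEval scoring tool:

\[
\mathrm{ATWV} = 1 - \frac{1}{|Q|} \sum_{q \in Q} \left[ P_{\mathrm{miss}}(q) + \beta \, P_{\mathrm{FA}}(q) \right],
\]
\[
P_{\mathrm{miss}}(q) = 1 - \frac{N_{\mathrm{correct}}(q)}{N_{\mathrm{true}}(q)}, \qquad
P_{\mathrm{FA}}(q) = \frac{N_{\mathrm{spurious}}(q)}{T_{\mathrm{speech}} - N_{\mathrm{true}}(q)}, \qquad
\beta = \frac{C}{V} \left( \mathrm{Pr}_{\mathrm{term}}^{-1} - 1 \right),
\]

where Q is the set of evaluated query terms, N_true(q) is the number of reference occurrences of term q, N_correct(q) and N_spurious(q) are the numbers of correct and spurious detections at the chosen decision threshold, and T_speech is the total duration of the searched audio in seconds (one trial per second). With the standard settings C/V = 0.1 and Pr_term = 10^-4, beta is approximately 999.9.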
6. OUTLOOK
The Spoken Web is primarily targeted at communities that currently do not have access to the Internet. Most of these target users speak less common languages for which good speech recognition systems do not exist. Low-resource speech recognition is currently receiving a lot of attention, as exhibited by research efforts such as IARPA's Babel program. We will discuss future directions at the evaluation workshop and present the outcome in future publications.

7. ACKNOWLEDGMENTS
The organizers would like to thank Martha Larson for organizing this event, and the participants for putting in a lot of hard work. Most of the content is from low-literate farmers across various parts of India; the organizers would like to thank them for creating this content. The data was kindly collected and provided by IBM Research India.

8. REFERENCES
[1] Fiscus, J., Ajot, J., Garofolo, J., Doddington, G., 2007, "Results of the 2006 Spoken Term Detection Evaluation," Proceedings of the ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech (SSCS 2007), pp. 51-56.
[2] Kumar, A., Rajput, N., Chakraborty, D., Agarwal, S. K., Nanavati, A. A., "WWTW: The World Wide Telecom Web," NSDR 2007 (SIGCOMM workshop), Kyoto, Japan, 27 August, 2007.
[3] Ajmera, J., Joshi, A., Mukherjee, S., Rajput, N., Sahay, S., Shrivastava, M., Shrivastava, K., "Two Stream Indexing for Spoken Web Search," World Wide Web Conference, WWW 2011.
[4] "... Content on Spoken Web," CIKM 2010.
[5] http://www.multimediaeval.org/mediaeval2011/index.html
[6] http://domino.research.ibm.com/comm/research_projects.nsf/pages/pyrmeait.index.html