Spoken Web Search

Nitendra Rajput
IBM Research
4, Block C, ISID Campus
Vasant Kunj, New Delhi 110070
India
rnitendra@in.ibm.com

Florian Metze
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
USA
fmetze@cs.cmu.edu

ABSTRACT
In this paper, we describe the "Spoken Web Search" Task, which was held as part of the 2011 MediaEval campaign. The purpose of this task was to perform audio search in several languages, with very few resources available in each language. The data was taken from audio content that was created in live settings and was submitted to the "spoken web" over a mobile connection.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing.

General Terms
Algorithms, Performance, Experimentation, Languages.

Keywords
Spoken Term Detection, Web Search, Spoken Web.

1. INTRODUCTION
The "spoken web search" task of MediaEval 2011 [5] involves searching for audio content within audio content, using an audio query. The task required researchers to build a language-independent audio search system that, given an audio query, can find the appropriate audio file(s) and the (approximate) location of the query term within them. Evaluation was performed using standard NIST metrics.

As a contrastive condition (i.e., a "general" run in MediaEval's terms), participants were asked to run systems not based on an audio query, as the organizers also provided the search term in lexical form.

Note that language labels and pronunciation dictionaries were not provided. The lexical form cannot be used to deduce the language in the audio-only condition. The goal of the task was primarily to compare the performance of different methods on this type of data, not so much a performance comparison geared towards different sites.

2. MOTIVATION
Imagine you want to build a simple speech recognition system, or at least a spoken term detection (STD) system, in a new dialect for which only very few audio examples are available. Maybe there is not even a written form for that dialect. Is it possible to do something useful (i.e., identify the topic of a query) using only the very limited resources available?

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

3. RELATED WORK
This task was suggested by IBM Research India and uses data provided by this group; see [2]. Previous attempts at spoken web search have mostly focused on searching through the metadata related to the audio content [3][4].

4. TASK DESCRIPTION
Participants received development and test utterances (audio data) as well as development and test (audio) queries, described in more detail below. Only the occurrence of development queries in development utterances was provided.

Participants were required to submit the following runs in the audio-only condition (i.e., without looking at the textual form of the queries):

• On the test utterances: identify which query (from the set of development queries) occurs in each utterance (0-n matches per term, i.e., not every term necessarily occurs, but multiple matches are possible)
• On the test utterances: identify which query (from the set of test queries) occurs in each utterance (0-n matches)
• On the development utterances: identify which test query occurs in each utterance (0-n matches)

The purpose of requiring these three conditions is to see how critical tuning is for the different approaches. We assume that participants already know their performance for "dev queries" on "dev utterances"; the evaluation therefore measures the performance of unseen "test queries" on previously known "dev utterances" (which could have been used for unsupervised adaptation, etc.), of known "dev queries" (for which good classifiers could have been developed) on unseen data, and of unseen queries on unseen utterances.

Optionally, participants were asked to also submit the same runs using the provided lexical form of the query, i.e., they could use existing speech recognition systems, etc., for comparison purposes.

Not every test query occurred in the data (that is why it is "0-n matches").

Participants could submit multiple systems, but had to designate one primary system. If more than one system was submitted as primary, the last one uploaded was considered "primary".

Participants were allowed to use any additional resources they might have available, as long as their use is documented in the working notes paper.

4.1 Development Data
Participants were provided with a data set that has been kindly made available by the Spoken Web team at IBM Research, India [6]. The audio content is spontaneous speech that was created over the phone in a live setting by low-literate users. While most of the audio content is related to farming practices, there are other domains as well. The data set comprises audio from four different Indian languages: English, Hindi, Gujarati and Telugu. Each data item is ca. 4-30 seconds in length. However, the language labels were intentionally not provided, either in the development or in the evaluation data set.

As already mentioned above, participants were allowed to use any additional resources they might have available, as long as their use is documented in the working notes paper.

The development set contains 400 utterances (100 per language) and 64 queries (16 per language), all as 8 kHz/16 bit WAV audio files. For each query and utterance, we also provided the lexical transcription in UTF-8 encoding. The transcriptions are in a Romanized, transliterated form. For each utterance, the organizers provided 0-n matching queries (but not the location of the match).

There are four directories in the Spoken Web data. Audio contains the 400 audio files in the four languages. Transcripts contains the corresponding word-level transcriptions of the audio files in Roman characters. QueryAudio contains the 64 query audio terms. QueryTranscripts contains the corresponding word-level Roman transcriptions of the QueryAudio files. The file Mapping.txt shows which query is present in which audio file.

4.2 Evaluation Data
The test set consists of 200 utterances (50 per language) and 36 queries (9 per language) as audio files, with the same characteristics. As with the development data, the lexical form of the query was provided, but not the matching utterances.

The evaluation data consists of two directories. The EvaluationAudio directory contains the 200 utterance audio files, 50 for each of the four languages. EvaluationQueryAudio contains the 36 query audio files, 9 for each of the four languages.

The written form of the search queries is also provided, in the directory EvaluationQueryTranscripts (EvaluationTranscripts may be made available later). This data was provided as a "termlist" XML file, in which the "termid" corresponds to the filename of the audio query. The file was packaged together with the scoring software (see below).
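For illustration, a single entry of such a termlist file, following the conventions of the NIST Spoken Term Detection term-list format, could look roughly as follows; the attribute values, the termid, and the term text shown here are hypothetical placeholders, and the authoritative example is the one shipped with the scoring package:

    <termlist ecf_filename="sws2011_eval.ecf.xml" language="" encoding="UTF-8" version="1">
      <term termid="SWS2011_eval_query_001">
        <termtext>years</termtext>
      </term>
    </termlist>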
5. EVALUATION OF RESULTS
The ground truth was created manually by native speakers and provided by the task organizers, following the principles of NIST's Spoken Term Detection (STD) evaluations.

The primary evaluation metric was ATWV (Average Term Weighted Value), as used in the NIST 2006 Spoken Term Detection (STD) evaluation [1].

Systems can be scored using the software provided in me2011-scoring-beta2.tar.bz2, available on the FTP server. This software allowed participants to verify for themselves that the organizers could process their output for scoring, and it reports the respective figure of merit plus graphs. It also contains the reference ECF files (these were generated automatically from the text-only files described above).

The NIST-compatible files were generated automatically from simple files that do not contain timing information for the occurrences. Generous time windows for matches should allow for correct detections.
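For reference, the term-weighted value underlying this metric is defined in the NIST 2006 STD evaluation plan roughly as follows; this is a sketch of the standard definition, and the constants quoted are the usual NIST 2006 settings rather than values confirmed for the MediaEval scoring tool:

\[
\mathrm{ATWV} = 1 - \frac{1}{|Q|} \sum_{q \in Q} \left[ P_{\mathrm{miss}}(q) + \beta \, P_{\mathrm{FA}}(q) \right],
\]
\[
P_{\mathrm{miss}}(q) = 1 - \frac{N_{\mathrm{correct}}(q)}{N_{\mathrm{true}}(q)}, \qquad
P_{\mathrm{FA}}(q) = \frac{N_{\mathrm{spurious}}(q)}{T_{\mathrm{speech}} - N_{\mathrm{true}}(q)}, \qquad
\beta = \frac{C}{V} \left( \mathrm{Pr}_{\mathrm{term}}^{-1} - 1 \right),
\]

where Q is the set of evaluated query terms, N_true(q) is the number of reference occurrences of term q, N_correct(q) and N_spurious(q) are the numbers of correct and spurious detections at the chosen decision threshold, and T_speech is the total duration of the searched audio in seconds (one trial per second). With the standard settings C/V = 0.1 and Pr_term = 10^-4, beta is approximately 999.9.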
6. OUTLOOK
The Spoken Web is primarily targeted at communities that currently do not have access to the Internet. Most of these target users speak less common languages for which good speech recognition systems do not exist. Low-resource speech recognition is currently receiving a lot of attention, as exhibited by research efforts such as IARPA's Babel program. We will discuss future directions at the evaluation workshop and present the outcome in future publications.

7. ACKNOWLEDGMENTS
The organizers would like to thank Martha Larson for organizing this event, and the participants for putting in a lot of hard work. Most of the content is from low-literate farmers across various parts of India; the organizers would like to thank them for creating this content. The data was kindly collected and provided by IBM Research India.

8. REFERENCES
[1] Fiscus, J., Ajot, J., Garofolo, J., Doddington, G., 2007, "Results of the 2006 Spoken Term Detection Evaluation," Proceedings of the ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech (SSCS 2007), pp. 51-56.
[2] Kumar, A., Rajput, N., Chakraborty, D., Agarwal, S. K., Nanavati, A. A., "WWTW: The World Wide Telecom Web," NSDR 2007 (SIGCOMM workshop), Kyoto, Japan, 27 August, 2007.
[3] Ajmera, J., Joshi, A., Mukherjee, S., Rajput, N., Sahay, S., Shrivastava, M., Shrivastava, K., "Two Stream Indexing for Spoken Web Search," World Wide Web Conference, WWW 2011.
[4] "... Content on Spoken Web," CIKM 2010.
[5] http://www.multimediaeval.org/mediaeval2011/index.html
[6] http://domino.research.ibm.com/comm/research_projects.nsf/pages/pyrmeait.index.html