Zero-Cost Speech Recognition Task at MediaEval 2016

Igor Szoke, Brno University of Technology, Brno, Czech Republic, szoke@fit.vutbr.cz
Xavier Anguera, ELSA Corp., Lisboa, Portugal, xavier@elsanow.io

ABSTRACT
The main goal of the Zero-Cost Speech Recognition task is to bring researchers together on the topic of training ASR systems using only publicly available data. In particular, the task consists of the development of either an LVCSR or a subword speech recognizer for a given target language. For this year we selected Vietnamese as the target language. The organizers provided participants with several sets of publicly available data combined with one proprietary set gathered for this evaluation. Participants are free to find and use other publicly available resources (free for research purposes). These resources must be shared with the other participants by the end of July. The data-set for the evaluation is then fixed and no outside data can be used.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

1. INTRODUCTION
There are more than a thousand spoken languages. Any research lab or "speech company" that wants to develop technology in a new language usually needs to buy a speech database (audio + carefully hand-made transcripts) to get started.^1 The cost of such databases ranges from 5k to 20k EUR (provided by LDC / ELRA / Appen etc.). Alternatively, if you have enough budget, you can collect your own data and cover some specific topics, acoustic environments etc. This creates a huge gap between the "top" labs and companies that have enough budget to afford such expenditures and "the other" small players, who are limited to freely available data, tools or even complete systems.

The main goal of this task is to challenge participating teams to come up with and experiment with bootstrapping techniques that allow an initial ASR system to be trained for "free". We are interested in the exploration of techniques that allow researchers to train ASR systems on public multimedia resources (texts, audios, videos, dictionaries etc.), without the need to buy expensive (ideally any) data-sets. Participants may take inspiration from approaches used for under-resourced languages [5, 8]. There were also other initiatives close to the topic of this task: the Zero Resource Speech Challenge in 2015 [4] and The Third Frederick Jelinek Memorial Summer Workshop 2016 [2].

^1 By ASR we understand any speech tokenizer, i.e. including phoneme recognizers, word recognizers and automatic unit discovery.

2. THE ZERO-COST 2016 DATA-SET
The target language selected for this year is Vietnamese. One of the reasons why we chose Vietnamese is that it was one of the languages of OpenKWS/BABEL in 2013 [1, 7] and there are many papers reporting ASR results on it [14, 10, 11, 12, 6, 15, 13]. BUT, as the task co-organizer, provided "upper-bound" results using their BABEL system [9] to "calibrate" the results of Zero-Cost participants against the larger worldwide speech community. The BUT baseline was trained only on conversational telephone speech, without any adaptation on the target (Zero-Cost) domain.

Other reasons for choosing Vietnamese are that it is a low-resourced language with limited resources available on-line (making it more difficult for participants to "cheat" with data other than what is provided), and that it is a tonal language (with its inherent difficulties) but also a syllabic language (simpler to treat for zero-resource algorithms, where clear phoneme sequences can be seen repetitively in the data).

Task organizers provided participants with an initial set of free multimedia resources – a mix of audio data and imperfect transcripts such as audios/videos with subtitles:

• Forvo.com – Download of Vietnamese data from the Forvo.com service. It is composed of a collection of short recordings, each with one or more word pronunciations.
• Rhinospike.com – Download of Vietnamese data from the Rhinospike.com service. It is a collection of recordings ranging from one short sentence to several long sentences each.
• ELSA – Proprietary prompted data recorded with a mobile application by Vietnamese students. It contains several read sentences obtained from a book of Vietnamese quotes. This data simulates a case where a participant is able to collect a small amount of data themselves.
• Other "surprise" test data – Surprise data aiming at evaluating how robust participant systems are to new data. This data is a download of 35 YouTube videos (broadcast news, presentations, talks), mostly containing one speaker. The first 2 minutes of each video were transcribed and used as ground truth. The rest of each video was left in the test set to augment the data for possible unsupervised adaptation.

Please note that the transcripts may not match the audio 100%. In addition, any audio may contain some dropouts or noise, or some speech may be missing. This data has been preprocessed, split into Train / Devel / Test, and converted to 16kHz wav + STM references.
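For readers unfamiliar with it, STM is the plain-text reference format used by the NIST scoring tools, one segment per line: filename, channel, speaker, begin time, end time, an optional label, and the transcript. The file and speaker names below are hypothetical, purely for illustration:

  ;; <filename> <channel> <speaker> <begin-time> <end-time> [<label>] transcript
  VID_0001 1 spk01 0.00 12.34 <o,f0,male> XIN CHÀO CÁC BẠN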
Source           Train   Devel-Local   Devel   Test    SUM
Forvo.com        663.3   2.8           34.3    33.8    731.4
Rhinospike.com   122.4   2.3           10.1    7.8     140.3
ELSA             43.1    8.5           43.2    58.8    145.1
Surprise data    –       –             –       40.7    40.7
SUM              828.8   13.6          87.6    141.1   1057.5

Table 1: Distribution of data (in minutes) according to set and data source. Devel-Local is a subset of Devel and is therefore not counted into the SUM column.

In addition to the "official" data-sets, several participants have provided some free data, which we encourage the other participants to use. Apart from these, no other data may be used. The Train, Development and Test sets are all available already during system training. Participants can use them and adapt their systems on them (e.g. unsupervised adaptation on the Test set). However, reference transcripts are not provided for the development / test data, and it is not allowed to transcribe or manually analyze them.

2.1 Participants' data description
• I2R – A list of 890k Vietnamese webpage URLs.
• I2R – A Vietnamese wordlist – 80k words.
• I2R – A raw dump of the Vietnamese Wiktionary – later cleaned by I2R to 750MB of text.
• BUT – A download of Vietnamese-English subtitles [3] – 93MB of text.
• BUT – A set of Vietnamese videos and subtitles – 14 partly subtitled episodes of a Vietnamese telenovela.

3. BRIEF SUB-TASKS DESCRIPTION
Participants of the task are asked to train a speech tokenizer – LVCSR or subword – on a collection of public data (see Section 2) in the Vietnamese language. Each participant must take part in at least one sub-task.

3.1 Large vocabulary continuous speech recognition (LVCSR) sub-task
This sub-task targets full speech recognition, where the output is a sequence of recognized words. Systems are evaluated with the Word-Error-Rate (WER) metric (using the sctk scoring tool). The WER is based on the comparison of transcripts (reference and generated hypothesis) at the word level. Both transcripts should be produced in uppercase and without punctuation, hesitation markers etc. No other text normalization is done. This sub-task's main use-case scenario is in areas where a full speech transcript is needed.
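For completeness, WER follows the standard definition: after aligning the hypothesis to the reference at the word level, with S substitutions, D deletions, I insertions and N reference words,

  \mathrm{WER} = \frac{S + D + I}{N}

Note that a hypothesis with many insertions can exceed 100% WER.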
3.2 Subword speech recognition sub-task
This sub-task aims at building a "light-weight" speech recognizer. The output is a sequence of subword tokens/units. We do not define what the tokens should be: they can be phonemes, graphemes, syllables, triphones, automatically estimated units, etc. This sub-task's main use-case scenario is in areas where speech must be converted to a sequence of discrete symbols (LID, SID, KWS, topic detection, etc.). Phoneme units are used as ground truth for this sub-task.

The evaluation metric used in this sub-task is a normalized version of the Mutual Information (NMI), also called the symmetric uncertainty. It ranges between 0 and 1. When both variables, X and Y, are independent, meaning that the discovered units are completely unrelated to the reference phone labels, the result is zero. The maximum, one, is achieved when one can fully recover the phone sequence from the discovered units AND the entropy of the discovered units and of the reference phones is the same. This means that the metric penalizes systems that have too many units. The evaluation algorithm used to compute the NMI takes the timing of the discovered units into account: it first matches them to the reference ones (according to time) and then calculates the NMI.
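The paper does not spell the formula out, but the symmetric uncertainty is commonly defined as

  \mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}

where I(X;Y) is the mutual information between the discovered-unit labels X and the reference phone labels Y, and H(·) denotes entropy. This matches the behavior described above: independence gives I(X;Y) = 0 and thus NMI = 0, while NMI = 1 requires I(X;Y) = H(X) = H(Y). As a minimal sketch (not the official scoring script), the time-based matching followed by NMI can be approximated by sampling both label streams at a fixed frame rate; the (start, end, label) segment format and the 10 ms step are assumptions made for this illustration:

# Minimal sketch (not the official Zero-Cost scoring script): frame-level NMI
# between discovered units and reference phones, each given as a list of
# (start_sec, end_sec, label) segments.
from sklearn.metrics import normalized_mutual_info_score

def to_frames(segments, total_dur, step=0.01):
    """Expand (start, end, label) segments into one label per 10 ms frame."""
    n = int(round(total_dur / step))
    frames = ["<sil>"] * n  # frames not covered by any segment count as silence
    for start, end, label in segments:
        for i in range(int(start / step), min(int(end / step), n)):
            frames[i] = label
    return frames

def frame_nmi(hyp_segments, ref_segments, total_dur):
    """Time-align the two label streams by frames, then compute NMI.

    The 'arithmetic' averaging makes scikit-learn's score equal to the
    symmetric uncertainty 2*I(X;Y) / (H(X) + H(Y)).
    """
    hyp = to_frames(hyp_segments, total_dur)
    ref = to_frames(ref_segments, total_dur)
    return normalized_mutual_info_score(ref, hyp, average_method="arithmetic")

# Toy example: the discovered units match the reference phones up to renaming,
# so the score is 1.0; an unrelated unit sequence would score close to 0.0.
hyp = [(0.0, 0.5, "unit_1"), (0.5, 1.0, "unit_2")]
ref = [(0.0, 0.5, "a"), (0.5, 1.0, "b")]
print(frame_nmi(hyp, ref, total_dur=1.0))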
4. EVALUATION, SCORING AND LEADER BOARD
Participants are provided with the Training, Development and Test (evaluation) data all at once. However, they do not have reference transcripts for the Development and Test data. They can use the on-line leader board to score their systems and track development results. When the evaluations are over, the results on the Test set will be published. To make development faster and easier, we defined a Devel-Local subset and provided its ground truth to the participants, so that they can perform initial development on their systems locally.

• Devel-Local is a 1/5 subset of Devel. Participants are provided with references and scoring scripts so that they can score their system outputs on this subset. This was done to allow for quick iterations during the training period and to avoid the need to upload system outputs to the leader board too often.
• Devel is the full Devel data-set. Once participants arrive at a good enough / sufficiently improved system, they are encouraged to upload their results to the leader board and be scored on much more data. The uploaded scores are visible to all participants.
• Test is "unseen" data. It partly contains data similar to the training / devel data, but also unseen data. Participants are encouraged to adapt their systems on this data (in unsupervised ways).

Each participant has to register and submit their results to the on-line leader board (http://www.zero-cost.org/). There is no maximum limit on the number of submissions per team. For the official final scoring, each participant must define one primary submission by adding the P- prefix to their submission name, and optionally at most 5 others as contrastive systems (C1- to C5-). The submission deadline is the 12th of September 2016.

5. ACKNOWLEDGMENTS
We would like to thank the MediaEval organizers for their support and all the participants for their hard work.

6. REFERENCES
[1] OpenKWS/BABEL 2013 web: https://www.nist.gov/multimodal-information-group/openkws13-evaluation/.
[2] The Third Frederick Jelinek Memorial Summer Workshop 2016 web: http://www.clsp.jhu.edu/workshops/16-workshop/building-speech-recognition-system-from-untranscribed-data/.
[3] Vietnamese-English subtitles web: http://opus.lingfil.uu.se/.
[4] Zero Resource Speech Challenge 2015 web: http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/.
[5] L. Besacier, E. Barnard, A. Karpov, and T. Schultz. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56:85–100, 2014.
[6] N. F. Chen, S. Sivadas, B. P. Lim, H. G. Ngo, H. Xu, V. T. Pham, B. Ma, and H. Li. Strategies for Vietnamese keyword search. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 4121–4125, 2014.
[7] J. G. Fiscus and N. Chen. Overview of the NIST Open Keyword Search 2013 evaluation workshop, 2013.
[8] T. Fraga-Silva, A. Laurent, J.-L. Gauvain, L. Lamel, V. B. Le, and A. Messaoudi. Improving data selection for low-resource STT and KWS. In Proceedings of ASRU 2015.
[9] M. Karafiát, F. Grézl, M. Hannemann, and J. Černocký. BUT neural network features for spontaneous Vietnamese in BABEL. In Proceedings of ICASSP 2014, pages 5659–5663. IEEE Signal Processing Society, 2014.
[10] H. Q. Nguyen, P. Nocera, E. Castelli, and V. L. Trinh. Large vocabulary continuous speech recognition for Vietnamese, an under-resourced language. In Proceedings of SLTU 2008.
[11] T. C. Nguyen and J. Chaloupka. Phoneme set and pronouncing dictionary creation for large vocabulary continuous speech recognition of Vietnamese, pages 394–401. Berlin: Springer, 2013.
[12] S. Tsakalidis, R. Hsiao, D. Karakos, T. Ng, S. Ranjan, G. Saikumar, L. Zhang, L. Nguyen, R. M. Schwartz, and J. Makhoul. The 2013 BBN Vietnamese telephone speech keyword spotting system. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 7829–7833, 2014.
[13] N. T. Vu and T. Schultz. Vietnamese large vocabulary continuous speech recognition. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2009, Merano/Meran, Italy, December 13-17, 2009, pages 333–338, 2009.
[14] T. T. Vu, D. T. Nguyen, C. M. Luong, and J.-P. Hosom. Vietnamese large vocabulary continuous speech recognition. In Proceedings of INTERSPEECH 2005.
[15] S. Xiong, W. Guo, and D. Liu. The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination. In The 9th International Symposium on Chinese Spoken Language Processing, Singapore, September 12-14, 2014, pages 183–186, 2014.