Zero-Cost Speech Recognition Task at MediaEval 2016

Igor Szoke, Brno University of Technology, Brno, Czech Republic, szoke@fit.vutbr.cz
Xavier Anguera, ELSA Corp., Lisboa, Portugal, xavier@elsanow.io

ABSTRACT
The main goal of the Zero-Cost Speech Recognition task is to bring researchers together on the topic of training ASR systems using only publicly available data. In particular, the task consists of the development of either an LVCSR or a subword speech recognizer for a given target language. For this year we selected Vietnamese as the target language. The organizers provided participants with several sets of publicly available data combined with one proprietary set gathered for this evaluation. Participants are free to find and use other publicly available resources (free for research purposes). These resources must be shared with the other participants by the end of July. The data-set for the evaluation is then fixed and no outside data can be used.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

1. INTRODUCTION
There are more than a thousand spoken languages. Any research lab or "speech company" that wants to develop technology in a new language usually needs to buy a speech database (audio + carefully hand-made transcripts) to get started.^1 The cost of such databases ranges from 5k to 20k EUR (provided by LDC / ELRA / Appen etc.). Alternatively, if you have enough budget, you can collect your own data and cover some specific topics, acoustic environments etc. This creates a huge gap between the "top" labs and companies that have enough budget to afford such expenditures and "the other" small players, who are limited to freely available data, tools or even complete systems.

The main goal of this task is to challenge participating teams to come up with and experiment with bootstrapping techniques that allow an initial ASR system to be trained for "free". We are interested in the exploration of techniques that allow researchers to train ASR systems on public multimedia resources (texts, audios, videos, dictionaries etc.), without the need to buy expensive (ideally any) data-sets. Participants may take inspiration from approaches used for under-resourced languages [5, 8]. There were also other initiatives close to the topic of this task: the Zero Resource Speech Challenge in 2015 [4] and The Third Frederick Jelinek Memorial Summer Workshop 2016 [2].

^1 By ASR we understand any speech tokenizer, i.e. including phoneme recognizers, word recognizers and automatic unit discovery.

2. THE ZERO-COST 2016 DATA-SET
The target language selected for this year is Vietnamese. One of the reasons why we chose Vietnamese is that it was one of the languages of OpenKWS/BABEL in 2013 [1, 7] and there are many papers reporting ASR results on it [14, 10, 11, 12, 6, 15, 13]. BUT, as the task co-organizer, provided "upper-bound" results using their BABEL system [9] to "calibrate" the results of Zero-Cost participants against the larger worldwide speech community. The BUT baseline was trained only on conversational telephone speech, without any adaptation on the target (Zero-Cost) domain.

Other reasons for choosing Vietnamese are that it is a low-resourced language with limited resources available on-line (making it more difficult for participants to "cheat" with data other than what is provided), and that it is a tonal language (with its inherent difficulties) but also a syllabic language (simpler to treat for zero-resource algorithms, where clear phoneme sequences can be seen repetitively in the data).

Task organizers provided participants with an initial set of free multimedia resources – a mix of audio data and imperfect transcripts such as audios/videos with subtitles:

• Forvo.com – Download of Vietnamese data from the Forvo.com service. It is composed of a collection of short recordings, each with one or more word pronunciations.
• Rhinospike.com – Download of Vietnamese data from the Rhinospike.com service. It is a collection of recordings ranging from one short sentence to several long sentences each.
• ELSA – Proprietary prompted data recorded with a mobile application by Vietnamese students. It contains several read sentences obtained from a book of Vietnamese quotes. This data simulates a case where a participant is able to collect a small amount of data themselves.
• Other "surprise" test data – Surprise data aiming at evaluating how robust participant systems are to new data. This data is a download of 35 YouTube videos (broadcast news, presentations, talks), mostly containing one speaker. The first 2 minutes of each video were transcribed and used as ground truth. The rest of each video was left in the test set to augment the data for possible unsupervised adaptation.

Please note that the transcripts may not match the audio 100%. In addition, any audio may contain some dropouts or noise, or some speech may be missing. This data has been preprocessed, split into Train / Devel / Test, and converted to 16kHz wav + STM references.
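For readers unfamiliar with it, STM is the plain-text reference format used by the NIST scoring tools, one segment per line: filename, channel, speaker, begin time, end time, an optional label, and the transcript. The file and speaker names below are hypothetical, purely for illustration:

  ;; <filename> <channel> <speaker> <begin-time> <end-time> [<label>] transcript
  VID_0001 1 spk01 0.00 12.34 <o,f0,male> XIN CHÀO CÁC BẠN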
Source           Train   Devel-Local   Devel   Test    SUM
Forvo.com        663.3   2.8           34.3    33.8    731.4
Rhinospike.com   122.4   2.3           10.1    7.8     140.3
ELSA             43.1    8.5           43.2    58.8    145.1
Surprise data    –       –             –       40.7    40.7
SUM              828.8   13.6          87.6    141.1   1057.5

Table 1: Distribution of data (in minutes) according to set and data source. Devel-Local is a subset of Devel and is therefore not counted into the SUM column.

In addition to the "official" data-sets, several participants have provided some free data, which we encourage the other participants to use. Apart from these, no other data may be used. The Train, Development and Test sets are all available already during system training. Participants can use them and adapt their systems on them (e.g. unsupervised adaptation on the Test set). However, reference transcripts are not provided for the development / test data, and it is not allowed to transcribe or manually analyze them.

2.1 Participants' data description
• I2R – A list of 890k Vietnamese webpage URLs.
• I2R – A Vietnamese wordlist – 80k words.
• I2R – A raw dump of the Vietnamese Wiktionary – later cleaned by I2R to 750MB of text.
• BUT – A download of Vietnamese-English subtitles [3] – 93MB of text.
• BUT – A set of Vietnamese videos and subtitles – 14 partly subtitled episodes of a Vietnamese telenovela.

3. BRIEF SUB-TASKS DESCRIPTION
Participants of the task are asked to train a speech tokenizer – LVCSR or subword – on a collection of public data (see Section 2) in the Vietnamese language. Each participant must take part in at least one sub-task.

3.1 Large vocabulary continuous speech recognition (LVCSR) sub-task
This sub-task targets full speech recognition, where the output is a sequence of recognized words. Systems are evaluated with the Word-Error-Rate (WER) metric (using the sctk scoring tool). The WER is based on the comparison of transcripts (reference and generated hypothesis) at the word level. Both transcripts should be produced in uppercase and without punctuation, hesitation markers etc. No other text normalization is done. This sub-task's main use-case scenario is in areas where a full speech transcript is needed.
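For completeness, WER follows the standard definition: after aligning the hypothesis to the reference at the word level, with S substitutions, D deletions, I insertions and N reference words,

  \mathrm{WER} = \frac{S + D + I}{N}

Note that a hypothesis with many insertions can exceed 100% WER.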
3.2 Subword speech recognition sub-task
This sub-task aims at building a "light-weight" speech recognizer. The output is a sequence of subword tokens/units. We do not define what the tokens should be: they can be phonemes, graphemes, syllables, triphones, automatically estimated units, etc. This sub-task's main use-case scenario is in areas where speech must be converted to a sequence of discrete symbols (LID, SID, KWS, topic detection, etc.). Phoneme units are used as ground truth for this sub-task.

The evaluation metric used in this sub-task is a normalized version of the Mutual Information (NMI), also called the symmetric uncertainty. It ranges between 0 and 1. When both variables, X and Y, are independent, meaning that the discovered units are completely unrelated to the reference phone labels, the result is zero. The maximum, one, is achieved when one can fully recover the phone sequence from the discovered units AND the entropy of the discovered units and of the reference phones is the same. This means that the metric penalizes systems that have too many units. The evaluation algorithm used to compute the NMI takes the timing of the discovered units into account: it first matches them to the reference ones (according to time) and then calculates the NMI.
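The paper does not spell the formula out, but the symmetric uncertainty is commonly defined as

  \mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}

where I(X;Y) is the mutual information between the discovered-unit labels X and the reference phone labels Y, and H(·) denotes entropy. This matches the behavior described above: independence gives I(X;Y) = 0 and thus NMI = 0, while NMI = 1 requires I(X;Y) = H(X) = H(Y). As a minimal sketch (not the official scoring script), the time-based matching followed by NMI can be approximated by sampling both label streams at a fixed frame rate; the (start, end, label) segment format and the 10 ms step are assumptions made for this illustration:

# Minimal sketch (not the official Zero-Cost scoring script): frame-level NMI
# between discovered units and reference phones, each given as a list of
# (start_sec, end_sec, label) segments.
from sklearn.metrics import normalized_mutual_info_score

def to_frames(segments, total_dur, step=0.01):
    """Expand (start, end, label) segments into one label per 10 ms frame."""
    n = int(round(total_dur / step))
    frames = ["<sil>"] * n  # frames not covered by any segment count as silence
    for start, end, label in segments:
        for i in range(int(start / step), min(int(end / step), n)):
            frames[i] = label
    return frames

def frame_nmi(hyp_segments, ref_segments, total_dur):
    """Time-align the two label streams by frames, then compute NMI.

    The 'arithmetic' averaging makes scikit-learn's score equal to the
    symmetric uncertainty 2*I(X;Y) / (H(X) + H(Y)).
    """
    hyp = to_frames(hyp_segments, total_dur)
    ref = to_frames(ref_segments, total_dur)
    return normalized_mutual_info_score(ref, hyp, average_method="arithmetic")

# Toy example: the discovered units match the reference phones up to renaming,
# so the score is 1.0; an unrelated unit sequence would score close to 0.0.
hyp = [(0.0, 0.5, "unit_1"), (0.5, 1.0, "unit_2")]
ref = [(0.0, 0.5, "a"), (0.5, 1.0, "b")]
print(frame_nmi(hyp, ref, total_dur=1.0))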
4. EVALUATION, SCORING AND LEADER BOARD
Participants are provided with the Training, Development and Test (evaluation) data all at once. However, they do not have reference transcripts for the Development and Test data. They can use the on-line leader board to score their systems and track development results. When the evaluations are over, the results on the Test set will be published. To make development faster and easier, we defined a Devel-Local subset and provided its ground truth to the participants, so that they can perform initial development on their systems locally.

• Devel-Local is a 1/5 subset of Devel. Participants are provided with references and scoring scripts so that they can score their system outputs on this subset. This was done to allow for quick iterations during the training period and to avoid the need to upload system outputs to the leader board too often.
• Devel is the full Devel data-set. Once participants arrive at a good enough / sufficiently improved system, they are encouraged to upload their results to the leader board and be scored on much more data. The uploaded scores are visible to all participants.
• Test is "unseen" data. It partly contains data similar to the training / devel data, but also unseen data. Participants are encouraged to adapt their systems on this data (in unsupervised ways).

Each participant has to register and submit their results to the on-line leader board (http://www.zero-cost.org/). There is no maximum limit on the number of submissions per team. For the official final scoring, each participant must define one primary submission by adding the P- prefix to their submission name, and optionally at most 5 others as contrastive systems (C1- to C5-). The submission deadline is the 12th of September 2016.

5. ACKNOWLEDGMENTS
We would like to thank the MediaEval organizers for their support and all the participants for their hard work.

6. REFERENCES
[1] OpenKWS/BABEL 2013 web: https://www.nist.gov/multimodal-information-group/openkws13-evaluation/.
[2] The Third Frederick Jelinek Memorial Summer Workshop 2016 web: http://www.clsp.jhu.edu/workshops/16-workshop/building-speech-recognition-system-from-untranscribed-data/.
[3] Vietnamese-English subtitles web: http://opus.lingfil.uu.se/.
[4] Zero Resource Speech Challenge 2015 web: http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/.
[5] L. Besacier, E. Barnard, A. Karpov, and T. Schultz. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56:85–100, 2014.
[6] N. F. Chen, S. Sivadas, B. P. Lim, H. G. Ngo, H. Xu, V. T. Pham, B. Ma, and H. Li. Strategies for Vietnamese keyword search. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 4121–4125, 2014.
[7] J. G. Fiscus and N. Chen. Overview of the NIST Open Keyword Search 2013 evaluation workshop, 2013.
[8] T. Fraga-Silva, A. Laurent, J.-L. Gauvain, L. Lamel, V. B. Le, and A. Messaoudi. Improving data selection for low-resource STT and KWS. In Proceedings of ASRU 2015.
[9] M. Karafiát, F. Grézl, M. Hannemann, and J. Černocký. BUT neural network features for spontaneous Vietnamese in BABEL. In Proceedings of ICASSP 2014, pages 5659–5663. IEEE Signal Processing Society, 2014.
[10] H. Q. Nguyen, P. Nocera, E. Castelli, and V. L. Trinh. Large vocabulary continuous speech recognition for Vietnamese, an under-resourced language. In Proceedings of SLTU 2008.
[11] T. C. Nguyen and J. Chaloupka. Phoneme set and pronouncing dictionary creation for large vocabulary continuous speech recognition of Vietnamese, pages 394–401. Berlin: Springer, 2013.
[12] S. Tsakalidis, R. Hsiao, D. Karakos, T. Ng, S. Ranjan, G. Saikumar, L. Zhang, L. Nguyen, R. M. Schwartz, and J. Makhoul. The 2013 BBN Vietnamese telephone speech keyword spotting system. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 7829–7833, 2014.
[13] N. T. Vu and T. Schultz. Vietnamese large vocabulary continuous speech recognition. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2009, Merano/Meran, Italy, December 13-17, 2009, pages 333–338, 2009.
[14] T. T. Vu, D. T. Nguyen, C. M. Luong, and J.-P. Hosom. Vietnamese large vocabulary continuous speech recognition. In Proceedings of INTERSPEECH 2005.
[15] S. Xiong, W. Guo, and D. Liu. The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination. In The 9th International Symposium on Chinese Spoken Language Processing, Singapore, September 12-14, 2014, pages 183–186, 2014.