Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge

Alef Iury Siqueira Ferreira, Gustavo dos Reis Oliveira
Federal University of Goiás, Goiânia, Brazil

Abstract
This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to advance ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects. Our method consists of fine-tuning an ASR model in a domain-specific way, applying gain normalization and selective noise insertion. The proposed method improved over the strong baseline provided on the test set in 3 of the 4 available tracks.

Keywords
speech recognition, Portuguese, prepared speech, spontaneous speech, wild data

1. Introduction

The performance of Automatic Speech Recognition systems (ASRs) has increased significantly with the development of modern neural network topologies and the use of massive amounts of data to train the models [1]. Although the accuracy of recent models has improved for high-resource languages such as English, developing ASR models for other languages remains difficult with the same technologies [2, 3]. In this scenario, Self-Supervised Learning (SSL), a method in which representations with semantic information are learned from unlabelled data, emerged as an important advance, allowing the training of deeper models with less labelled data [4, 5]. In this line of work, this paper explores the use of Wav2vec 2.0 [6], a framework for self-supervised learning of discrete representations from raw audio data.

[Figure 1: Illustration of the Wav2vec 2.0 framework [6].]

Wav2vec 2.0 (Figure 1) is inspired by previous work on unsupervised pre-training for speech recognition, namely Wav2vec [7] and Vq-Wav2vec [4]. During pre-training, the model learns speech representations by solving a contrastive task that requires identifying the correct quantized latent speech representation of a masked time step among a set of distractors. After the self-supervised pre-training, the model can be fine-tuned on labeled data for a supervised task such as ASR, by adding a randomly initialized linear projection with N classes on top of the context network and a loss function specific to the task at hand, such as CTC.
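Concretely, the contrastive objective minimized during pre-training, as defined in [6], is

\[
\mathcal{L}_m = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)}
\]

where \(\mathbf{c}_t\) is the context network output at masked time step \(t\), \(\mathbf{q}_t\) is the true quantized latent representation, \(\mathbf{Q}_t\) is the set containing \(\mathbf{q}_t\) and \(K\) distractors sampled from other masked time steps, \(\kappa\) is a temperature, and \(\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \mathbf{a}^{\top}\mathbf{b} / (\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert)\) is the cosine similarity. In [6], this term is combined with a codebook diversity penalty.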
The model shows important results for low-resource languages. In Portuguese, for example, [8] and [9] demonstrated that fine-tuning the Wav2vec 2.0 model achieves state-of-the-art (SOTA) results using only publicly available datasets.

An important aspect to consider when training an ASR model is the quality and the domain of the data [10, 11, 12]. While most of the available public datasets are composed of prepared speech [9], mostly read sentences [13, 14], the domain of real ASRs is far more complex, mainly because it is formed by spontaneous speech and different speech dialects. Quality is another issue: most ASR use cases involve high-noise environments or low-quality recording equipment, which is not addressed in most of the publicly available datasets [9].

To stimulate research that can advance the present SOTA in ASR in Portuguese, for both prepared and spontaneous speech, the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022) introduces a new baseline for ASR and a new dataset in Portuguese [9]. The Corpus of Annotated Audios (CORAA ASR), a large corpus of spontaneous and prepared speech, is composed of various subsets in Portuguese with different characteristics. The baseline achieves a Word Error Rate (WER) of 24.18% on the CORAA ASR test set, a difficult dataset containing samples with low quality, noise, and a variety of domains and dialects.

In this work, we investigate the fine-tuning of the baseline model [9] proposed by the shared task, a fine-tuned model based on Wav2vec 2.0 XLSR-53 [15], using only publicly available Portuguese datasets, including the CORAA ASR dataset. We conducted several experiments in different domains for the challenge and explored the use of selective noise insertion and audio normalization during training. This work is organized as follows: Section 2 discusses the proposed methods, Section 3 presents and discusses the obtained results, and Section 4 presents the conclusions of this work.

2. Methods

2.1. Datasets

We used several publicly available datasets in Portuguese. Besides CORAA ASR, most of them are composed of prepared speech. In general, we used all the data in the gathered datasets for training, except the dev part of CORAA ASR, as presented in Table 1. The datasets used in this work are:

• CETUC [13]: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus (https://www.linguateca.pt/cetenfolha/);
• Common Voice (CV) 7.0 [16]: a project proposed by the Mozilla Foundation with the goal of creating a large open dataset in different languages. In this project, volunteers donate and validate speech using the official site (https://commonvoice.mozilla.org/pt);
• Multilingual LibriSpeech (MLS) [14]: a massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox (https://librivox.org/). The dataset contains a total of 6k hours of transcribed data in many languages. The Portuguese set used in this work (http://www.openslr.org/94/), mostly the Brazilian variant, has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers;
• Multilingual TEDx [17]: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech;
• Corpus of Annotated Audios (CORAA ASR) v1 [9]: a publicly available dataset that contains 290.77 hours of validated audio-transcription pairs in Portuguese (mostly the Brazilian Portuguese variant), comprising five other corpora: ALIP [18], C-ORAL Brasil I [19], NURC-Recife [20], SP2010 [21] and TEDx Portuguese talks.

Table 1: Dataset splits used in this work.

Dataset         Subset                            Type                 Train      Dev (validation)   Test
Others          CETUC                             Prepared Speech      144.65h    –                  –
Others          Common Voice                      Prepared Speech      112.08h    –                  –
Others          MLS (Portuguese)                  Prepared Speech      168.34h    –                  –
Others          Multilingual TEDx (Portuguese)    Prepared Speech      152.17h    –                  –
CORAA ASR [9]   ALIP                              Spontaneous Speech   33.40h     0.99h              1.57h
CORAA ASR [9]   C-ORAL Brasil I                   Spontaneous Speech   6.54h      1.13h              1.97h
CORAA ASR [9]   NURC-Recife                       Spontaneous Speech   137.08h    1.29h              2.94h
CORAA ASR [9]   SP2010                            Spontaneous Speech   27.83h     1.13h              2.18h
CORAA ASR [9]   TEDx Portuguese                   Prepared Speech      68.67h     1.37h              2.70h
Total                                                                  850.76h    5.91h              11.36h

2.2. Experiments

Our experiments consist of fine-tuning the baseline model of [9]. For each experiment, we trained the model for 5 epochs, with a batch size of 192, using Adam [22] with a learning rate of 3e-05, warmed up for the first 400 updates and then linearly decayed for the remainder. For the experiments, we used an NVIDIA Tesla V100 32GB, an NVIDIA Tesla P100 16GB and an NVIDIA A100 80GB, depending on the type of audio pre-processing used. The code to replicate the results is available at https://github.com/alefiury/SE-R_2022_Challenge_Wav2vec2.
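The authors' actual training code lives in the repository above. Purely as an illustration, a minimal sketch of an equivalent fine-tuning setup with the Hugging Face transformers Trainer could look as follows; the checkpoint path, dataset and collator objects, and the per-device batch / gradient-accumulation split are assumptions, not details taken from the paper:

```python
# Minimal fine-tuning sketch (not the authors' code). Placeholders:
# "path/to/baseline-checkpoint", train_dataset, data_collator.
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
    TrainingArguments,
    Trainer,
)

processor = Wav2Vec2Processor.from_pretrained("path/to/baseline-checkpoint")
model = Wav2Vec2ForCTC.from_pretrained(
    "path/to/baseline-checkpoint",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

args = TrainingArguments(
    output_dir="wav2vec2-ser2022",
    num_train_epochs=5,              # as stated in Section 2.2
    learning_rate=3e-5,              # as stated in Section 2.2
    warmup_steps=400,                # warm-up for the first 400 updates
    lr_scheduler_type="linear",      # linear decay for the remainder
    per_device_train_batch_size=24,  # assumption: 24 x 8 accumulation = 192
    gradient_accumulation_steps=8,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # placeholder: preprocessed dataset
    data_collator=data_collator,     # placeholder: CTC padding collator
    tokenizer=processor.feature_extractor,
)
trainer.train()
```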
In total, we conducted five main experiments to test our methods:

• Experiment 1: Wav2vec 2.0 XLSR-53 - Base: the model was fine-tuned on the whole train set, with neither normalization nor noise addition;
• Experiment 2: Wav2vec 2.0 XLSR-53 - Norm: the model was fine-tuned on the whole train set with gain normalization. For the normalization, the mean gain of all the audios in the train set was considered;
• Experiment 3: Wav2vec 2.0 XLSR-53 - Norm and SNA: the model was fine-tuned with gain normalization and selective noise addition (see the sketch after this list). The audios were normalized considering the mean gain of all the audios in the train set, and the audios belonging to datasets considered to have a low presence of noise, namely MLS and CETUC, randomly received one of the following 5 types of noise: additive noise (music or non-speech noises from the MUSAN corpus [23]), room impulse responses [24], addition or reduction of gain, pitch shift, and Gaussian noise;
• Experiment 4: Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech: model fine-tuned starting from the final model of Experiment 2, but considering just the prepared speech data from the CORAA ASR dataset, trained for 5 more epochs;
• Experiment 5: Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech: model fine-tuned starting from the final model of Experiment 2, but considering just the spontaneous speech data from the CORAA ASR dataset, trained for 5 more epochs.
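The augmentation code is not printed in the paper, so the following is a minimal sketch of the gain normalization and selective noise addition described above, assuming mono float32 numpy audio at a fixed sample rate and pre-loaded MUSAN clips and room impulse responses; all function names and parameter ranges are our own illustrative choices, not the authors'.

```python
# Sketch of gain normalization + selective noise addition (Experiments 2-3).
# Assumptions: audio is a 1-D float32 numpy array; musan_clips and rirs are
# pre-loaded numpy arrays; parameter ranges are illustrative.
import random
import numpy as np
import librosa  # used here only for pitch shifting

SAMPLE_RATE = 16000
LOW_NOISE_DATASETS = {"mls", "cetuc"}  # datasets selected for augmentation


def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2) + 1e-12))


def normalize_gain(audio: np.ndarray, target_rms: float) -> np.ndarray:
    """Scale the clip so its RMS matches the train-set mean gain."""
    return audio * (target_rms / rms(audio))


def add_noise_snr(audio: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an additive noise clip at a given signal-to-noise ratio."""
    noise = np.resize(noise, audio.shape)  # loop/trim noise to clip length
    gain = rms(audio) / (rms(noise) * 10 ** (snr_db / 20))
    return audio + gain * noise


def augment(audio: np.ndarray, musan_clips, rirs) -> np.ndarray:
    """Apply one of the five noise types listed for Experiment 3."""
    choice = random.choice(["musan", "rir", "gain", "pitch", "gaussian"])
    if choice == "musan":   # additive music/non-speech noise from MUSAN
        return add_noise_snr(audio, random.choice(musan_clips),
                             snr_db=random.uniform(5, 20))
    if choice == "rir":     # simulate reverberation by convolving with a RIR
        return np.convolve(audio, random.choice(rirs))[: len(audio)]
    if choice == "gain":    # random addition or reduction of gain
        return audio * 10 ** (random.uniform(-12, 12) / 20)
    if choice == "pitch":   # pitch shift by a few semitones
        return librosa.effects.pitch_shift(
            audio, sr=SAMPLE_RATE, n_steps=random.uniform(-3, 3))
    return audio + np.random.normal(0, 0.005, audio.shape)  # gaussian noise


def preprocess(audio, dataset_name, target_rms, musan_clips, rirs):
    audio = normalize_gain(audio, target_rms)
    if dataset_name in LOW_NOISE_DATASETS:  # selective: low-noise sets only
        audio = augment(audio, musan_clips, rirs)
    return audio
```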
3. Results and Discussion

The shared task consists of 4 tracks, each with a domain-specific scenario that includes prepared speech and spontaneous speech. In this regard, we conducted a prior analysis (Section 3.1) using the dev set to select the best approaches based on 3 of the 4 available tracks: Mixed, Prepared Speech PT_BR and Spontaneous Speech. The best models were selected and then submitted for evaluation. Our final results are presented in Section 3.2.

3.1. Dev Set Analysis

Overall, our models did not show a large improvement in performance over the baseline model. Even though we fine-tuned a model that is considered the state of the art in Brazilian Portuguese, we suspect that the number of training epochs might have been insufficient to obtain an increase in performance, or that the baseline model might have already reached a local optimum.

Furthermore, as presented in Table 2, the model fine-tuned on prepared speech clearly improved the results on the Prepared Speech subset (and consequently the Mixed subset). The same phenomenon was not observed on the Spontaneous Speech subset. A possible explanation is that most of the data added to the train set consists of prepared speech, which might have contributed to the increase in performance in this particular domain. Another possible explanation is the low number of training epochs used to train the models.

Additionally, the noise insertion did not give a further improvement in performance. Nevertheless, the results of the SNA model on some of the noisier subsets of the CORAA ASR dataset, such as ALIP and NURC-Recife, were interesting and promising when compared to the baseline. These results are shown in Table 3.

Table 2: Dev set analysis by subset (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      4.53%   13.09%   17.73%  42.09%   14.60%  30.13%   12.87%  28.86%
Wav2vec 2.0 XLSR-53 - Base                        5.65%   15.38%   16.82%  38.66%   15.13%  30.86%   13.18%  28.94%
Wav2vec 2.0 XLSR-53 - Norm                        5.43%   14.85%   17.07%  38.06%   14.77%  30.10%   13.01%  28.28%
Wav2vec 2.0 XLSR-53 - Norm and SNA                5.26%   14.46%   17.38%  38.80%   14.77%  30.12%   13.04%  28.37%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      4.38%   12.46%   15.63%  35.27%   15.00%  30.83%   12.50%  27.35%
Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech   4.71%   13.38%   16.30%  37.44%   14.93%  30.36%   12.72%  27.89%

Table 3: Dev set analysis by dataset in CORAA ASR.

Model                                             ALIP             NURC-Recife      C-ORAL-BRASIL I  SP2010
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      15.08%  30.13%   14.02%  32.50%   20.09%  38.41%   9.21%   19.49%
Wav2vec 2.0 XLSR-53 - Base                        15.42%  30.23%   14.27%  33.32%   21.23%  40.19%   9.58%   19.72%
Wav2vec 2.0 XLSR-53 - Norm                        14.98%  29.68%   14.23%  32.42%   20.47%  38.98%   9.38%   19.36%
Wav2vec 2.0 XLSR-53 - Norm and SNA                14.97%  29.24%   14.00%  32.26%   20.68%  39.65%   9.42%   19.31%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      15.19%  30.36%   14.61%  33.78%   20.30%  39.30%   9.89%   19.88%
Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech   15.31%  29.84%   14.23%  32.67%   19.89%  37.93%   10.32%  21.01%

3.2. Final Results

Table 4 compares the baseline with our selected models on the test set. The Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech model surpassed the strong baseline in the Prepared Speech PT_BR, Prepared Speech PT_PT and Mixed tracks. As seen in the dev set results, the fact that most of the data added to the train set consists of prepared speech might have contributed to the increase in performance in this domain in both Portuguese variants. Lastly, even though we were not able to surpass the baseline model in the Spontaneous Speech track, we achieved competitive results with both submitted models.

Table 4: Test set analysis by subset (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      3.56%   11.19%   17.08%  39.75%   12.39%  26.24%   11.35%  25.85%
Wav2vec 2.0 XLSR-53 - Norm                        4.40%   12.71%   15.33%  35.31%   12.51%  26.50%   11.18%  25.25%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      3.55%   11.25%   15.16%  34.68%   12.61%  26.81%   10.98%  24.89%

3.3. Additional Experiments

After selecting and submitting the best results, we performed additional experiments to further explore our proposed methods using the dev set. We tried text correction on the outputs of the ASR models, and trained the normalized models for a longer period with early stopping, considering the prepared speech and spontaneous speech data from the CORAA ASR dataset.

The text correction was done as an additional post-processing step using a KenLM [25] language model. We used 2 different KenLM models for the different tasks: one for spontaneous speech, built using subsets of the CORAA ASR dataset containing spontaneous speech phrases, and another built from Wikipedia texts in Portuguese, as proposed by [3]. Both were 4-grams.
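The paper does not specify which decoder was paired with these LMs. As one possible setup, a minimal sketch using pyctcdecode is shown below; the vocabulary, KenLM file name, logits source, and the alpha/beta weights are illustrative placeholders, not values from the paper.

```python
# Sketch of LM-rescored CTC decoding with a 4-gram KenLM model.
# Assumptions: `logits` is a (time, vocab) numpy array of CTC outputs from
# the fine-tuned model; paths, vocab, and alpha/beta are placeholders.
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = [...]  # placeholder: symbols in the Wav2vec 2.0 tokenizer order

decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="4gram_wikipedia_pt.arpa",  # or the spontaneous-speech LM
    alpha=0.5,  # LM weight (assumed value)
    beta=1.0,   # word insertion bonus (assumed value)
)

logits = np.load("example_logits.npy")  # placeholder: ASR model output
print(decoder.decode(logits))
```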
We found that this post-processing of the ASR model predictions did not improve the dev set results on the Prepared Speech PT_BR and Spontaneous Speech tracks, as can be seen in Table 5; in fact, they were worse. One possible explanation is that some of the decoder hyper-parameters did not work well with our ASR models. Another possibility is that the 4-gram model trained on spontaneous text was built from a small amount of text, which might have decreased its performance. However, the results on the Prepared Speech PT_PT track were much better compared to the previous experiments. This result suggests that the LM might improve results when little in-domain data was used to train the Wav2vec model, since most of our training data was composed of Brazilian Portuguese audio.

Table 5: Additional experiments on the dev set. All experiments fine-tuned the Wav2vec 2.0 XLSR-53 pre-trained model (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Norm + Prepared Speech + Prepared KenLM           5.59%   14.11%   15.23%  32.36%   16.26%  30.66%   13.33%  26.95%
Norm + Prepared Speech + Spontaneous KenLM        5.83%   15.58%   16.52%  36.74%   16.73%  33.78%   13.95%  29.97%
Norm + Spontaneous Speech + Prepared KenLM        5.80%   13.94%   15.80%  32.58%   16.38%  30.75%   13.59%  27.00%
Norm + Spontaneous Speech + Spontaneous KenLM     6.05%   15.52%   16.58%  36.98%   16.91%  33.45%   14.11%  29.85%
Norm + Prepared Speech + Early Stopping           4.54%   12.96%   15.73%  35.97%   14.75%  30.16%   12.45%  27.31%
Norm + Spontaneous Speech + Early Stopping        4.59%   13.14%   15.97%  36.52%   14.66%  29.83%   12.47%  27.33%

Furthermore, as we had suspected earlier, the gain-normalized model trained on the spontaneous speech data from the CORAA ASR corpus for a longer period performed better in its respective subtrack, strengthening our hypothesis that our results did not improve in the main experiments due to a low number of training epochs.
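For completeness, a minimal sketch of the longer training run with early stopping, again using the transformers Trainer; the patience value, the WER-based model selection, and the placeholder objects (model, datasets, metric function) are illustrative assumptions, not the authors' configuration.

```python
# Sketch of longer training with early stopping on the dev set (Section 3.3).
# Assumptions: reuses the Trainer setup sketched in Section 2.2; patience and
# the WER selection metric are illustrative choices.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="wav2vec2-ser2022-early-stop",
    num_train_epochs=50,              # upper bound; early stopping ends sooner
    evaluation_strategy="epoch",      # evaluate on the CORAA ASR dev split
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,          # lower WER is better
)

trainer = Trainer(
    model=model,                      # placeholder: Experiment 2 checkpoint
    args=args,
    train_dataset=train_dataset,      # prepared or spontaneous CORAA subset
    eval_dataset=dev_dataset,         # placeholder: CORAA ASR dev split
    compute_metrics=compute_wer,      # placeholder: returns {"wer": ...}
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```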
4. Conclusions

In this work we presented our efforts to build a robust ASR model using multiple approaches, such as selective noise insertion and domain-specific fine-tuning. In our experiments we found that fine-tuning a strong baseline with additional publicly available data in multiple domains and with normalization, even for a few epochs, can improve performance. With our results we were able to improve over the strong baseline on the test set in 3 of the 4 available tracks.

As future work, we plan to train an ASR model using a dynamic noise insertion approach that does not depend on choosing specific datasets beforehand.

Acknowledgments

This research was funded by CEIA (http://centrodeia.org/) with support from the Goiás State Foundation (FAPEG grant #201910267000527). We would also like to thank Cyberlabs Group (https://cyberlabs.ai/) for the support for this work.

References

[1] J. Li, Recent advances in end-to-end automatic speech recognition, 2021. arXiv:2111.01690.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in: International Conference on Machine Learning, PMLR, 2016, pp. 173–182.
[3] I. M. Quintanilha, S. L. Netto, L. W. P. Biscainho, An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora, Journal of Communication and Information Systems 35 (2020) 230–242.
[4] A. Baevski, S. Schneider, M. Auli, vq-wav2vec: Self-supervised learning of discrete speech representations, in: International Conference on Learning Representations (ICLR), 2020. URL: https://openreview.net/pdf?id=rylwJxrYDS.
[5] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, F. Makedon, A survey on contrastive self-supervised learning, Technologies 9 (2021) 2.
[6] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 12449–12460. URL: https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf.
[7] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, in: INTERSPEECH, 2019.
[8] L. R. S. Gris, E. Casanova, F. S. de Oliveira, A. da Silva Soares, A. C. Junior, Brazilian Portuguese speech recognition using wav2vec 2.0, 2021. arXiv:2107.11414.
[9] A. C. Junior, E. Casanova, A. Soares, F. S. de Oliveira, L. Oliveira, R. C. F. Junior, D. P. P. da Silva, F. G. Fayet, B. B. Carlotto, L. R. S. Gris, S. M. Aluísio, CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese, 2021. arXiv:2110.15731.
[10] M. L. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7398–7402.
[11] T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, G. Synnaeve, Rethinking evaluation in ASR: Are our models robust enough?, arXiv preprint arXiv:2010.11745 (2020).
[12] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, et al., Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training, arXiv preprint arXiv:2104.01027 (2021).
[13] V. Alencar, A. Alcaim, LSF and LPC-derived features for large vocabulary distributed continuous speech recognition in Brazilian Portuguese, in: 2008 42nd Asilomar Conference on Signals, Systems and Computers, IEEE, 2008, pp. 1237–1241.
[14] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, R. Collobert, MLS: A large-scale multilingual dataset for speech research, in: Interspeech 2020. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826. doi:10.21437/interspeech.2020-2826.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451.
[16] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, G. Weber, Common Voice: A massively-multilingual speech corpus, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[17] E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, M. Post, The multilingual TEDx corpus for speech recognition and translation, arXiv preprint arXiv:2102.01757 (2021).
[18] S. C. L. Gonçalves, Projeto ALIP (amostra linguística do interior paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro, Estudos Linguísticos (São Paulo. 1978) 48 (2019) 276–297. URL: https://revistas.gel.org.br/estudos-linguisticos/article/view/2430. doi:10.21165/el.v48i1.2430.
[19] T. Raso, H. Mello, C-ORAL - Brasil I: Corpus de referência do português brasileiro falado informal, 2012, pp. 362–367. doi:10.1007/978-3-642-28885-2_40.
[20] M. Oliveira, Jr., NURC Digital: um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC), CHIMERA: Romance Corpora and Linguistic Studies 3 (2016) 149–174.
[21] R. B. Mendes, L. Oushiro, Mapping paulistano Portuguese: the SP2010 project, Firenze, Italy: Firenze University Press, 2012, pp. 459–463.
[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[23] D. Snyder, G. Chen, D. Povey, MUSAN: A music, speech, and noise corpus, 2015. arXiv:1510.08484.
[24] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, A study on data augmentation of reverberant speech for robust speech recognition, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224. doi:10.1109/ICASSP.2017.7953152.
[25] K. Heafield, KenLM: Faster and smaller language model queries, in: Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 187–197. URL: https://aclanthology.org/W11-2123.