Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge

Alef Iury Siqueira Ferreira, Gustavo dos Reis Oliveira
Federal University of Goiás, Goiânia, Brazil

Abstract
This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to advance ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects. Our method consists of fine-tuning an ASR model in a domain-specific way, applying gain normalization and selective noise insertion. The proposed method improved over the strong baseline provided on the test set in 3 of the 4 available tracks.

Keywords
speech recognition, Portuguese, prepared speech, spontaneous speech, wild data

1. Introduction

The performance of Automatic Speech Recognition systems (ASRs) has increased significantly with the development of modern neural network topologies and the use of massive amounts of data to train the models [1]. Although the accuracy of recent models has improved for high-resource languages such as English, developing ASR models for other languages remains difficult with the same technologies [2, 3]. In this scenario, Self-Supervised Learning (SSL), a method in which representations with semantic information are learned from unlabelled data, emerged as an important advance, allowing the training of deeper models with less labelled data [4, 5]. In this line of work, this paper explores the use of Wav2vec 2.0 [6], a framework for self-supervised learning of discrete representations from raw audio data.

[Figure 1: Illustration of the Wav2vec 2.0 framework [6].]

Wav2vec 2.0 (Figure 1) is inspired by previous work on unsupervised pre-training for speech recognition, namely Wav2vec [7] and Vq-Wav2vec [4]. During pre-training, the model learns speech representations by solving a contrastive task that requires identifying the correct quantized latent speech representation of a masked time step among a set of distractors. After the self-supervised pre-training, the model can be fine-tuned on labeled data for a supervised task such as ASR, by adding a randomly initialized linear projection with N classes on top of the context network and a loss function specific to the task at hand, such as CTC.
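Concretely, the contrastive objective minimized during pre-training, as defined in [6], is

\[
\mathcal{L}_m = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)}
\]

where \(\mathbf{c}_t\) is the context network output at masked time step \(t\), \(\mathbf{q}_t\) is the true quantized latent representation, \(\mathbf{Q}_t\) is the set containing \(\mathbf{q}_t\) and \(K\) distractors sampled from other masked time steps, \(\kappa\) is a temperature, and \(\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \mathbf{a}^{\top}\mathbf{b} / (\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert)\) is the cosine similarity. In [6], this term is combined with a codebook diversity penalty.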
The model shows important results for low-resource languages. In Portuguese, for example, [8] and [9] demonstrated that fine-tuning the Wav2vec 2.0 model achieves state-of-the-art (SOTA) results using only publicly available datasets.

An important aspect to consider when training an ASR model is the quality and the domain of the data [10, 11, 12]. While most of the available public datasets are composed of prepared speech [9], mostly read sentences [13, 14], the domain of real ASRs is far more complex, mainly because it is formed by spontaneous speech and different speech dialects. Quality is another issue: most ASR use cases involve high-noise environments or low-quality recording equipment, which is not addressed in most of the publicly available datasets [9].

To stimulate research that can advance the present SOTA in ASR in Portuguese, for both prepared and spontaneous speech, the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022) introduces a new baseline for ASR and a new dataset in Portuguese [9]. The Corpus of Annotated Audios (CORAA ASR), a large corpus of spontaneous and prepared speech, is composed of various subsets in Portuguese with different characteristics. The baseline achieves a Word Error Rate (WER) of 24.18% on the CORAA ASR test set, a difficult dataset containing samples with low quality, noise, and a variety of domains and dialects.

In this work, we investigate the fine-tuning of the baseline model [9] proposed by the shared task, a fine-tuned model based on Wav2vec 2.0 XLSR-53 [15], using only publicly available Portuguese datasets, including the CORAA ASR dataset. We conducted several experiments in different domains for the challenge and explored the use of selective noise insertion and audio normalization during training. This work is organized as follows: Section 2 discusses the proposed methods, Section 3 presents and discusses the obtained results, and Section 4 presents the conclusions of this work.

2. Methods

2.1. Datasets

We used several publicly available datasets in Portuguese. Besides CORAA ASR, most of them are composed of prepared speech. In general, we used all the data in the gathered datasets for training, except the dev part of CORAA ASR, as presented in Table 1. The datasets used in this work are:

• CETUC [13]: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus (https://www.linguateca.pt/cetenfolha/);
• Common Voice (CV) 7.0 [16]: a project proposed by the Mozilla Foundation with the goal of creating a large open dataset in different languages. In this project, volunteers donate and validate speech using the official site (https://commonvoice.mozilla.org/pt);
• Multilingual LibriSpeech (MLS) [14]: a massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox (https://librivox.org/). The dataset contains a total of 6k hours of transcribed data in many languages. The Portuguese set used in this work (http://www.openslr.org/94/), mostly the Brazilian variant, has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers;
• Multilingual TEDx [17]: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech;
• Corpus of Annotated Audios (CORAA ASR) v1 [9]: a publicly available dataset that contains 290.77 hours of validated audio-transcription pairs in Portuguese (mostly the Brazilian Portuguese variant), comprising five other corpora: ALIP [18], C-ORAL Brasil I [19], NURC-Recife [20], SP2010 [21] and TEDx Portuguese talks.

Table 1: Dataset splits used in this work.

Dataset         Subset                            Type                 Train      Dev (validation)   Test
Others          CETUC                             Prepared Speech      144.65h    –                  –
Others          Common Voice                      Prepared Speech      112.08h    –                  –
Others          MLS (Portuguese)                  Prepared Speech      168.34h    –                  –
Others          Multilingual TEDx (Portuguese)    Prepared Speech      152.17h    –                  –
CORAA ASR [9]   ALIP                              Spontaneous Speech   33.40h     0.99h              1.57h
CORAA ASR [9]   C-ORAL Brasil I                   Spontaneous Speech   6.54h      1.13h              1.97h
CORAA ASR [9]   NURC-Recife                       Spontaneous Speech   137.08h    1.29h              2.94h
CORAA ASR [9]   SP2010                            Spontaneous Speech   27.83h     1.13h              2.18h
CORAA ASR [9]   TEDx Portuguese                   Prepared Speech      68.67h     1.37h              2.70h
Total                                                                  850.76h    5.91h              11.36h

2.2. Experiments

Our experiments consist of fine-tuning the baseline model of [9]. For each experiment, we trained the model for 5 epochs, with a batch size of 192, using Adam [22] with a learning rate of 3e-05, warmed up for the first 400 updates and then linearly decayed for the remainder. For the experiments, we used an NVIDIA Tesla V100 32GB, an NVIDIA Tesla P100 16GB and an NVIDIA A100 80GB, depending on the type of audio pre-processing used. The code to replicate the results is available at https://github.com/alefiury/SE-R_2022_Challenge_Wav2vec2.
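The authors' actual training code lives in the repository above. Purely as an illustration, a minimal sketch of an equivalent fine-tuning setup with the Hugging Face transformers Trainer could look as follows; the checkpoint path, dataset and collator objects, and the per-device batch / gradient-accumulation split are assumptions, not details taken from the paper:

```python
# Minimal fine-tuning sketch (not the authors' code). Placeholders:
# "path/to/baseline-checkpoint", train_dataset, data_collator.
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
    TrainingArguments,
    Trainer,
)

processor = Wav2Vec2Processor.from_pretrained("path/to/baseline-checkpoint")
model = Wav2Vec2ForCTC.from_pretrained(
    "path/to/baseline-checkpoint",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

args = TrainingArguments(
    output_dir="wav2vec2-ser2022",
    num_train_epochs=5,              # as stated in Section 2.2
    learning_rate=3e-5,              # as stated in Section 2.2
    warmup_steps=400,                # warm-up for the first 400 updates
    lr_scheduler_type="linear",      # linear decay for the remainder
    per_device_train_batch_size=24,  # assumption: 24 x 8 accumulation = 192
    gradient_accumulation_steps=8,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # placeholder: preprocessed dataset
    data_collator=data_collator,     # placeholder: CTC padding collator
    tokenizer=processor.feature_extractor,
)
trainer.train()
```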
In total, we conducted five main experiments to test our methods:

• Experiment 1: Wav2vec 2.0 XLSR-53 - Base: the model was fine-tuned on the whole train set, with neither normalization nor noise addition;
• Experiment 2: Wav2vec 2.0 XLSR-53 - Norm: the model was fine-tuned on the whole train set with gain normalization. For the normalization, the mean gain of all the audios in the train set was considered;
• Experiment 3: Wav2vec 2.0 XLSR-53 - Norm and SNA: the model was fine-tuned with gain normalization and selective noise addition (see the sketch after this list). The audios were normalized considering the mean gain of all the audios in the train set, and the audios belonging to datasets considered to have a low presence of noise, namely MLS and CETUC, randomly received one of the following 5 types of noise: additive noise (music or non-speech noises from the MUSAN corpus [23]), room impulse responses [24], addition or reduction of gain, pitch shift, and Gaussian noise;
• Experiment 4: Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech: model fine-tuned starting from the final model of Experiment 2, but considering just the prepared speech data from the CORAA ASR dataset, trained for 5 more epochs;
• Experiment 5: Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech: model fine-tuned starting from the final model of Experiment 2, but considering just the spontaneous speech data from the CORAA ASR dataset, trained for 5 more epochs.
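The augmentation code is not printed in the paper, so the following is a minimal sketch of the gain normalization and selective noise addition described above, assuming mono float32 numpy audio at a fixed sample rate and pre-loaded MUSAN clips and room impulse responses; all function names and parameter ranges are our own illustrative choices, not the authors'.

```python
# Sketch of gain normalization + selective noise addition (Experiments 2-3).
# Assumptions: audio is a 1-D float32 numpy array; musan_clips and rirs are
# pre-loaded numpy arrays; parameter ranges are illustrative.
import random
import numpy as np
import librosa  # used here only for pitch shifting

SAMPLE_RATE = 16000
LOW_NOISE_DATASETS = {"mls", "cetuc"}  # datasets selected for augmentation


def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2) + 1e-12))


def normalize_gain(audio: np.ndarray, target_rms: float) -> np.ndarray:
    """Scale the clip so its RMS matches the train-set mean gain."""
    return audio * (target_rms / rms(audio))


def add_noise_snr(audio: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an additive noise clip at a given signal-to-noise ratio."""
    noise = np.resize(noise, audio.shape)  # loop/trim noise to clip length
    gain = rms(audio) / (rms(noise) * 10 ** (snr_db / 20))
    return audio + gain * noise


def augment(audio: np.ndarray, musan_clips, rirs) -> np.ndarray:
    """Apply one of the five noise types listed for Experiment 3."""
    choice = random.choice(["musan", "rir", "gain", "pitch", "gaussian"])
    if choice == "musan":   # additive music/non-speech noise from MUSAN
        return add_noise_snr(audio, random.choice(musan_clips),
                             snr_db=random.uniform(5, 20))
    if choice == "rir":     # simulate reverberation by convolving with a RIR
        return np.convolve(audio, random.choice(rirs))[: len(audio)]
    if choice == "gain":    # random addition or reduction of gain
        return audio * 10 ** (random.uniform(-12, 12) / 20)
    if choice == "pitch":   # pitch shift by a few semitones
        return librosa.effects.pitch_shift(
            audio, sr=SAMPLE_RATE, n_steps=random.uniform(-3, 3))
    return audio + np.random.normal(0, 0.005, audio.shape)  # gaussian noise


def preprocess(audio, dataset_name, target_rms, musan_clips, rirs):
    audio = normalize_gain(audio, target_rms)
    if dataset_name in LOW_NOISE_DATASETS:  # selective: low-noise sets only
        audio = augment(audio, musan_clips, rirs)
    return audio
```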
3. Results and Discussion

The shared task consists of 4 tracks, each with a domain-specific scenario that includes prepared speech and spontaneous speech. In this regard, we conducted a prior analysis (Section 3.1) using the dev set to select the best approaches based on 3 of the 4 available tracks: Mixed, Prepared Speech PT_BR and Spontaneous Speech. The best models were selected and then submitted for evaluation. Our final results are presented in Section 3.2.

3.1. Dev Set Analysis

Overall, our models did not show a large improvement in performance over the baseline model. Even though we fine-tuned a model that is considered the state of the art in Brazilian Portuguese, we suspect that the number of training epochs might have been insufficient to obtain an increase in performance, or that the baseline model might have already reached a local optimum.

Furthermore, as presented in Table 2, the model fine-tuned on prepared speech clearly improved the results on the Prepared Speech subset (and consequently the Mixed subset). The same phenomenon was not observed on the Spontaneous Speech subset. A possible explanation is that most of the data added to the train set consists of prepared speech, which might have contributed to the increase in performance in this particular domain. Another possible explanation is the low number of training epochs used to train the models.

Additionally, the noise insertion did not give a further improvement in performance. Nevertheless, the results of the SNA model on some of the noisier subsets of the CORAA ASR dataset, such as ALIP and NURC-Recife, were interesting and promising when compared to the baseline. These results are shown in Table 3.

Table 2: Dev set analysis by subset (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      4.53%   13.09%   17.73%  42.09%   14.60%  30.13%   12.87%  28.86%
Wav2vec 2.0 XLSR-53 - Base                        5.65%   15.38%   16.82%  38.66%   15.13%  30.86%   13.18%  28.94%
Wav2vec 2.0 XLSR-53 - Norm                        5.43%   14.85%   17.07%  38.06%   14.77%  30.10%   13.01%  28.28%
Wav2vec 2.0 XLSR-53 - Norm and SNA                5.26%   14.46%   17.38%  38.80%   14.77%  30.12%   13.04%  28.37%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      4.38%   12.46%   15.63%  35.27%   15.00%  30.83%   12.50%  27.35%
Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech   4.71%   13.38%   16.30%  37.44%   14.93%  30.36%   12.72%  27.89%

Table 3: Dev set analysis by dataset in CORAA ASR.

Model                                             ALIP             NURC-Recife      C-ORAL-BRASIL I  SP2010
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      15.08%  30.13%   14.02%  32.50%   20.09%  38.41%   9.21%   19.49%
Wav2vec 2.0 XLSR-53 - Base                        15.42%  30.23%   14.27%  33.32%   21.23%  40.19%   9.58%   19.72%
Wav2vec 2.0 XLSR-53 - Norm                        14.98%  29.68%   14.23%  32.42%   20.47%  38.98%   9.38%   19.36%
Wav2vec 2.0 XLSR-53 - Norm and SNA                14.97%  29.24%   14.00%  32.26%   20.68%  39.65%   9.42%   19.31%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      15.19%  30.36%   14.61%  33.78%   20.30%  39.30%   9.89%   19.88%
Wav2vec 2.0 XLSR-53 - Norm + Spontaneous Speech   15.31%  29.84%   14.23%  32.67%   19.89%  37.93%   10.32%  21.01%

3.2. Final Results

Table 4 compares the baseline with our selected models on the test set. The Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech model surpassed the strong baseline in the Prepared Speech PT_BR, Prepared Speech PT_PT and Mixed tracks. As seen in the dev set results, the fact that most of the data added to the train set consists of prepared speech might have contributed to the increase in performance in this domain in both Portuguese variants. Lastly, even though we were not able to surpass the baseline model in the Spontaneous Speech track, we achieved competitive results with both submitted models.

Table 4: Test set analysis by subset (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Wav2vec 2.0 XLSR-53 Baseline                      3.56%   11.19%   17.08%  39.75%   12.39%  26.24%   11.35%  25.85%
Wav2vec 2.0 XLSR-53 - Norm                        4.40%   12.71%   15.33%  35.31%   12.51%  26.50%   11.18%  25.25%
Wav2vec 2.0 XLSR-53 - Norm + Prepared Speech      3.55%   11.25%   15.16%  34.68%   12.61%  26.81%   10.98%  24.89%

3.3. Additional Experiments

After selecting and submitting the best results, we performed additional experiments to further explore our proposed methods using the dev set. We tried text correction on the outputs of the ASR models, and trained the normalized models for a longer period with early stopping, considering the prepared speech and spontaneous speech data from the CORAA ASR dataset.

The text correction was done as an additional post-processing step using a KenLM [25] language model. We used 2 different KenLM models for the different tasks: one for spontaneous speech, built using subsets of the CORAA ASR dataset containing spontaneous speech phrases, and another built from Wikipedia texts in Portuguese, as proposed by [3]. Both were 4-grams.
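The paper does not specify which decoder was paired with these LMs. As one possible setup, a minimal sketch using pyctcdecode is shown below; the vocabulary, KenLM file name, logits source, and the alpha/beta weights are illustrative placeholders, not values from the paper.

```python
# Sketch of LM-rescored CTC decoding with a 4-gram KenLM model.
# Assumptions: `logits` is a (time, vocab) numpy array of CTC outputs from
# the fine-tuned model; paths, vocab, and alpha/beta are placeholders.
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = [...]  # placeholder: symbols in the Wav2vec 2.0 tokenizer order

decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="4gram_wikipedia_pt.arpa",  # or the spontaneous-speech LM
    alpha=0.5,  # LM weight (assumed value)
    beta=1.0,   # word insertion bonus (assumed value)
)

logits = np.load("example_logits.npy")  # placeholder: ASR model output
print(decoder.decode(logits))
```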
We found that this post-processing of the ASR model predictions did not improve the dev set results on the Prepared Speech PT_BR and Spontaneous Speech tracks, as can be seen in Table 5; in fact, they were worse. One possible explanation is that some of the decoder hyper-parameters did not work well with our ASR models. Another possibility is that the 4-gram model trained on spontaneous text was built from a small amount of text, which might have decreased its performance. However, the results on the Prepared Speech PT_PT track were much better compared to the previous experiments. This result suggests that the LM might improve results when little in-domain data was used to train the Wav2vec model, since most of our training data was composed of Brazilian Portuguese audio.

Table 5: Additional experiments on the dev set. All experiments fine-tuned the Wav2vec 2.0 XLSR-53 pre-trained model (Prep. = Prepared Speech, Spont. = Spontaneous Speech).

Model                                             Prep. PT_BR      Prep. PT_PT      Spont.           Mixed
                                                  CER     WER      CER     WER      CER     WER      CER     WER
Norm + Prepared Speech + Prepared KenLM           5.59%   14.11%   15.23%  32.36%   16.26%  30.66%   13.33%  26.95%
Norm + Prepared Speech + Spontaneous KenLM        5.83%   15.58%   16.52%  36.74%   16.73%  33.78%   13.95%  29.97%
Norm + Spontaneous Speech + Prepared KenLM        5.80%   13.94%   15.80%  32.58%   16.38%  30.75%   13.59%  27.00%
Norm + Spontaneous Speech + Spontaneous KenLM     6.05%   15.52%   16.58%  36.98%   16.91%  33.45%   14.11%  29.85%
Norm + Prepared Speech + Early Stopping           4.54%   12.96%   15.73%  35.97%   14.75%  30.16%   12.45%  27.31%
Norm + Spontaneous Speech + Early Stopping        4.59%   13.14%   15.97%  36.52%   14.66%  29.83%   12.47%  27.33%

Furthermore, as we had suspected earlier, the gain-normalized model trained on the spontaneous speech data from the CORAA ASR corpus for a longer period performed better in its respective subtrack, strengthening our hypothesis that our results did not improve in the main experiments due to a low number of training epochs.
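For completeness, a minimal sketch of the longer training run with early stopping, again using the transformers Trainer; the patience value, the WER-based model selection, and the placeholder objects (model, datasets, metric function) are illustrative assumptions, not the authors' configuration.

```python
# Sketch of longer training with early stopping on the dev set (Section 3.3).
# Assumptions: reuses the Trainer setup sketched in Section 2.2; patience and
# the WER selection metric are illustrative choices.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="wav2vec2-ser2022-early-stop",
    num_train_epochs=50,              # upper bound; early stopping ends sooner
    evaluation_strategy="epoch",      # evaluate on the CORAA ASR dev split
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,          # lower WER is better
)

trainer = Trainer(
    model=model,                      # placeholder: Experiment 2 checkpoint
    args=args,
    train_dataset=train_dataset,      # prepared or spontaneous CORAA subset
    eval_dataset=dev_dataset,         # placeholder: CORAA ASR dev split
    compute_metrics=compute_wer,      # placeholder: returns {"wer": ...}
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```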
4. Conclusions

In this work we presented our efforts to build a robust ASR model using multiple approaches, such as selective noise insertion and domain-specific fine-tuning. In our experiments we found that fine-tuning a strong baseline with additional publicly available data in multiple domains and with normalization, even for a few epochs, can improve performance. With our results we were able to improve over the strong baseline on the test set in 3 of the 4 available tracks.

As future work, we plan to train an ASR model using a dynamic noise insertion approach that does not depend on choosing specific datasets beforehand.

Acknowledgments

This research was funded by CEIA (http://centrodeia.org/) with support from the Goiás State Foundation (FAPEG grant #201910267000527). We would also like to thank Cyberlabs Group (https://cyberlabs.ai/) for the support for this work.

References

[1] J. Li, Recent advances in end-to-end automatic speech recognition, 2021. arXiv:2111.01690.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in: International Conference on Machine Learning, PMLR, 2016, pp. 173–182.
[3] I. M. Quintanilha, S. L. Netto, L. W. P. Biscainho, An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora, Journal of Communication and Information Systems 35 (2020) 230–242.
[4] A. Baevski, S. Schneider, M. Auli, vq-wav2vec: Self-supervised learning of discrete speech representations, in: International Conference on Learning Representations (ICLR), 2020. URL: https://openreview.net/pdf?id=rylwJxrYDS.
[5] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, F. Makedon, A survey on contrastive self-supervised learning, Technologies 9 (2021) 2.
[6] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 12449–12460. URL: https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf.
[7] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, in: INTERSPEECH, 2019.
[8] L. R. S. Gris, E. Casanova, F. S. de Oliveira, A. da Silva Soares, A. C. Junior, Brazilian Portuguese speech recognition using wav2vec 2.0, 2021. arXiv:2107.11414.
[9] A. C. Junior, E. Casanova, A. Soares, F. S. de Oliveira, L. Oliveira, R. C. F. Junior, D. P. P. da Silva, F. G. Fayet, B. B. Carlotto, L. R. S. Gris, S. M. Aluísio, CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese, 2021. arXiv:2110.15731.
[10] M. L. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7398–7402.
[11] T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, G. Synnaeve, Rethinking evaluation in ASR: Are our models robust enough?, arXiv preprint arXiv:2010.11745 (2020).
[12] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, et al., Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training, arXiv preprint arXiv:2104.01027 (2021).
[13] V. Alencar, A. Alcaim, LSF and LPC-derived features for large vocabulary distributed continuous speech recognition in Brazilian Portuguese, in: 2008 42nd Asilomar Conference on Signals, Systems and Computers, IEEE, 2008, pp. 1237–1241.
[14] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, R. Collobert, MLS: A large-scale multilingual dataset for speech research, in: Interspeech 2020. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826. doi:10.21437/interspeech.2020-2826.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451.
[16] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, G. Weber, Common Voice: A massively-multilingual speech corpus, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[17] E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, M. Post, The multilingual TEDx corpus for speech recognition and translation, arXiv preprint arXiv:2102.01757 (2021).
[18] S. C. L. Gonçalves, Projeto ALIP (amostra linguística do interior paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro, Estudos Linguísticos (São Paulo. 1978) 48 (2019) 276–297. URL: https://revistas.gel.org.br/estudos-linguisticos/article/view/2430. doi:10.21165/el.v48i1.2430.
[19] T. Raso, H. Mello, C-ORAL - Brasil I: Corpus de referência do português brasileiro falado informal, 2012, pp. 362–367. doi:10.1007/978-3-642-28885-2_40.
[20] M. Oliveira, Jr., NURC Digital: um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC), CHIMERA: Romance Corpora and Linguistic Studies 3 (2016) 149–174.
[21] R. B. Mendes, L. Oushiro, Mapping paulistano Portuguese: the SP2010 project, Firenze, Italy: Firenze University Press, 2012, pp. 459–463.
[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[23] D. Snyder, G. Chen, D. Povey, MUSAN: A music, speech, and noise corpus, 2015. arXiv:1510.08484.
[24] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, A study on data augmentation of reverberant speech for robust speech recognition, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224. doi:10.1109/ICASSP.2017.7953152.
[25] K. Heafield, KenLM: Faster and smaller language model queries, in: Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 187–197. URL: https://aclanthology.org/W11-2123.