<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Automatic Speech Recognition for Spontaneous and Prepared Speech &amp; Speech Emotion Recognition in Portuguese (SE&amp;R) Shared-tasks at PROPOR 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo Marcacini</string-name>
          <email>ricardo.marcacini@icmc.usp.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnaldo Candido Junior</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edresson Casanova</string-name>
          <email>edresson@usp.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Automatic Speech Recognition, Speech Emotion Recognition, Portuguese Processing.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal University of Technology - Paraná, Avenida Brasil</institution>
          ,
          <addr-line>4232, Medianeira, Paraná</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Mathematics and Computer Sciences, University of São Paulo</institution>
          ,
          <addr-line>São Carlos - SP</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>The Automatic Speech Recognition for Spontaneous and Prepared Speech &amp; Speech Emotion Recognition in Portuguese (SE&amp;R 2022) challenge is a workshop consisting of two main tracks: Automatic Speech Recognition (ASR) for spontaneous and prepared speech in Portuguese; and Speech Emotion Recognition (SER) in Portuguese. This language still demands more resources for robust speech processing. To broaden the research options, two corpora are proposed: CORAA ASR and CORAA SER. CORAA ASR contains 389 hours of spontaneous and prepared speech, while CORAA SER is a 50-minute corpus for sentiment recognition. In this work, we present an overview of the challenge, discuss the submissions and present the obtained results. The best ASR model for CORAA ASR achieved a Character Error Rate of 10.98%, while the best model for CORAA SER achieved 72.8% Macro-F1.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this work, we present the challenge Automatic Speech Recognition for Spontaneous and
Prepared Speech &amp; Speech Emotion Recognition in Portuguese (SE&amp;R 2022), a workshop
held at the 15th International Conference on Computational Processing of Portuguese
(PROPOR 2022). The workshop consisted of two main tracks: Automatic Speech Recognition
(ASR) for spontaneous and prepared speech in Portuguese; and Speech Emotion Recognition
(SER) in Portuguese.</p>
      <p>
        Our main objective in proposing this challenge was to promote research in Portuguese
audio processing. While some languages, such as English, have many available resources for audio
processing, including corpora, datasets, models and processing tools, other languages still lack
resources in this area. In particular, limited audio resource availability constrains research
in the Portuguese language. Although this scenario is gradually changing, as new corpora are
released [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], resource scarcity remains an obstacle to research.
      </p>
      <p>Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech &amp; Speech
Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online).
†These authors contributed equally.</p>
      <p>Two corpora were proposed for use in the challenge: CORAA ASR and CORAA SER. CORAA
(Corpus of Annotated Audios) is a group of resources to foster research in spoken Portuguese
processing, with a focus on the Brazilian Portuguese variant. The CORAA ASR corpus contains
389 hours of spontaneous and prepared speech, segmented at utterance level, together with
the respective transcription for each utterance. CORAA SER is a 50-minute corpus for speech
emotion recognition containing utterances and their respective polarities or valences. Three
categories were proposed: neutral; non-neutral female; non-neutral male.</p>
      <p>
        For the ASR track, participants were asked to submit models to be evaluated against the CORAA
ASR test set (11.3 hours). Researchers could use the CORAA ASR training and development sets, as
well as external corpora. A baseline model based on Wav2vec 2.0 [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] was also made available,
although participants were free to use other models.
      </p>
      <p>For the SER track, participants were asked to submit models to be evaluated against the 308 audios
belonging to the CORAA SER test set. The remainder of CORAA SER could be used to train and
validate the model. As in the ASR track, external resources could also be used. Two baselines
were provided as a starting point, allowing authors to improve these models or to use other
models.</p>
      <p>This work is organized as follows. Section 2 presents details about the SE&amp;R ASR track
and the received submissions. Section 3 contains information and results for the SE&amp;R SER track.
Section 4 presents the concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Automatic Speech Recognition</title>
      <p>Automatic Speech Recognition is a complex task that poses many challenges for speech-based
applications, for a number of reasons. A first challenge is that modern speech
modeling usually requires large amounts of data for a model to perform satisfactorily.
A second challenge is the additional complexity of spoken language compared to
the written variant. For example, utterance boundaries in spoken text are normally less clear
than in written text. A third issue is external factors, such as environmental noise and
recording quality, which also affect ASR systems. In the case of spontaneous utterances, a
fourth problem is the difficulty arising from phenomena including voice overlapping, laughter,
sentence reformulations and disfluencies (such as stuttering, filled pauses and hesitations).
Finally, a fifth challenge ASR systems must face is mapping speech
phonetics to orthographic writing systems. This mapping is demanding because of phenomena
such as loanwords, acronyms, neologisms, rare proper nouns and orthographic irregularities
regarding pronunciation.</p>
      <p>
        These challenges have slowed the adoption of voice-based interfaces and applications,
although advances in the area have been made, for example personal digital assistants and closed
caption systems for television and streaming services. However, several languages lack
the resources to build such systems, or the systems exist only as proprietary alternatives.
For the Portuguese language, open resources are becoming available. In 2020, three new datasets
were released: BRSD v2 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; Multilingual LibriSpeech (MLS), which includes Portuguese [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ];
and Common Voice version 6.17 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In 2021, the Multilingual TEDx Corpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was released. These
resources encompass more than 574 hours of audio in Portuguese. However, more data is still
needed for the ASR task, particularly regarding spontaneous speech, since the
existing resources consist mostly of prepared speech.
      </p>
      <p>
        The CORAA ASR corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the SE&amp;R 2022 ASR track are initiatives aimed at fostering speech-related
research in Portuguese processing. The corpus contains both prepared and spontaneous speech.
The latter is more challenging for systems due to the characteristics inherent to this speaking
style.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Dataset and baseline</title>
        <p>
          CORAA ASR is composed of five corpora: (a) ALIP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]; (b) C-ORAL-BRASIL I [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]; (c) Nurc-Recife
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]; (d) SP2010 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]; and (e) TeDx Talks in Portuguese. Regarding composition, TeDx Talks
consist of prepared talks, while Nurc-Recife contains both prepared and spontaneous
speech. The remaining corpora contain spontaneous speech.
        </p>
        <p>
          During CORAA ASR creation, all corpora except TeDx Talks had existing transcriptions. These
transcriptions were adapted to the ASR task by annotators who manually validated and
categorized them, indicating audio quality, presence of noise, and the presence of more than one
speaker in the audio, among other data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. When needed, automatic alignment between transcriptions and
segmented utterances was performed. Annotators also marked sentences for revision where
problems were found. TeDx Talks were transcribed for the first time. In this case, transcriptions
of numerals, acronyms, dates and loanwords, among other related phenomena, were guided by a
transcription manual specifically designed for the ASR task.
        </p>
        <p>The resulting dataset contains 289 hours of audio and transcriptions, with more than 2.7
million tokens and 58 thousand types. In total, the corpus has more than 400 thousand
segmented sentences with an average duration of 3.4 seconds. For the challenge, we categorized
the corpus into prepared and spontaneous speech. Additionally, we categorized prepared
speech into European Portuguese (approximately 4.6 hours) and Brazilian Portuguese (the
remaining audios).</p>
        <p>
          For the baseline, we used Wav2Vec 2.0 XLSR-53. The model was fine-tuned on version 1.1
of CORAA ASR. The corpus was divided into three sets: train (283.6 hours), development (5.7
hours) and test (11.6 hours). We trained the model for 40 epochs with its feature extractor frozen.
More details of the training phase can be found in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Results</title>
        <p>The ASR track was evaluated mainly using CER (Character Error Rate), although WER (Word Error Rate) is
also presented.</p>
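        <p>As an illustration of the two metrics, the following minimal Python sketch computes CER and WER from the Levenshtein (edit) distance between a reference transcription and a system hypothesis. It is not the official SE&amp;R 2022 scoring script: the function names, the whitespace tokenization and the example sentences are assumptions made for this example, and no text normalization is applied.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference item
                current[j - 1] + 1,      # insertion of a hypothesis item
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by the number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical utterance pair, not taken from CORAA ASR.
    reference = "bom dia a todos"
    hypothesis = "bom dia todos"
    print(f"CER = {cer(reference, hypothesis):.3f}")
    print(f"WER = {wer(reference, hypothesis):.3f}")
```
]]></preformat>
        <p>In this hypothetical pair the reference has 15 characters and 4 words, and the hypothesis drops the word "a", so the sketch yields a CER of 2/15 and a WER of 1/4. The same error costs far more at word level than at character level, which is one reason the two metrics are usually reported side by side.</p>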
        <p>Two teams submitted models to the ASR track: GPED and Dovahkiin. Only GPED
submitted a paper detailing its model. Overall, GPED performed better, winning in three
categories, while Dovahkiin achieved the best results on European Portuguese. We used a
strong baseline, which ranked second in two of the evaluated categories. The winning system
used an open training setup, being trained on other corpora besides CORAA ASR, and applied the
strategy of generating domain-specific models for the four proposed categories.</p>
        <p>The CERs and WERs observed tend to be higher than those of systems in other languages or on other
corpora for Portuguese. It is important to note that some of our subcorpora consist of noisy
audios, which limits system performance. Pt-br prepared speech led to the
smallest errors, as this speech style is easier to process than spontaneous speech. Pt-pt prepared
speech led to higher errors, probably due to the few audio examples for this language
variant. Spontaneous speech led to mixed results, despite being more challenging
and containing more noise. This suggests that the models adapted well to this speech style.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Speech Emotion Recognition for Brazilian Portuguese</title>
      <p>
        Speech Emotion Recognition (SER) is an increasingly relevant task for Human-Computer
Interaction [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and an active research area in information retrieval and natural language
processing. The general idea is to promote interaction between humans and machines through voice
conversation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which is potentially useful for medical applications, call centers, autonomous
vehicles, and personal digital assistants, among others. Recognizing the speaker’s emotional
state is a critical challenge for human-machine interaction because the
speech information can be interpreted in different ways according to the state of the speaker’s
voice, such as surprise, anger, joy, and sadness [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Although Speech Emotion Recognition has made promising advances in the English language,
which has many resources and labeled corpora for training models, we observe that this task is
still underexplored in the Portuguese language due to the lack of labeled corpora [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Another
important aspect is the type of speech used for Speech Emotion Recognition, which can be prepared
or spontaneous. In prepared speech, actors and actresses record their voices in a
studio, without noise and using pre-defined scripts, usually speaking the exact same phrase with
different categories of emotion. Voices are generally recorded considering six types of primary
emotions: happiness, sadness, disgust, anger, fear, and surprise. However, models trained on
these corpora usually fail in real-world applications with ambient noise and varied pronunciation
and accents. Spontaneous speech corpora are relevant in these scenarios as they exhibit the patterns
intrinsic to real-world applications. On the other hand, it is more challenging to annotate audio
segments due to the difficulty in finding various emotional states of the speaker.
      </p>
      <p>
        We aim to mitigate the lack of spontaneous speech corpora for Brazilian Portuguese. To the
best of our knowledge, we present the first initiative of an annotated corpus of spontaneous
speech for Brazilian Portuguese. We used the C-ORAL Brasil I corpus [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and its paralinguistic
annotations, such as laughter, crying, screaming, etc., to identify potential audio snippets with
an emotional state other than neutral. In addition, we also used gender metadata, such as male
and female, to identify the speaker’s gender when switching from a non-neutral to a neutral
emotional state.
      </p>
      <p>We also proposed the Brazilian Portuguese Speech Emotion Recognition (SER) Task to
encourage the first models to be trained with the CORAA SER v1 corpus. This task aims
to motivate research on SER in our community, mainly to discuss theoretical and practical
aspects of Speech Emotion Recognition, audio pre-processing, feature extraction, and machine
learning models for Brazilian Portuguese. We provide a dataset called CORAA SER version
1.0, composed of approximately 50 minutes of audio segments labeled with three classes: neutral,
non-neutral female, and non-neutral male. While the neutral class represents audio segments
with no well-defined emotional state, the non-neutral classes represent segments associated
with one of the primary emotional states in the speaker’s speech.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Baseline Models</title>
        <p>We provide a training corpus with audio segments labeled in three categories: neutral (491
audios), non-neutral-female (89 audios), and non-neutral-male (45 audios).</p>
        <p>The test set has 308 audios, organized in the same categories, with ground truth labels
made available only after the shared task: neutral (248 audios), non-neutral-female (37 audios),
and non-neutral-male (23 audios).</p>
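        <p>The class counts above are strongly skewed toward the neutral class, which helps explain why a macro-averaged F1, the score reported for the SER track, is more informative than accuracy. The following Python sketch is only an illustration under stated assumptions: the function name is ours, the always-neutral classifier is hypothetical, and the code is not the organizers' evaluation script. Only the three class counts are taken from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of zero.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Test-set class counts as reported in the text.
TEST_COUNTS = {"neutral": 248, "non-neutral-female": 37, "non-neutral-male": 23}
y_true = [label for label, count in TEST_COUNTS.items() for _ in range(count)]

# Hypothetical classifier that always predicts the majority class.
y_majority = ["neutral"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_majority):.3f}")
```
]]></preformat>
        <p>With these counts, the always-neutral classifier reaches an accuracy of about 0.81 but a macro-F1 of only about 0.30, because the two non-neutral classes each contribute an F1 of zero to the average.</p>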
        <p>
          Contestants received the original dataset (raw wave files) as well as two pre-processed
versions:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>
              Prosodic features: we use features related to physical characteristics of speech, such as
intonation, rhythm, pitch, time, loudness, etc. This type of pre-processing was traditionally
used in early versions of emotion recognition systems [
              <xref ref-type="bibr" rid="ref17">17</xref>
              ]. In total, 56 prosodic features
were made available.
            </p>
          </list-item>
          <list-item>
            <p>
              Wav2Vec features: we explore unsupervised pre-training for speech recognition to
extract features (i.e., embeddings) from the audio segments [
              <xref ref-type="bibr" rid="ref18">18</xref>
              ]. These features can be
used for training a speech emotion recognition classifier. Wav2Vec has been a
state-of-the-art deep learning model in recent years. This model is trained on a large corpus
of unlabeled audios through a noise contrastive binary classification task.
            </p>
          </list-item>
        </list>
        <p>To provide baselines for the competitors, we trained two MLP (Multilayer
Perceptron) classifiers, one for each type of audio representation (prosodic features and wav2vec embeddings).</p>
        <p>More details about the corpus are publicly available at https://github.com/rmarcacini/ser-coraa-pt-br.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results</title>
        <p>The winning team (IME-USP-FINGER) proposed a deep neural model based on pre-training
and transfer learning. In this case, pre-training and transfer learning are promising ways to
mitigate the small number of labeled audios. Moreover, each team was successful in an area
of action relevant to the SER community, such as statistical analysis of the corpus
(ICMC-EESC-FFLCH), committee evaluation (LIA-UFMS), and potential practical applications in speech
disorders (SofiaFala).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>In this work, we presented SE&amp;R 2022, a challenge to stimulate research in Portuguese speech
processing. Two tracks were defined: Automatic Speech Recognition for spontaneous and
prepared speech in Portuguese and Speech Emotion Recognition. Two corpora, CORAA ASR
and CORAA SER, were made available to participants. The best ASR model for CORAA
ASR achieved a Character Error Rate of 10.98%, while the best model for CORAA SER achieved
72.8% Macro-F1.</p>
      <p>We believe Portuguese speech processing is an important and active area, and that initiatives
like CORAA ASR and CORAA SER will help to develop it. As future work, we plan to
expand the presented corpora by collecting and annotating new audios.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by
the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and by the IBM Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V. F. S.</given-names>
            <surname>Alencar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alcaim</surname>
          </string-name>
          ,
          <article-title>Lsf and lpc - derived features for large vocabulary distributed continuous speech recognition in brazilian portuguese</article-title>
          ,
          <source>in: 2008 42nd Asilomar Conference on Signals, Systems and Computers</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1237</fpage>
          -
          <lpage>1241</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ACSSC.2008.5074614</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pratap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sriram</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <article-title>Mls: A large-scale multilingual dataset for speech research</article-title>
          ,
          <source>Interspeech 2020</source>
          (
          <year>2020</year>
          ). URL: http://dx.doi.org/10.21437/Interspeech.2020-2826. doi:
          <pub-id pub-id-type="doi">10.21437/interspeech.2020-2826</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohler</surname>
          </string-name>
          , J. Meyer, M. Henretty,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tyers</surname>
          </string-name>
          , G. Weber,
          <article-title>Common voice: A massively-multilingual speech corpus</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>4218</fpage>
          -
          <lpage>4222</lpage>
          . URL: https://www.aclweb.org/anthology/2020.lrec-1.520.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Salesky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiesner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bremerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cattoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Post</surname>
          </string-name>
          ,
          <article-title>The multilingual tedx corpus for speech recognition and translation</article-title>
          ,
          <source>CoRR abs/2102.01757</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2102.01757. arXiv:2102.01757.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , É. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Candido Junior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soares</surname>
          </string-name>
          , F. S. de Oliveira, L. Oliveira,
          <string-name>
            <given-names>R. C. F.</given-names>
            <surname>Junior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. P.</given-names>
            da
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. G.</given-names>
            <surname>Fayet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Carlotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R. S.</given-names>
            <surname>Gris</surname>
          </string-name>
          , et al.,
          <article-title>Coraa: a large corpus of spontaneous and prepared speech manually validated for speech recognition in brazilian portuguese</article-title>
          ,
          <source>arXiv preprint arXiv:2110.15731</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. C. L.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <article-title>Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do português brasileiro</article-title>
          ,
          <source>Revista Estudos Linguísticos</source>
          <volume>48</volume>
          (
          <year>2019</year>
          )
          <fpage>276</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Raso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mello</surname>
          </string-name>
          ,
          <source>C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal</source>
          , Editora UFMG, Belo Horizonte, MG,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oliveira Jr.</surname>
          </string-name>
          ,
          <article-title>Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (nurc)</article-title>
          ,
          <source>CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>149</fpage>
          -
          <lpage>174</lpage>
          . URL: https://revistas.uam.es/chimera/article/view/6519.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
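          <string-name>
            <given-names>R. B.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Oushiro</surname>
          </string-name>
          ,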
          <article-title>Mapping paulistano portuguese: the sp2010 project</article-title>
          ,
          <source>in: Proceedings of the VIIth GSCP International Conference: Speech and Corpora</source>
          , Firenze University Press, Firenze, Italy,
          <year>2012</year>
          , pp.
          <fpage>459</fpage>
          -
          <lpage>463</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Khalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Babar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Zafar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alhussain</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using deep learning techniques: A review</article-title>
          ,
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>117327</fpage>
          -
          <lpage>117345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
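          <string-name>
            <given-names>H. M.</given-names>
            <surname>Fayek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavedon</surname>
          </string-name>
          ,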
          <article-title>Evaluating deep learning architectures for speech emotion recognition</article-title>
          ,
          <source>Neural Networks</source>
          <volume>92</volume>
          (
          <year>2017</year>
          )
          <fpage>60</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>El Ayadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Karray</surname>
          </string-name>
          ,
          <article-title>Survey on speech emotion recognition: Features, classification schemes, and databases</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>44</volume>
          (
          <year>2011</year>
          )
          <fpage>572</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Torres Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Mano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ueyama</surname>
          </string-name>
          , et al.,
          <article-title>Verbo: voice emotion recognition database in portuguese language</article-title>
          ,
          <source>Journal of Computer Science</source>
          <volume>14</volume>
          (
          <year>2018</year>
          )
          <fpage>1420</fpage>
          -
          <lpage>1430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Raso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Mittmann</surname>
          </string-name>
          ,
          <article-title>The c-oral-brasil i: Reference corpus for spoken brazilian portuguese</article-title>
          ,
          <source>in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Koolagudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Vempada</surname>
          </string-name>
          ,
          <article-title>Emotion recognition from speech using global and local prosodic features</article-title>
          ,
          <source>International Journal of Speech Technology</source>
          <volume>16</volume>
          (
          <year>2013</year>
          )
          <fpage>143</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec: Unsupervised pre-training for speech recognition</article-title>
          ,
          <source>Proc. Interspeech 2019</source>
          (
          <year>2019</year>
          )
          <fpage>3465</fpage>
          -
          <lpage>3469</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>