<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning, Transfer Learning and Data Augmentation Techniques Applied to Speech Emotion Recognition in SE&amp;R 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caroline Alves</string-name>
          <email>carolalves@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Carlotto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Dias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anátale Garcia</string-name>
          <email>anatale.garcia@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Gianesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renan Izaias</string-name>
          <email>renan.izaias@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Luiza de Morais</string-name>
          <email>marialuizamorais@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paula de Oliveira</string-name>
          <email>paulamarindeoliveira@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinícius G. Santos</string-name>
          <email>vinicius.santos@alumni.usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Sicoli</string-name>
          <email>rafael.pac90@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flaviane R. Fernandes Svartman</string-name>
          <email>flavianesvartman@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Aluisio</string-name>
          <email>sandra@icmc.usp.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sidney Leal</string-name>
          <email>sidleal@gmail.com</email>
        </contrib>
        <kwd-group>
          <kwd>Deep Learning</kwd>
          <kwd>Transfer Learning</kwd>
          <kwd>Data Augmentation</kwd>
          <kwd>Speech Emotion Recognition</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Letras Clássicas e Vernáculas</institution>, <addr-line>FFLCH-USP</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Engenharia Mecatrônica</institution>, <addr-line>EESC-USP</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>In this work, our team ICMC-EESC-FFLCH explores several techniques to address data scarcity and imbalance in the SE&amp;R 2022 shared task dedicated to speech emotion recognition (SER). We evaluate two types of transfer learning models: (i) multi-task learning, in which two tasks are learned simultaneously, and (ii) sequential transfer learning, in which the tasks are learned one after the other. In both models, the auxiliary task is gender classification from speech, using a large dataset with almost 145 hours of speech signals. As for the techniques to balance the training data, we used SMOTE (Synthetic Minority Over-sampling Technique) and Praat's Change gender command to over-sample the minority classes. Our sequential transfer learning architecture, using the two baseline feature sets provided by the shared task (prosodic audio features and embeddings generated by the Wav2Vec 2.0 model) and the two approaches to balance the training dataset, reaches a satisfactory performance of 0.5353 F1-macro, surpassing the prosodic features baseline. On the other hand, our multi-task learning approach using the two baseline feature sets and the SMOTE approach to balance the training dataset reaches only 0.5301 F1-macro. Finally, our worst result is 0.469 F1-macro, obtained in the feature selection experiment (29 prosodic features manually chosen from the literature), using our multi-task learning architecture with the two approaches to balance the training dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], speech emotion recognition (SER) systems are composed of methods, namely
feature extraction and emotion classification, that process and classify speech signals to detect
the embedded emotions of speech. They can also include a preprocessing step before the
extraction of the features used to normalize the signals, for example, the use of noise reduction
Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech &amp; Speech
Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online).
nEvelop-O
techniques. Emotion classes depend on labeled data of the dataset used to create the model; these
datasets can be of three types: acted, elicited or natural. While most of the natural datasets are
from spontaneous speech recorded in noisy environments, acted speech databases are recorded
by professional actors in sound-proof studios. Elicited speech datasets are created by placing
speakers in a simulated emotional situation that can stimulate various emotions and can be
close to real ones. It is important to notice that, the definition of emotion is an open problem
in psychology and there are two models being used in SER systems: discrete and dimensional
emotional models. The first one is based on the six primary and culturally independent categories
of basic emotions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: sadness, happiness, fear, anger, disgust, and surprise, where other emotions
are obtained by the combination of the basic ones. Most of the existing SER systems focus on
all these basic emotional categories, sometimes including the neutral category (see, for example,
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a study focusing on Portuguese language), or in a small group of those emotions1. The
second one, the dimensional emotional model, uses a small number of latent dimensions to
define emotions such as: valence, arousal/excitation, control/power. In this model, emotions are
not independent of each other, instead, they are analogous to each other in a systematic way.
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] support of the thesis that the three dimensions of pleasure-displeasure (valence),
arousalnonarousal (excitation), and dominance-submissiveness (power/control) are both necessary and
suficient to describe a large variety of emotional states. Specifically, valence describes whether
an emotion is positive or negative, and it ranges between unpleasant and pleasant; excitation
defines the strength of the felt emotion, ranging from boredom to frantic excitement; and the
dimension of control/power refers to the seeming strength of the person (between weak and
strong). For example, the third dimension diferentiates anger from fear by considering the
strength or weakness of the person, respectively; however, as the surprise emotion may have
positive or negative valence depending on the context, it is dificult to categorize.
      </p>
      <p>
        Whereas most studies on SER deal with simulated, noise-free datasets recorded in sound-proof
studios [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], SE&amp;R 2022 brings a small dataset of approximately 50 minutes, with 625 audio
segments (training dataset) from the C-ORAL-BRASIL I corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], consisting of audio segments
representing Brazilian Portuguese informal spontaneous speech, recorded in natural contexts
and noisy environments.
      </p>
      <p>
        The two baseline feature sets (prosodic audio features for emotion classification [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and
embeddings generated by the Wav2Vec 2.0 model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) made available for SE&amp;R 2022 were used
in this work. Feature selection was also evaluated, focusing on four small prosodic feature
sets, manually chosen, with 29, 19, 10, and 8 features, taken from the pitch, intensity, and spectrum
groups of features. While the first SER systems used machine learning methods with careful
feature engineering (see several examples in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), recent approaches use ensembles to learn
hybrid acoustic features [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and deep learning architectures, such as multi-task learning
[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], attention mechanisms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and transfer learning approaches [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Our contribution to SE&amp;R 2022 explores two architectures based on deep neural networks (DNNs) aimed at speech emotion recognition in Portuguese audio files. Our proposal evaluates two types of inductive transfer learning: multi-task [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and sequential transfer learning [17]. In both models, the auxiliary task is gender classification from speech2. Since DNN-based classifiers suffer from a generalization error problem when trained on limited datasets, we explore two different data augmentation techniques aimed at balancing the training data. We used SMOTE [18] to create synthetic data for the minority classes and Praat's [19] Change gender command to manipulate the acoustic features in order to create new synthetic data based on the pre-existing ones. The Jupyter notebooks and a characterization of the training dataset are publicly available at https://github.com/BrunoBaldissera/ser-transfer.
      </p>
      <p>1There are large lists of datasets used for emotion recognition in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].</p>
      <p>2Project’s GitHub: https://github.com/BrunoGianesi/Speaker-Gender-Recognition.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Experimental Framework</title>
      <p>First, we present the original dataset for the main task and the dataset used for the auxiliary task of gender classification from speech in both inductive transfer learning architectures (Section 2.1), noting that the original dataset is unbalanced. Therefore, we applied two techniques for data augmentation (Section 2.2). Section 2.3 presents the feature sets we explored in our linguistically motivated selection of prosodic features, based on the literature. Finally, Section 2.4 presents our multi-task and sequential transfer learning architectures.
2.1. Datasets
2.1.1 Primary Task Dataset: official dataset of the SE&amp;R shared task on SER. In the SE&amp;R 2022 shared task on SER, the audio segments are labeled in three classes: neutral, non-neutral female, and non-neutral male. The neutral class is the majority class (491 samples) and is used to label audio segments with no well-defined emotional state, while the non-neutral classes label segments (89 non-neutral-female and 45 non-neutral-male) associated with one of the primary emotional states in the speaker’s speech. In order to better understand the training dataset used in this study, seven annotators from our group carried out a qualitative analysis of the dataset. They labeled every audio in the training set with “yes” (meaning presence) or “no” (meaning absence) according to the following categories:
• Noise: any sort of noise not related to the primary voice(s)3, e.g., background chatting, microphone hiss, music, children’s voices, etc.;
• Voice overlapping: periods in which two primary voices were speaking at the exact same moment;
• Different gender: the presence of more than one perceived gender in the primary voices of the same audio; and
• Voices in sequence: the presence of more than one primary voice in the same audio, but without direct overlapping between them.</p>
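      <p>The class counts above can be turned into imbalance ratios directly; a minimal sketch (the label strings are ours, not the shared task's identifiers):</p>
      <preformat>
```python
from collections import Counter

# Training-set label counts reported for the SE&R 2022 SER task.
counts = Counter({"neutral": 491, "non-neutral-female": 89, "non-neutral-male": 45})

total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} samples ({n / total:.1%}), "
          f"imbalance ratio vs. majority = {counts['neutral'] / n:.1f}:1")
```
      </preformat>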
      <p>Our evaluation is summarized in Figures 1a and 1b. As we can see, there is a lot of noisy audio. Although noise is not a problem for the auxiliary task (audio gender classification) [20] of the neural architectures, only an error analysis can identify possible problems for the SER task as a whole. Also, two complex problems were found: a high rate of voice overlapping and audios with different genders, which we believe may have an impact on the classification of the two non-neutral classes (male and female). Of the 26 non-neutral audios that have different genders, 24 have overlapping voices and only 2 have voices in sequence. Of the 56 neutral audios that have different genders, 53 have overlapping voices and only 3 have voices in sequence.
3We consider primary voices to be the loudest, and secondary voices the least prominent in the audio.
2.1.2 Auxiliary Task Dataset: CETUC. The task of classifying gender based on voice automatically identifies a voice as male or female, based on the audio features. The gender identification of a given speaker was implemented in an undergraduate project of one of the authors [20] to evaluate machine learning methods, such as decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptrons, and logistic regression, and to compare the use of distinct features and models applied to different datasets. In addition, the study also assessed whether the models generalize to other contexts, such as other languages (English) or noisy environments, when trained on the CETUC dataset [21], which was recorded in a controlled environment.</p>
      <p>The best-performing method (gradient boosting) was trained using the large CETUC dataset, with almost 145 hours of speech signals spoken by 50 male and 50 female speakers4, each one pronouncing 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus5. The best-performing model used three sets of features from the audio signals, totalling 44 features: (i) 12 statistics extracted from the highest frequency value, after applying the Fourier transform to the audios, divided into time windows of 0.2 seconds, (ii) 12 fundamental frequency (F0) statistics, and (iii) 20 MFCCs (Mel-Frequency Cepstral Coefficients); it reached an accuracy of 94.1%. This model was able to generalize well to noisy audio, reaching an accuracy of 90.8% on the MLS test set [22] with noise.</p>
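      <p>The first of these feature groups can be sketched as follows, under a straightforward reading of the description (the exact 12 statistics computed from the resulting track are not reproduced here):</p>
      <preformat>
```python
import numpy as np

def peak_freq_per_window(signal, sr, win_s=0.2):
    """Dominant-frequency track: for each 0.2 s window, the frequency bin
    with the highest magnitude after a Fourier transform (first CETUC
    feature group; summary statistics would then be taken over this track)."""
    win = int(sr * win_s)
    n_windows = len(signal) // win
    peaks = []
    for i in range(n_windows):
        frame = signal[i * win:(i + 1) * win]
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(win, d=1.0 / sr)
        peaks.append(freqs[np.argmax(spectrum)])
    return np.array(peaks)

# Example: a 1 s, 200 Hz sine at 16 kHz; every window should peak at 200 Hz.
sr = 16000
t = np.arange(sr) / sr
track = peak_freq_per_window(np.sin(2 * np.pi * 200 * t), sr)
print(track)  # five windows, each 200.0
```
      </preformat>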
      <sec id="sec-3-1">
        <title>2.2. Data Augmentation Approaches: SMOTE and Praat’s Change Gender</title>
        <p>We used two approaches to balance the training dataset, applied specifically to audios of the non-neutral male and non-neutral female classes: SMOTE [18] and Praat’s Change gender command [19].
4https://igormq.github.io/datasets/
5https://www.linguateca.pt/cetenfolha/index_info.html</p>
        <p>The authors of the original SMOTE paper suggest that performing a random under-sampling of the majority class followed by over-sampling of the minority class tends to yield good results. However, in this work we have only over-sampled the minority classes, following the work by [23], and using the technique in its simplest implementation. Nonetheless, as the synthesis of new data with SMOTE uses a linear combination of randomly chosen neighbors of the underrepresented instances in the feature space, rather than just replicating the given instances, we focused on this augmentation approach instead of simple oversampling (even though a number of such tests were performed). We used the Python imbalanced-learn package [24]; all parameters were kept at their defaults.</p>
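        <p>In this work we used imbalanced-learn's SMOTE with default parameters; purely to illustrate the linear-combination idea mentioned above, a minimal numpy sketch (not the library's implementation) might look like:</p>
        <preformat>
```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic point is a random linear
    interpolation between a minority instance and one of its k nearest
    minority neighbours in the feature space."""
    rng = np.random.default_rng(rng)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from X_min[i] to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                           # position on the segment
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Illustrative shapes: 45 samples (e.g. non-neutral-male) with 56 features,
# over-sampled up to the 491 samples of the majority class.
X_min = np.random.default_rng(0).normal(size=(45, 56))
X_new = smote_sample(X_min, n_new=446, rng=0)
print(X_new.shape)  # (446, 56)
```
        </preformat>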
        <p>Praat’s Change gender command allows us to manipulate acoustic features to create new synthetic data based on the pre-existing ones. Through this method, we can change the perceived gender of a given voice into the opposite gender. This second method for data augmentation consists in using the algorithm for gender conversion available in the acoustic analysis software Praat. A total of 133 files were used, 45 of them containing male voices, converted to female ones, and 88 containing female voices, converted to male ones6. The task was undertaken by five annotators and had two phases: attribution of parameters for conversion and quality evaluation of the generated voice. In the quality assessment phase, the annotators changed the previously established default values in order to obtain voices they judged as natural as possible. For the conversion process, we first defined the frequency range in which the algorithm parameters were applied, using the values predefined by the program: a minimum pitch value of 75 Hz and a maximum of 600 Hz. The algorithm has four parameters, described below, of which we used only the first two:
• Formant shift ratio (default value: 1.0) determines the ratio for proportionally modifying the value of the formants, i.e., the sound frequency values at which the highest peaks of intensity occur, resulting from the resonance of the sound wave in its path through the vocal tract, from its production in the vocal folds until the moment of emission. A factor of 1.0 means no alteration. For the task, we established the factor value 1.1 as the standard for male-to-female conversion, used in 30 of 45 files, and 0.8 for female-to-male conversion, used in 72 of 88 files. As mentioned above, these values were altered in some files in order to maintain a perceived natural quality of the converted voice: for the other 15 male-to-female converted files, factors between 1.15 and 1.2 were used, and for the other 16 female-to-male converted files, values between 0.85 and 0.9.
• New pitch median (default value: 0.0): a new median for the pitch values is established for each file, which, in turn, is used to compose a factor expressed by the ratio between this new median and the original pitch median. This factor is then used by the algorithm to multiply the original pitch values to obtain new values. For this parameter, the value 0.0 represents the default setting, yielding a factor of 1.0, which means no alteration. We established as standard values 300 Hz for male-to-female conversion, used for 35 of 45 files, and 140 Hz for female-to-male conversion, used for 58 of 88 files. These values were also altered in some files to achieve a convincing result: for male-to-female conversion, values between 250 Hz and 380 Hz were used for the other 10 files, and for female-to-male conversion, values between 80 Hz and 260 Hz were used for the other 30 files.
• Pitch range factor (default value: 1.0) provides an additional modification of pitch through an extra scaling of the values around the new pitch median obtained in the previous step. A factor of 1.0 means that no additional pitch modification will occur, and a factor of 0.0 monotonizes the new sound to the new pitch median. Considering the essential goal of the project, the default value was kept and no modification of the pitch range was applied.
• Duration factor (default value: 1.0) establishes a factor used for lengthening the sound file. For a factor below 1.0, the resulting sound will be shorter than the original, and a value higher than 3.0 will not work. The default value provided by the software was also maintained, as a change in the duration of the sound was deemed unnecessary for this task.
6For one of the audios, the algorithm could not produce a successful conversion.</p>
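        <p>The ratio logic described for the New pitch median parameter can be written down directly; a sketch of the stated computation (the 120 Hz original median below is an invented example value):</p>
        <preformat>
```python
def pitch_scaling_factor(original_median_hz, new_median_hz):
    """Factor described for Praat's Change gender command: the ratio between
    the requested new pitch median and the sound's original median, which
    then multiplies every pitch value (0.0 requests 'no change', factor 1.0)."""
    if new_median_hz == 0.0:          # the default: keep the original pitch
        return 1.0
    return new_median_hz / original_median_hz

# Male-to-female standard used here: new median 300 Hz on a voice whose
# original median is, say, 120 Hz, so all pitch values are scaled by 2.5.
print(pitch_scaling_factor(120.0, 300.0))  # 2.5
print(pitch_scaling_factor(120.0, 0.0))    # 1.0 (default, no alteration)
```
        </preformat>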
      </sec>
      <sec id="sec-3-2">
        <title>2.3. Selection of Prosodic Features for SER</title>
        <p>We grouped the 56 prosodic audio features (one of the baseline feature sets) into six classes7 in order to select those strongly related to the classes defined for SE&amp;R 2022 and to evaluate them separately and in combination: (1) related to voice quality (13 features), including local_jitter and
local_shimmer, those from Harmonics-to-Noise Ratio (HNR) and those from Glottal-to-Noise
Ratio (GNE); (2) related to intensity (9 features), for example, min_intensity, max_intensity; (3)
related to F0 (pitch) (10 features), for example, mean_pitch, stddev_pitch; (4) related to spectrum
(10 features), for example, skewness_spectrum, kurtosis_spectrum; (5) related to formants (10
features), for example, formant_dispersion, average_formant; (6) related to vocal tract length
(VTL) (4 features), for example, fitch_vtl, vtl_delta_f.</p>
        <p>The groups related to intensity (first 9 features), F0 (features 10 to 19), and spectrum (last 10 features), respectively, shown in Table 1, were chosen for our feature selection experiment, which included the training of 7 multi-task and 5 sequential classifiers, totalling 12 experiments, shown in Section 3.3. The classifiers used 10 (spectrum), 19 (intensity and F0), and 29 (spectrum, intensity, and F0) features, and also a subset of 8 features, shown in bold in Table 1.</p>
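        <p>This grouping amounts to simple column selection over the feature matrix; a sketch in which the column layout (intensity first, then F0, then spectrum, following Table 1's ordering) is our assumption for illustration:</p>
        <preformat>
```python
import numpy as np

# Assumed column layout from Table 1: 9 intensity, 10 F0, 10 spectrum features.
GROUPS = {
    "intensity": list(range(0, 9)),
    "f0":        list(range(9, 19)),
    "spectrum":  list(range(19, 29)),
}

def select(X, *groups):
    """Keep only the columns belonging to the requested feature groups."""
    cols = [c for g in groups for c in GROUPS[g]]
    return X[:, cols]

X = np.zeros((625, 29))                                # 625 training segments
print(select(X, "spectrum").shape)                     # (625, 10) runs
print(select(X, "intensity", "f0").shape)              # (625, 19) runs
print(select(X, "intensity", "f0", "spectrum").shape)  # (625, 29) runs
```
        </preformat>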
        <p>According to [25], energy, pitch, and time are the three perceptual dimensions on which most vocal indicators of the various emotions are based. Therefore, the classes of acoustic parameters related to F0, intensity, and spectrum were selected because they are reported in the literature as potential correlates of the vocal expression of emotions [25, 26, 27, 28].</p>
        <p>F0 (fundamental frequency) is an acoustic correlate of the rate of vocal cord vibration, that is, the number of times a sound wave produced by the vocal cords is repeated during a given period of time. F0 is perceived as the pitch of the voice, and the range of values for this frequency varies according to sex and age8. In turn, sound intensity corresponds to the variations in the air pressure of a sound wave and is perceived as the loudness of a sound. Loudness and pitch are, in fact, elementary domains of the auditory signal, and changes in sound intensity and F0 seem to be relevant to emotion analysis: higher and wider pitch ranges and higher sound intensity are typically associated with high-arousal emotions (e.g., fear, anger, joy) compared to neutral speech, while lower and narrower pitch ranges and lower sound intensity are more associated with low-arousal emotions (e.g., sadness, boredom, calmness) [25, 30, 31, 32, 33]. Studies have also shown that emotion affects the distribution of spectral energy across the range of sound frequencies: for example, stronger energy in higher frequency bands is usually associated with high-arousal emotions, while weaker energy in the same bands is more associated with low-arousal emotions [31]9.
7The feature voiced_fraction was allocated to the group of spectrum features, instead of the pitch group.
8For instance, 80–200 Hz for adult males, 180–400 Hz for adult females [29], and higher ranges for children. The mean values change at older ages.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.4. Neural Architectures: multi-task and sequential transfer learning</title>
        <p>
          Transfer learning is a machine learning approach that transfers weights trained on one task, domain, or language to a different one, with the aim of improving learning generalization [17]. In this work, two transfer learning techniques were used: multi-task and sequential transfer learning. In the first, the training of the two tasks is performed simultaneously, sharing a layer of weights between the two tasks [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In the second, the weights trained on the first task are transferred to the second, sequentially [34]. Figure 2 presents the two architectures.
        </p>
        <p>For the multi-task architecture, two multilayer perceptron (MLP) neural networks were used, with 4 layers each, sharing a common layer with 100 neurons. The first focused on the binary gender prediction task, using the CETUC dataset, with 44 neurons in the input layer and one neuron in the output layer. The second (the main task) focused on the prediction of the three SER classes, with the number of neurons in the input layer varying from 8 to 824 (according to the features used) and three neurons in the output layer. Both use a previous layer of 10 neurons before the common layer. For the sequential architecture, two MLPs were also used, but they were trained sequentially: the first for the binary gender prediction task, with 44 neurons in the input layer, a hidden layer of 30 neurons, and one neuron in the output layer. The hidden layer was then frozen and transferred to the second MLP, whose input layer ranged from 73 to 868 neurons (according to the features used), with three neurons in the output layer (one for each class) of the second task. The frozen layer acted by predicting the gender of the samples (auxiliary task) and passing this prediction as a new internal feature to a layer of 5 neurons before the output (for models with more features, this layer was changed to 10 neurons).
9Many of these studies used speech audios recorded in sound-proof booths under controlled scenarios. Spontaneous speech recorded in natural contexts and noisy environments, like the SER shared-task dataset, interferes with the extracted features, as the acoustic signal is affected by sound sources competing with the target signal, the performance of pitch detection algorithms degrades as the noise level increases, and even the speech signal energy depends on the distance and position between the speaker’s mouth and the microphone. Therefore, in future work, at least methods for noise incorporation/reduction will be explored to assess the impact of noise on the data.</p>
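        <p>A shape-level sketch of the sequential forward pass, under one possible reading of the description (random weights stand in for the trained ones; the exact wiring of the frozen layer into the main network is our assumption):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Frozen hidden layer of the auxiliary (gender) MLP: 44 inputs -> 30 units,
# trained first and not updated afterwards.
W_frozen = rng.normal(size=(44, 30), scale=0.1)
# Trainable layers of the main SER network (illustrative shapes).
W_mix = rng.normal(size=(29 + 30, 5), scale=0.1)   # 5-neuron pre-output layer
W_out = rng.normal(size=(5, 3), scale=0.1)         # 3 SER classes

def forward(x):
    """73-d input = 29 prosodic features + the 44 gender-task features;
    the frozen layer encodes the latter, and its activations join the
    prosodic features before the 5-neuron layer and the 3-class output."""
    x_main, x_aux = x[:29], x[29:]
    gender_code = relu(x_aux @ W_frozen)            # frozen, not updated
    h = relu(np.concatenate([x_main, gender_code]) @ W_mix)
    return h @ W_out                                # logits for the 3 classes

print(forward(rng.normal(size=73)).shape)  # (3,)
```
        </preformat>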
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>All 26 models described in Sections 3.1, 3.2, and 3.3 were trained using a batch size of 100 and 300 epochs.
3.1 Sequential Learning Results. Table 2 presents the results, in ascending order of F1-macro, for the experiments with the sequential learning architecture.
3.2 Multi-task Learning Results. Table 3 presents the results, in ascending order of F1-macro, for the experiments with the multi-task learning architecture.
3.3 Feature Selection Results. We focused on twelve experiments to evaluate small, focused feature sets, shown in Table 4 in ascending order of F1-macro.
3.4 Preliminary Evaluation of the Selected Models. Table 5 shows the confusion matrices for the first fold (20% of the data) for the three selected models. In the matrices, rows correspond to the actual/true class and columns to the predicted class. For the three selected models, the neutral class had the worst performance. It seems that the auxiliary task (gender classification from speech) helped in classifying the non-neutral male and non-neutral female classes.</p>
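      <p>The shared task's F1-macro metric can be recovered from such a confusion matrix; a sketch with an invented illustrative matrix (the numbers below are not from Table 5):</p>
      <preformat>
```python
import numpy as np

def f1_macro(cm):
    """Macro-averaged F1 from a confusion matrix whose rows are the true
    classes and columns the predicted classes (the layout used in Table 5)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()                                    # unweighted average

# Illustrative 3-class matrix (neutral, non-neutral-female, non-neutral-male).
cm = [[80, 12, 6],
      [5, 13, 0],
      [4, 1, 4]]
print(round(f1_macro(cm), 4))
```
      </preformat>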
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions and Future Work</title>
      <p>In this work, we evaluated 26 DNN models, using 5-fold cross-validation over the training dataset, and submitted our best models, i.e., those with the highest F1-macro for each group of experiments in Sections 3.1, 3.2, and 3.3. One of the submitted models surpassed the prosodic features baseline, reaching 0.5353 F1-macro. As future work, we will perform an error analysis to understand why our best submitted model had a good performance on the training dataset, but only a 0.5353 F1-macro on the test set.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by
the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation.
[17] S. Ruder, M. E. Peters, S. Swayamdipta, T. Wolf, Transfer learning in natural language
processing, in: Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Tutorials, Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 15–18. doi:10.18653/v1/N19- 5004.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority
over-sampling technique, J. Artif. Int. Res. 16 (2002) 321–357.
[19] P. Boersma, D. Weenink, Praat: Doing phonetics by computer, 2010. URL: http://www.</p>
      <p>praat.org/.
[20] B. Gianesi, S. Aluisio, Classificação de gênero via análise de áudio utilizando métodos
de aprendizado de máquina tradicionais, 2021. URL: https://github.com/BrunoGianesi/
Speaker-Gender-Recognition, To appear in https://eesc.usp.br/biblioteca/.
[21] V. F. S. Alencar, A. Alcaim, LSF and LPC - Derived Features for Large Vocabulary
Distributed Continuous Speech Recognition in Brazilian Portuguese, in: 2008 42nd Asilomar
Conference on Signals, Systems and Computers, 2008, pp. 1237–1241.
[22] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, R. Collobert, MLS: A Large-Scale Multilingual</p>
      <p>Dataset for Speech Research, in: Proc. Interspeech 2020, 2020, pp. 2757–2761.
[23] D. Liang, E. Thomaz, Audio-based activities of daily living (adl) recognition with large-scale
acoustic embeddings from online videos, Proc. ACM Interact. Mob. Wearable Ubiquitous
Technol. 3 (2019). URL: https://doi.org/10.1145/3314404. doi:10.1145/3314404.
[24] G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A python toolbox to tackle the
curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (2017) 559–563.
[25] J. Pittam, K. R. Scherer, Vocal expression and communication of emotion, in: M. Lewis,
J. M. Haviland (Eds.), Handbook of emotions, The Guilford Press, New York, 1993, pp.
185–198.
[26] K. R. Scherer, Vocal afect expression: a review and a model for future research,
Psychological Bulletin 99 (1986) 143–165.
[27] P. A. Barbosa, Detecting changes in speech expressiveness in participants of a radio
program, in: INTERSPEECH 2009, 10th Annual Conference of the International Speech
Communication Association, Brighton, United Kingdom, ISCA, 2009, pp. 2155–2158.
[28] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features,
classification schemes, and databases, Pattern Recognition 44 (2011) 572–587.
[29] J. t’Hart, R. Collier, A. Cohen, A Perceptual Study of Intonation: An Experimental-Phonetic
Approach to Speech Melody, Cambridge Studies in Speech Science and Communication,
Cambridge University Press, 1990. doi:10.1017/CBO9780511627743.
[30] R. Banse, K. R. Scherer, Acoustic profiles in vocal emotion expression, Journal of
Personality and Social Psychology 70 (1996) 614–636.
[31] T. Johnstone, K. R. Scherer, Vocal communication of emotion, in: M. Lewis, J. M.
Haviland-Jones (Eds.), Handbook of emotions, 2 ed., The Guilford Press, New York, 2000, pp. 220–235.
[32] P. N. Juslin, P. Laukka, Impact of intended emotion intensity on cue utilization and
decoding accuracy in vocal expression of emotion, Emotion 1 (2001) 381–412.
[33] D. Guo, H. Yu, A. Hu, Y. Ding, Statistical analysis of acoustic characteristics of Tibetan
Lhasa dialect speech emotion, SHS Web of Conferences 25 (2016) 1–5.
[34] S. Ruder, Neural Transfer Learning for Natural Language Processing, Ph.D. thesis, National
University of Ireland, Galway, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Akçay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</article-title>
          ,
          <source>Speech Communication</source>
          <volume>116</volume>
          (
          <year>2020</year>
          )
          <fpage>56</fpage>
          -
          <lpage>76</lpage>
          . doi:10.1016/j.specom.2019.12.001.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Oster</surname>
          </string-name>
          ,
          <article-title>Facial expressions of emotion</article-title>
          ,
          <source>Annual Review of Psychology</source>
          <volume>30</volume>
          (
          <year>1979</year>
          )
          <fpage>527</fpage>
          -
          <lpage>554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. da S.</given-names>
            <surname>Moutinho</surname>
          </string-name>
          ,
          <article-title>DEEP: Uma arquitetura para reconhecer emoção com base no espectro sonoro da voz de falantes da língua portuguesa</article-title>
          ,
          <year>2020</year>
          . URL: https://bdm.unb.br/handle/10483/27583, January 18, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Deep cross-corpus speech emotion recognition: Recent advances and perspectives</article-title>
          ,
          <source>Frontiers in Neurorobotics</source>
          <volume>15</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehrabian</surname>
          </string-name>
          ,
          <article-title>Evidence for a three-factor theory of emotions</article-title>
          ,
          <source>Journal of Research in Personality</source>
          <volume>11</volume>
          (
          <year>1977</year>
          )
          <fpage>273</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Raso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Mittmann</surname>
          </string-name>
          ,
          <article-title>The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese</article-title>
          ,
          <source>in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Istanbul, Turkey,
          <year>2012</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Luengo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Navas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hernáez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <article-title>Automatic emotion recognition using prosodic parameters</article-title>
          ,
          <source>in: INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology</source>
          , Lisbon, Portugal, September 4-8,
          <year>2005</year>
          , ISCA
          , pp.
          <fpage>493</fpage>
          -
          <lpage>496</lpage>
          . URL: http://www.isca-speech.org/archive/interspeech_2005/i05_0493.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Koolagudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Vempada</surname>
          </string-name>
          ,
          <article-title>Emotion recognition from speech using global and local prosodic features</article-title>
          ,
          <source>Int. J. Speech Technol</source>
          .
          <volume>16</volume>
          (
          <year>2013</year>
          )
          <fpage>143</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          , CoRR abs/2006.11477 (
          <year>2020</year>
          ). arXiv:2006.11477.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Abbaschian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sierra-Sosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmaghraby</surname>
          </string-name>
          ,
          <article-title>Deep learning techniques for speech emotion recognition, from databases to models</article-title>
          ,
          <source>Sensors</source>
          <volume>21</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zvarevashe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Olugbara</surname>
          </string-name>
          ,
          <article-title>Ensemble learning of hybrid acoustic features for speech emotion recognition</article-title>
          ,
          <source>Algorithms</source>
          <volume>13</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <article-title>Speech Emotion Recognition with MultiTask Learning</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>4508</fpage>
          -
          <lpage>4512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition based on multi-task learning using a convolutional neural network</article-title>
          ,
          <source>in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>704</fpage>
          -
          <lpage>707</lpage>
          . doi:10.1109/APSIPA.2017.8282123.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          ,
          <article-title>Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2019</year>
          ,
          <year>2019</year>
          , pp.
          <fpage>2803</fpage>
          -
          <lpage>2807</lpage>
          . doi:10.21437/Interspeech.2019-2594.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stolar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Best</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bolia</surname>
          </string-name>
          ,
          <article-title>Real-time speech emotion recognition using a pretrained image classification network: Effects of bandwidth reduction and companding</article-title>
          ,
          <source>Frontiers in Computer Science</source>
          <volume>2</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>Multitask learning</article-title>
          ,
          <source>Machine Learning, Special Issue on Inductive Transfer,</source>
          <volume>28</volume>
          (
          <year>1997</year>
          )
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>