Transfer Learning and Data Augmentation Techniques applied to Speech Emotion Recognition in SE&R 2022 Caroline Alves1 , Bruno Carlotto2 , Bruno Dias1 , Anátale Garcia1 , Bruno Gianesi3 , Renan Izaias1 , Maria Luiza de Morais1 , Paula de Oliveira1 , Vinícius G. Santos1 , Rafael Sicoli1 , Flaviane R. Fernandes Svartman1 , Sandra Aluisio2 and Sidney Leal2 1 Departamento de Letras Clássicas e Vernáculas, FFLCH-USP 2 Instituto de Ciências Matemáticas e de Computação, ICMC-USP 3 Engenharia Mecatrônica, EESC-USP Abstract In this work, our team ICMC-EESC-FFLCH explores several techniques to address data scarcity and imbalance in SE&R 2022 task dedicated to speech emotion recognition (SER). We evaluate two types of transfer learning models: (i) Multi-task learning, in which two tasks are learned simultaneously, and (ii) Sequential transfer learning where the tasks are learned sequentially. In both models, the auxiliary task is genre classification from speech, using a large dataset with almost 145 hours of speech signals. As for the techniques to balance the training data, we have used the SMOTE (Synthetic Minority Over-sampling Technique) and Praat’s Change gender command to over-sampling minority classes. Our Sequential transfer learning architecture, using the two baselines feature sets provided by the shared-task (prosodic audio features and embeddings generated by the Wav2Vec 2.0 model) and the two approaches to balance the training dataset reaches satisfactory performance with a 0.5353 F1-macro, surpassing the prosodic features baseline. On the other hand, our multi-task learning approach using the two baseline features sets and the SMOTE approach to balance the training dataset reaches only a 0.5301 F1-macro. Finally, our worst result is 0.469 F1-macro, obtained with the feature selection experiment (29 prosodic features manually chosen from the literature), using our multi-task learning architecture with the two approaches to balance the training dataset. Keywords Deep Learning, Transfer Learning, Data Augmentation, Speech Emotion Recognition 1. Introduction According to [1], speech emotion recognition (SER) systems are composed of methods, namely feature extraction and emotion classification, that process and classify speech signals to detect the embedded emotions of speech. They can also include a preprocessing step before the extraction of the features used to normalize the signals, for example, the use of noise reduction Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online). Envelope-Open carolalves@usp.br (C. Alves); bruno.baldissera@usp.br (B. Carlotto); brunoadiaspapa1@usp.br (B. Dias); anatale.garcia@usp.br (A. Garcia); brunogianesi@usp.br (B. Gianesi); renan.izaias@usp.br (R. Izaias); marialuizamorais@usp.br (M. L. d. Morais); paulamarindeoliveira@usp.br (P. d. Oliveira); vinicius.santos@alumni.usp.br (V. G. Santos); rafael.pac90@gmail.com (R. Sicoli); flavianesvartman@usp.br (F. R. Fernandes Svartman); sandra@icmc.usp.br (S. Aluisio); sidleal@gmail.com (S. Leal) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) techniques. Emotion classes depend on labeled data of the dataset used to create the model; these datasets can be of three types: acted, elicited or natural. While most of the natural datasets are from spontaneous speech recorded in noisy environments, acted speech databases are recorded by professional actors in sound-proof studios. Elicited speech datasets are created by placing speakers in a simulated emotional situation that can stimulate various emotions and can be close to real ones. It is important to notice that, the definition of emotion is an open problem in psychology and there are two models being used in SER systems: discrete and dimensional emotional models. The first one is based on the six primary and culturally independent categories of basic emotions [2]: sadness, happiness, fear, anger, disgust, and surprise, where other emotions are obtained by the combination of the basic ones. Most of the existing SER systems focus on all these basic emotional categories, sometimes including the neutral category (see, for example, [3], a study focusing on Portuguese language), or in a small group of those emotions1 . The second one, the dimensional emotional model, uses a small number of latent dimensions to define emotions such as: valence, arousal/excitation, control/power. In this model, emotions are not independent of each other, instead, they are analogous to each other in a systematic way. [5] support of the thesis that the three dimensions of pleasure-displeasure (valence), arousal- nonarousal (excitation), and dominance-submissiveness (power/control) are both necessary and sufficient to describe a large variety of emotional states. Specifically, valence describes whether an emotion is positive or negative, and it ranges between unpleasant and pleasant; excitation defines the strength of the felt emotion, ranging from boredom to frantic excitement; and the dimension of control/power refers to the seeming strength of the person (between weak and strong). For example, the third dimension differentiates anger from fear by considering the strength or weakness of the person, respectively; however, as the surprise emotion may have positive or negative valence depending on the context, it is difficult to categorize. Whereas most studies on SER deal with simulated, noise-free datasets recorded in sound-proof studios [4], SE&R 2022 brings a small dataset of approximately 50 minutes, with 625 audio segments (training dataset) from the C-ORAL-BRASIL I corpus [6], consisting of audio segments representing Brazilian Portuguese informal spontaneous speech, recorded in natural contexts and noisy environments. The two baseline feature sets (prosodic audio features for emotion classification [7, 8] and embeddings generated by the Wav2Vec 2.0 model [9]) made available for SE&R 2022 were used in this work. Feature selection was also evaluated, focusing on four small prosodic features sets, manually chosen, with 29, 19, 10, and 8 features, taken from pitch, intensity, and spectrum groups of features. While the first SER systems used machine learning methods with a careful feature engineering (see several examples in [10]), recent approaches use ensembles to learn hybrid acoustic features [11], and deep learning architectures, such as multi-task learning [12, 13], attention mechanisms [14], and transfer learning approaches [15]. Our contribution to SE&R 2022 explores two architectures based on deep neural networks (DNN) aiming at detecting Speech Emotion Recognition in Portuguese audio files. Our proposal evaluates two types of inductive transfer learning: multi-task [16] and sequential transfer learning [17]. In both models, the auxiliary task is genre classification from speech2 . Since 1 There are large lists of datasets used for emotion recognition in [1] and [4]. 2 Project’s github: https://github.com/BrunoGianesi/Speaker-Gender-Recognition. DNN-based classifiers have a generalization error problem when trained with limited datasets, we explore two different data augmentation techniques aimed to balance the training data. We have used the SMOTE [18] to create synthetic data for the minority classes and Praat’s [19] Change gender command to manipulate the acoustic features in order to create new synthetic data based on the pre-existing ones. The Jupyter notebooks and characterization of the training dataset are publicly available at https://github.com/BrunoBaldissera/ser-transfer. 2. Experimental Framework First, we present the original dataset for the main task and the dataset used for the auxiliary task of genre classification from speech in both inductive transfer learning architectures (Section 2.1), noting that the original dataset is unbalanced. Therefore, we applied two techniques for data augmentation (Section 2.2). Section 2.3 presents the feature sets we explored in our linguistically motivated selection of prosodic features, based on the literature. Finally, Section 2.4 presents our multi-task and sequential transfer learning architectures. 2.1. Datasets 2.1.1 Primary Task Dataset: official dataset of SE&R shared-task on SER. In the SE&R 2022 shared-task on SER, the audio segments are labeled in three classes: neutral, non-neutral female, and non-neutral male. The neutral class is the majority class (491 samples) and is used to label audio segments with no well-defined emotional state while the non-neutral classes label segments (89 non-neutral-female and 45 non-neutral-male) associated with one of the primary emotional states in the speaker’s speech. In order to better understand the training dataset used in this study, seven annotators from our group pursued a qualitative analysis of the dataset. They labeled every audio in the training set with “yes” (meaning presence) or “no” (meaning absence) according to the following categories: • Noise: any sort of noise not related with the primary voice(s)3 , e.g., background chatting, microphone hissing noise, music, children voices, etc.; • Voice overlapping: periods in which there were two primary voices speaking at the exact same moment; • Different gender: the presence of more than one perceived gender in the primary voices of the same audio; and • Voices in sequence: the presence of more than one primary voice in the same audio, but without direct overlapping between them. Our evaluation is summarized in Figures 1a and 1b. As we can see, there is a lot of noisy audio. Although noise is not a problem for the auxiliary task (Audio Genre Classification) [20] of the neural architectures, only an error analysis can identify possible problems for the SER task as a whole. Also, two complex problems were found: high overlapping rate of voices and audios with different genres, which we believe may have an impact on the classification of the 2 non-neutral classes (male and female). Of the 26 non-neutral audios that have different gender, 3 We consider primary voices to be the loudest, and secondary voices to be the least prominent in the audio. 24 have voices overlapping and only 2 have voices in sequence. Of the 56 neutral audios that have different gender, 53 have voices overlapping and only 3 have voices in sequence. Figure 1: A qualitative analysis of the SER dataset performed by our team. Figures 1a and 1b show a characterization of the training dataset, presenting the number of audios with noise, primary overlapping voices, primary voices with different genders, primary voices in sequence, for both types of classes (neutral and non-neutral) audios. 2.1.2 Auxiliary Task Dataset: CETUC. The task of classifying gender based on voice identifies automatically a voice as male or female, based on the audio features. The gender identification of a given speaker was implemented in an undergrad project of one of the authors [20], to evaluate machine learning methods, such as decision trees, random forest, gradient boosting, support vector machine, multi-layer perceptron and logistic regression, and to compare the use of distinct features and models applied on different datasets. In addition, the study also assessed whether the models generalize to other contexts, such as other languages (English) or noisy environments, when trained on CETUC dataset [21] that was recorded in a controlled environment. The best performance method (gradient boosting) was trained using the large dataset CETUC, with almost 145 hours of speech signals spoken by 50 male and 50 female speakers4 , each one pronouncing 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus5 . The best performance model used three sets of features from audio signals, totalling 44 features: (i) 12 statistics extracted from the highest frequency value, after applying the Fourier transform on the audios, divided into time windows of 0.2 seconds, (ii) the fundamental frequency (F0) statistics (12) and (iii) 20 MFCCs (Mel-Frequency Cepstral Coefficients), and reached an accuracy of 94,1%. This model was able to generalize well to audios with noise; it reached an accuracy of 90,8% on the testset MLS [22] with noise. 2.2. Data Augmentation Approaches: SMOTE and Praat’s Change Gender We used two approaches to balance the training dataset applied specifically on audios of non-neutral male and non-neutral female classes: SMOTE [18] and Praat’s Change gender 4 https://igormq.github.io/datasets/ 5 https://www.linguateca.pt/cetenfolha/index_info.html command [19]. It is suggested by the authors of the original SMOTE paper that previously performing a random under-sampling of the majority class followed by over-sampling the minority class tends to yield good results. However, in this work, we have only over-sampled the minority classes, following the work by [23], and using the technique in its simplest implementation. Nonetheless, as the synthesis of new data with SMOTE uses a linear combination of randomly chosen neighbors of the underrepresented instances in the feature space rather than just replicating the given instances, we gave more focus to this augmentation approach in place of the simple oversampling (even though a number of such tests was performed). We have used the Python imbalanced-learn package [24]; all the parameters were set as default. Praat’s Change gender command allow us to manipulate the acoustic features to create new synthetic data based on the preexisting ones. Through this method, we can change the perceived gender of a given voice into the opposite gender. The second method for data augmentation consists in the use of the algorithm for gender conversion available in the software for acoustic analysis Praat. A total of 133 files were used, 45 of them containing male voices, then converted to female ones, and 88 containing female voices, then converted to male ones6 . The task was undertaken by five annotators and had two phases: attribution of parameters for conversion and quality evaluation of the generated voice. In the quality assessment phase, the annotators changed the previously established default values in order to obtain voices that they judged the most natural as possible. For the conversion process, we first defined the frequency range in which the algorithm parameters were applied, using the values already predefined by the program, with the minimum pitch value being 75 Hz, and the maximum 600 Hz. The algorithm contains four parameters, described below, that can be used for gender conversion, from which we have only used the first two: • Formant shift ratio (default value is 1.0) determines the ratio for proportionally modify- ing the value of formants, i.e., the sound frequency values at which the highest peaks of intensity occur, resulting from the resonance of the sound wave in its path through the vocal tract, from its production in the vocal folds until the moment of emission. The factor valued 1.0 means there is no alteration. For the task, we established the factor value 1.1 as the standard for male-to-female conversion, used in 30 of 45 files, and 0.8 for female-to-male conversion, used in 72 of 88 files. As mentioned above, these values were altered in some files in order to maintain a perceived natural quality of the converted voice: for the other 15 male-to-female converted files, factors between 1.15 or 1.2 were used, and for the other 16 female-to-male converted files, values between 0.85 or 0.9. • New pitch median (default value is 0.0): a new median for the pitch values is established for each file, which, in turn, is used to compose a factor expressed by the ratio between this new median and the original median pitch. This factor is then used by the algorithm to multiply the original pitch values to obtain new values. In this metric, the value 0.0 represents the default setting, yielding the factor 1.0, which means no alteration. We established as standard values for this assignment the frequency measurement of 300 Hz for male-to-female conversion, for 35 of 45 files, and 140 Hz for female-to-male conversion, 6 For one of the audios, the algorithm could not produce a successful conversion. for 58 of 88 files. These values were also altered in some files to achieve a convincing result: for male-to-female conversion, values between 250 Hz and 380 Hz were used for the other 10 files, and for female-to-male conversion, values between 80 Hz and 260 Hz were used for the other 30 files. • Pitch range factor (default value: 1.0) provides for an additional modification in pitch by an extra scaling of the values around the new pitch median, obtained in the previous step. A factor of 1.0 means that no additional pitch modification will occur, and a factor valued as 0.0 monotonizes the new sound to the new pitch median. Considering the essential goal of the project, the default value was kept and no modifications for the pitch range were provided. • Duration factor (default value: 1.0) establishes a factor used for lengthening the sound file. For a factor valued less than 1.0, the resulting sound will be shorter than the original, and a value higher than 3.0 will not work. The default value provided by the software was also maintained, as a change in the duration of the sound is deemed as unnecessary for the development of the task. 2.3. Selection of Prosodic Features for SER We grouped the 56 prosodic audio features (one of the baseline feature sets) into six classes7 in order to select those strongly related to the classes defined for SE&R 2022 and evaluate them separately and conjoined: (1) related to voice quality (13 features), including local_jitter and local_shimmer, those from Harmonics-to-Noise Ratio (HNR) and those from Glottal-to-Noise Ratio (GNE); (2) related to intensity (9 features), for example, min_intensity, max_intensity; (3) related to F0 (pitch) (10 features), for example, mean_pitch, stddev_pitch; (4) related to spectrum (10 features), for example, skewness_spectrum, kurtosis_spectrum; (5) related to formants (10 features), for example, formant_dispersion, average_formant; (6) related to vocal tract length (VTL) (4 features), for example, fitch_vtl, vtl_delta_f. The groups related to intensity (first 9 features), F0 (from 10 to 19), and spectrum (last 10 features), respectively shown in Table 1, were chosen for our feature selection experiment which included the training of 7 multi-task and 5 sequential classifiers, totalling 12 experiments, shown in Section 3.3. The classifiers used 10 (related to spectrum), 19 (intensity and F0) and 29 (spectrum, intensity, and F0) features and also a subset of 8 features, shown in bold in Table 1. According to [25], energy, pitch, and time are the three perceptual dimensions on which most vocal indicators of various emotions are based. Therefore, the class of acoustic parameters related to F0, intensity, and spectrum were selected because they are reported in the literature as potential correlates of the vocal expression of emotions [25, 26, 27, 28]. F0 (fundamental frequency) is an acoustic correlate of the rate of vocal cords vibration, that is, the number of times a sound wave produced by the vocal cords is repeated during a given period of time. F0 is perceived as the pitch of the voice, and the range of values for this frequency varies according to sex and age8 . In turn, sound intensity corresponds to the variations in the air pressure of a sound wave and is perceived as the loudness of a sound. Loudness and pitch are, 7 The feature voiced_fraction was allocated in the group of spectrum features, instead of with the pitch group. 8 For instance, 80–200 Hz for adult males, 180–400 Hz for adult females [29], and higher ranges for children. The mean values change for older ages. Table 1 Features used in the classifiers of the feature selection experiment. 1 Min_intensity 16 Q1_pitch 2 Relative_min_intensity_time 17 Q3_pitch 3 Max_intensity 18 Mean_absolute_pitch_slope 4 Relative_max_intensity_time 19 Pitch_slope_without_octave_jumps 5 Mean_intensity 20 Center_of_gravity_spectrum 6 Stddev_intensity 21 Stddev_spectrum 7 Q1_intensity 22 Skewness_spectrum 8 Median_intensity 23 Kurtosis_spectrum 9 Q3_intensity 24 Central_moment_spectrum 10 Min_pitch 25 Voiced_fraction 11 Relative_min_pitch_time 26 Band_energy 12 Max_pitch 27 Band_density 13 Relative_max_pitch_time 28 Band_energy_difference 14 Mean_pitch 29 Band_density_difference 15 Stddev_pitch in fact, elementary domains of the auditory signal and changes in sound intensity and F0 seem to be relevant to emotion analysis: higher and wider pitch ranges and higher sound intensity are typically associated with high arousal emotions (e.g., fear, anger, joy) compared to neutral speech, while lower and narrower pitch ranges and lower sound intensity are more associated with low arousal emotions (e.g., sadness, boredom, calmness) [25, 30, 31, 32, 33]. Studies have also shown that emotion affects the distribution of spectral energy across the range of sound frequencies: for example, stronger energy in higher frequency bands is usually associated with high arousal emotions, while weaker energy in the same band is more associated with low arousal emotions [31]9 . 2.4. Neural Architectures: multi-task and sequential transfer learning Transfer Learning is a machine learning approach that transfers weights trained in one task, domain, or language to a different one, with the aim of improving the learning generalization [17]. In this work, two Transfer Learning techniques were used: Multi-task and Sequential Transfer Learning. In the first one, the training of the two tasks is performed simultaneously, sharing a layer of weights between the two tasks [16]. In the second, the weights trained in the first task are transferred to the second, sequentially [34]. Figure 2 presents the two architectures. For the Multi-task architecture, two MultiLayer Perceptron (MLP) neural networks were used, with 4 layers each, sharing a common layer with 100 neurons. The first one focused on the binary gender prediction task, using the CETUC dataset, with 44 neurons in the input layer and one neuron in the output layer. The second (main task), focused on the prediction of the three 9 Many of these studies used speech audios recorded in sound-proof booths with controlled scenarios. Sponta- neous speech recorded in natural contexts and noisy environments like SER shared-task dataset interferes with extracted features results, as the acoustic signal is affected by sound sources competing with the target signal, the performance of pitch detection algorithms degrades as the noise level increases, and even the speech signal energy depends on the distance and position between the speaker’s mouth and microphone. Therefore, in future work, at least methods for noise incorporation/reduction will be explored to assess the impact of noise on data. (a) Multi-task (b) Sequential Figure 2: Transfer Learning architectures: a) Multi-task: 2 MLP’s with 4 layers (1 shared); and b) Sequential: the second MLP with 5 layers uses a frozen layer from the first. Prosodic Features Set 1 is composed of 44 features described in the work developed by [20] while Prosodic Features Set 2 is composed of 56 features provided by the SE&R shared-task on SER and described in Section 2.3. SER classes, with the number of neurons in the input layer varying from 8 to 824 (according to the features used) and three neurons in the output layer. Both use a previous layer of 10 neurons before the common layer. For the Sequential architecture, two MLP’s were also used, but they were trained sequentially. The first for the binary gender prediction task with 44 neurons in the input, a hidden layer of 30 neurons and one neuron in the output. The hidden layer was then frozen and transferred to the second MLP, whose input layer ranged from 73 to 868 (according to the features used) and with three neurons in the output layer (one for each class) of the second task. The frozen layer acted by predicting the gender of the samples (auxiliary task) and passing this prediction as a new internal feature to a layer of 5 neurons before the output (for models with more features this layer was changed to 10 neurons). 3. Experiments All the 26 models described in Sections 3.1, 3.2 and 3.3 were trained using a batch size of 100 and 300 epochs. 3.1 Sequential Learning Results. Table 2 presents the results, in crescent order of F1-macro values, for the experiments with the sequential learning architecture. 3.2 Multi-task Learning Results. Table 3 presents the results, in crescent order of F1-macro values, for the experiments with the multi-task learning architecture. Table 2 Sequential Learning results using 5-fold cross-validation. We indicate in the model’s name which feature set was used and whether a data augmentation technique was used (+) or was not used (-). The last line indicates the value of F1-macro for the submitted model, using the full dataset. Average scores for all 5 folds Model name/feature sets/data aug techniques F1-macro Accuracy Loss (1) Seq. wav2vec - SMOTE + CG 0.4139 79.0240 0.1122 (2) Seq. all prosodic + SMOTE - CG 0.4653 53.4022 0.2001 (3) Seq. all prosodic + SMOTE + CG 0.5344 61.8574 0.1815 (4) Seq. all prosodic and wav2vec + SMOTE - CG 0.5621 64.0564 0.1566 (5) Seq. wav2vec + SMOTE - CG 0.7043 74.8603 0.1379 (6) Seq. wav2vec + SMOTE + CG 0.7067 75.4375 0.1170 (7) Seq. all prosodic and wav2vec + SMOTE + CG 0.8035 82.4077 0.0846 1st Submission - Model (7) (full dataset) 0.5353 Table 3 Multi-task Learning results using 5-fold cross-validation. We indicate in the model’s name which feature set was used and whether a data augmentation technique was used (+) or was not used (-). The last line indicates the value of F1-macro for the submitted model, using the full dataset. Average scores for all 5 folds Model name/feature sets/data augmentation techniques F1-macro Accuracy Loss (1) Multi-task wav2vec - SMOTE + CG 0.7492 9.3794 0.3163 (2) Multi-task all prosodic + SMOTE + CG 0.8145 8.3135 0.3054 (3) Multi-task all prosodic + SMOTE - CG 0.8234 8.9146 0.3115 (4) Multi-task wav2vec + SMOTE - CG 0.8498 7.3750 0.3413 (5) Multi-task wav2vec + SMOTE + CG 0.8882 5.6187 0.2786 (6) Multi-task all prosodic and wav2vec + SMOTE + CG 0.8941 5.3127 0.2770 (7) Multi-task all prosodic and wav2vec + SMOTE - CG 0.9052 4.9903 0.2733 2nd submission - Model (7) (full dataset) 0.5301 3.3 Feature Selection Results. We focused on twelve experiments to evaluate small and focused feature sets, shown on Table 4, in crescent order of F1-macro values. 3.4 Preliminary Evaluation of the Selected Models. Table 5 shows the confusion matrices for the first fold (20% of data), related to the three selected models. In the matrices, rows are termed as actual/true class and columns are termed as a predicted class. For the three selected models, the neutral class had the worst performance. It seems that the auxiliary task (genre classification from speech) has helped in classifying non-neutral male and non-neutral female classes. 4. Conclusions and Future Work In this work, we evaluate 26 DNN models, using 5-fold cross-validation over the training dataset, and submitted our best models, i.e. those with higher F1-macro, for each group of experiments in Sections 3.1, 3.2, and 3.3. One of the submitted models surpassed the prosodic features baseline, reaching 0.5353 F1-macro. As a future work, we will perform an error analysis to Table 4 Feature Selection results using 5-fold cross-validation. We indicate in the model’s name which feature set was used and whether a data augmentation technique was used (+) or was not used (-). The last line indicates the value of F1-macro for the submitted model, using the full dataset. Average scores for all 5 folds Model name/feature sets/data aug techniques F1-macro Accuracy Loss (1) Seq. 8 prosodic - SMOTE + CG 0.2917 77.9714 0.1230 (2) Seq. 8 prosodic + SMOTE + CG 0.3216 43.5904 0.2356 (3) Seq. 19 prosodic + SMOTE - CG 0.4022 52.5366 0.2077 (4) Seq. 8 prosodic + SMOTE - CG 0.4109 48.8061 0.2382 (5) Seq. 19 prosodic + SMOTE + CG 0.5315 0.1684 0.1684 (6) Multi-task. 8 prosodic - SMOTE + CG 0.7261 9.4466 0.3167 (7) Multi-task 8 prosodic + SMOTE - CG 0.7440 11.7683 0.3622 (8) Multi-task 19 prosodic + SMOTE - CG 0.7835 10.5195 0.3275 (9) Multi-task. 8 prosodic + SMOTE + CG 0.7917 9.3354 0.3156 (10) Multi-task 19 prosodic + SMOTE + CG 0.8172 8.6319 0.3086 (11) Multi-task 10 prosodic + SMOTE + CG 0.8214 8.4761 0.3070 (12) Multi-task 29 prosodic + SMOTE + CG 0.8266 8.2014 0.3265 3rd submission Model (12) (full dataset) 0.4696 Table 5 Confusion Matrices generated in the first iteration for the first fold (20% of data), during the training of the models. (N = Neutral, M = Non-Neutral Male, F = Non-Neutral Female). N F M Seq. all prosodic and wav2vec + SMOTE + CG: N 71 27 8 F 2 87 1 M 1 11 87 Multi-task all prosodic and wav2vec + SMOTE - CG: N 82 20 5 F 5 88 3 M 1 0 90 Multi-task 29 prosodic + SMOTE + CG: N 68 18 9 F 8 100 0 M 2 5 85 understand why our best submitted model had a good performance on the training dataset, but only a 0.5353 F1-macro value on the test set. Acknowledgments This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation. References [1] M. B. Akçay, K. Oğuz, Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers, Speech Communication 116 (2020) 56–76. doi:https://doi.org/10.1016/j.specom.2019.12.001 . [2] P. Ekman, H. Oster, Facial expressions of emotion, Annual Review of Psychology 30 (1979) 527–554. [3] G. A. Campos, L. da S. Moutinho, DEEP: Uma arquitetura para reconhecer emoção com base no espectro sonoro da voz de falantes da língua portuguesa, 2020. URL: https://bdm. unb.br/handle/10483/27583, january 18, 2022. [4] S. Zhang, R. Liu, X. Tao, X. Zhao, Deep cross-corpus speech emotion recognition: Recent advances and perspectives, Frontiers in Neurorobotics 15 (2021). [5] J. A. Russell, A. Mehrabian, Evidence for a three-factor theory of emotions, Journal of research in Personality 11 (1977) 273–294. [6] T. Raso, H. Mello, M. M. Mittmann, The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 106–113. [7] I. Luengo, E. Navas, I. Hernáez, J. Sánchez, Automatic emotion recognition using prosodic parameters, in: INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005, ISCA, 2005, pp. 493–496. URL: http://www.isca-speech.org/archive/interspeech_2005/i05_0493.html. [8] K. S. Rao, S. G. Koolagudi, R. R. Vempada, Emotion recognition from speech using global and local prosodic features, Int. J. Speech Technol. 16 (2013) 143–160. [9] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, CoRR abs/2006.11477 (2020). arXiv:2006.11477 . [10] B. J. Abbaschian, D. Sierra-Sosa, A. Elmaghraby, Deep learning techniques for speech emotion recognition, from databases to models, Sensors 21 (2021). [11] K. Zvarevashe, O. Olugbara, Ensemble learning of hybrid acoustic features for speech emotion recognition, Algorithms 13 (2020). [12] X. Cai, J. Yuan, R. Zheng, L. Huang, K. Church, Speech Emotion Recognition with Multi- Task Learning, in: Proc. Interspeech 2021, 2021, pp. 4508–4512. [13] N. K. Kim, J. Lee, H. K. Ha, G. W. Lee, J. H. Lee, H. K. Kim, Speech emotion recognition based on multi-task learning using a convolutional neural network, in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 704–707. doi:10.1109/APSIPA.2017.8282123 . [14] Y. Li, T. Zhao, T. Kawahara, Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning, in: Proc. Interspeech 2019, 2019, pp. 2803–2807. doi:10.21437/Interspeech.2019- 2594 . [15] M. Lech, M. Stolar, C. Best, R. Bolia, Real-time speech emotion recognition using a pre- trained image classification network: Effects of bandwidth reduction and companding, Frontiers in Computer Science 2 (2020). [16] R. Caruana, Multitask learning, Machine Learning - Special issue on inductive transfer - Volume 28 (1997) 41–75. [17] S. Ruder, M. E. Peters, S. Swayamdipta, T. Wolf, Transfer learning in natural language processing, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 15–18. doi:10.18653/v1/N19- 5004 . [18] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (2002) 321–357. [19] P. Boersma, D. Weenink, Praat: Doing phonetics by computer, 2010. URL: http://www. praat.org/. [20] B. Gianesi, S. Aluisio, Classificação de gênero via análise de áudio utilizando métodos de aprendizado de máquina tradicionais, 2021. URL: https://github.com/BrunoGianesi/ Speaker-Gender-Recognition, To appear in https://eesc.usp.br/biblioteca/. [21] V. F. S. Alencar, A. Alcaim, LSF and LPC - Derived Features for Large Vocabulary Dis- tributed Continuous Speech Recognition in Brazilian Portuguese, in: 2008 42nd Asilomar Conference on Signals, Systems and Computers, 2008, pp. 1237–1241. [22] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, R. Collobert, MLS: A Large-Scale Multilingual Dataset for Speech Research, in: Proc. Interspeech 2020, 2020, pp. 2757–2761. [23] D. Liang, E. Thomaz, Audio-based activities of daily living (adl) recognition with large-scale acoustic embeddings from online videos, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 3 (2019). URL: https://doi.org/10.1145/3314404. doi:10.1145/3314404 . [24] G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (2017) 559–563. [25] J. Pittam, K. R. Scherer, Vocal expression and communication of emotion, in: M. Lewis, J. M. Haviland (Eds.), Handbook of emotions, The Guilford Press, New York, 1993, pp. 185–198. [26] K. R. Scherer, Vocal affect expression: a review and a model for future research, Psycho- logical Bulletin 99 (1986) 143–165. [27] P. A. Barbosa, Detecting changes in speech expressiveness in participants of a radio program, in: INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, ISCA, 2009, pp. 2155–2158. [28] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44 (2011) 572–587. [29] J. t’Hart, R. Collier, A. Cohen, A Perceptual Study of Intonation: An Experimental-Phonetic Approach to Speech Melody, Cambridge Studies in Speech Science and Communication, Cambridge University Press, 1990. doi:10.1017/CBO9780511627743 . [30] R. Banse, K. R. Scherer, Acoustic profiles in vocal emotion expression., Journal of person- ality and social psychology 70 (1996) 614–36. [31] T. Johnstone, K. R. Scherer, Vocal communication of emotion, in: M. Lewis, J. M. Haviland- Jones (Eds.), Handbook of emotions, 2 ed., The Guilford Press, New York, 2000, pp. 220–235. [32] P. N. Juslin, P. Laukka, Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion, Emotion 1 (2001) 381–412. [33] D. Guo, H. Yu, A. Hu, Y. Ding, Statistical analysis of acoustic characteristics of tibetan lhasa dialect speech emotion, SHS Web of Conferences 25 (2016) 1–5. [34] S. Ruder, Neural Transfer Learning for Natural Language Processing, Ph.D. thesis, National University of Ireland, Galway, 2019.