<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Transfer Learning and Data Augmentation Techniques applied to Speech Emotion Recognition in SE&amp;R 2022</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Caroline</forename><surname>Alves</surname></persName>
							<email>carolalves@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bruno</forename><surname>Carlotto</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ciências Matemáticas e de Computação</orgName>
								<orgName type="institution">ICMC-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bruno</forename><surname>Dias</surname></persName>
							<email>brunoadiaspapa1@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anátale</forename><surname>Garcia</surname></persName>
							<email>anatale.garcia@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bruno</forename><surname>Gianesi</surname></persName>
							<email>brunogianesi@usp.br</email>
							<affiliation key="aff2">
								<orgName type="department">Engenharia Mecatrônica</orgName>
								<orgName type="institution">EESC-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Renan</forename><surname>Izaias</surname></persName>
							<email>renan.izaias@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria</forename><surname>Luiza De Morais</surname></persName>
							<email>marialuizamorais@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paula</forename><surname>De Oliveira</surname></persName>
							<email>paulamarindeoliveira@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vinícius</forename><forename type="middle">G</forename><surname>Santos</surname></persName>
							<email>vinicius.santos@alumni.usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rafael</forename><surname>Sicoli</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Flaviane</forename><forename type="middle">R</forename><surname>Fernandes Svartman</surname></persName>
							<email>flavianesvartman@usp.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Letras Clássicas e Vernáculas</orgName>
								<orgName type="institution">FFLCH-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sandra</forename><surname>Aluisio</surname></persName>
							<email>sandra@icmc.usp.br</email>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ciências Matemáticas e de Computação</orgName>
								<orgName type="institution">ICMC-USP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sidney</forename><surname>Leal</surname></persName>
							<email>sidleal@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Ciências Matemáticas e de Computação</orgName>
								<orgName type="institution">ICMC-USP</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Transfer Learning and Data Augmentation Techniques applied to Speech Emotion Recognition in SE&amp;R 2022</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">85169C784A387D7749978856F0DE7B73</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep Learning</term>
					<term>Transfer Learning</term>
					<term>Data Augmentation</term>
					<term>Speech Emotion Recognition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this work, our team ICMC-EESC-FFLCH explores several techniques to address data scarcity and imbalance in SE&amp;R 2022 task dedicated to speech emotion recognition (SER). We evaluate two types of transfer learning models: (i) Multi-task learning, in which two tasks are learned simultaneously, and (ii) Sequential transfer learning where the tasks are learned sequentially. In both models, the auxiliary task is genre classification from speech, using a large dataset with almost 145 hours of speech signals. As for the techniques to balance the training data, we have used the SMOTE (Synthetic Minority Over-sampling Technique) and Praat's Change gender command to over-sampling minority classes. Our Sequential transfer learning architecture, using the two baselines feature sets provided by the shared-task (prosodic audio features and embeddings generated by the Wav2Vec 2.0 model) and the two approaches to balance the training dataset reaches satisfactory performance with a 0.5353 F1-macro, surpassing the prosodic features baseline. On the other hand, our multi-task learning approach using the two baseline features sets and the SMOTE approach to balance the training dataset reaches only a 0.5301 F1-macro. Finally, our worst result is 0.469 F1-macro, obtained with the feature selection experiment (29 prosodic features manually chosen from the literature), using our multi-task learning architecture with the two approaches to balance the training dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>According to <ref type="bibr" target="#b0">[1]</ref>, speech emotion recognition (SER) systems are composed of methods, namely feature extraction and emotion classification, that process and classify speech signals to detect the embedded emotions of speech. They can also include a preprocessing step before the extraction of the features used to normalize the signals, for example, the use of noise reduction techniques. Emotion classes depend on labeled data of the dataset used to create the model; these datasets can be of three types: acted, elicited or natural. While most of the natural datasets are from spontaneous speech recorded in noisy environments, acted speech databases are recorded by professional actors in sound-proof studios. Elicited speech datasets are created by placing speakers in a simulated emotional situation that can stimulate various emotions and can be close to real ones. It is important to notice that, the definition of emotion is an open problem in psychology and there are two models being used in SER systems: discrete and dimensional emotional models. The first one is based on the six primary and culturally independent categories of basic emotions <ref type="bibr" target="#b1">[2]</ref>: sadness, happiness, fear, anger, disgust, and surprise, where other emotions are obtained by the combination of the basic ones. Most of the existing SER systems focus on all these basic emotional categories, sometimes including the neutral category (see, for example, <ref type="bibr" target="#b2">[3]</ref>, a study focusing on Portuguese language), or in a small group of those emotions <ref type="foot" target="#foot_0">1</ref> . The second one, the dimensional emotional model, uses a small number of latent dimensions to define emotions such as: valence, arousal/excitation, control/power. 
In this model, emotions are not independent of each other, instead, they are analogous to each other in a systematic way. <ref type="bibr" target="#b4">[5]</ref> support the thesis that the three dimensions of pleasure-displeasure (valence), arousal-nonarousal (excitation), and dominance-submissiveness (power/control) are both necessary and sufficient to describe a large variety of emotional states. Specifically, valence describes whether an emotion is positive or negative, and it ranges between unpleasant and pleasant; excitation defines the strength of the felt emotion, ranging from boredom to frantic excitement; and the dimension of control/power refers to the seeming strength of the person (between weak and strong). For example, the third dimension differentiates anger from fear by considering the strength or weakness of the person, respectively; however, as the surprise emotion may have positive or negative valence depending on the context, it is difficult to categorize.</p><p>Whereas most studies on SER deal with simulated, noise-free datasets recorded in sound-proof studios <ref type="bibr" target="#b3">[4]</ref>, SE&amp;R 2022 brings a small dataset of approximately 50 minutes, with 625 audio segments (training dataset) from the C-ORAL-BRASIL I corpus <ref type="bibr" target="#b5">[6]</ref>, consisting of audio segments representing Brazilian Portuguese informal spontaneous speech, recorded in natural contexts and noisy environments.</p><p>The two baseline feature sets (prosodic audio features for emotion classification <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> and embeddings generated by the Wav2Vec 2.0 model <ref type="bibr" target="#b8">[9]</ref>) made available for SE&amp;R 2022 were used in this work. Feature selection was also evaluated, focusing on four small prosodic feature sets, manually chosen, with 29, 19, 10, and 8 features, taken from pitch, intensity, and spectrum groups of features. 
While the first SER systems used machine learning methods with a careful feature engineering (see several examples in <ref type="bibr" target="#b9">[10]</ref>), recent approaches use ensembles to learn hybrid acoustic features <ref type="bibr" target="#b10">[11]</ref>, and deep learning architectures, such as multi-task learning <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, attention mechanisms <ref type="bibr" target="#b13">[14]</ref>, and transfer learning approaches <ref type="bibr" target="#b14">[15]</ref>.</p><p>Our contribution to SE&amp;R 2022 explores two architectures based on deep neural networks (DNN) aiming at detecting Speech Emotion Recognition in Portuguese audio files. Our proposal evaluates two types of inductive transfer learning: multi-task <ref type="bibr" target="#b15">[16]</ref> and sequential transfer learning <ref type="bibr" target="#b16">[17]</ref>. In both models, the auxiliary task is gender classification from speech 2 . Since DNN-based classifiers have a generalization error problem when trained with limited datasets, we explore two different data augmentation techniques aimed to balance the training data. We have used the SMOTE <ref type="bibr" target="#b17">[18]</ref> to create synthetic data for the minority classes and Praat's <ref type="bibr" target="#b18">[19]</ref> Change gender command to manipulate the acoustic features in order to create new synthetic data based on the pre-existing ones. The Jupyter notebooks and characterization of the training dataset are publicly available at https://github.com/BrunoBaldissera/ser-transfer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Experimental Framework</head><p>First, we present the original dataset for the main task and the dataset used for the auxiliary task of gender classification from speech in both inductive transfer learning architectures (Section 2.1), noting that the original dataset is unbalanced. Therefore, we applied two techniques for data augmentation (Section 2.2). Section 2.3 presents the feature sets we explored in our linguistically motivated selection of prosodic features, based on the literature. Finally, Section 2.4 presents our multi-task and sequential transfer learning architectures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Datasets</head><p>2.1.1 Primary Task Dataset: official dataset of SE&amp;R shared-task on SER. In the SE&amp;R 2022 shared-task on SER, the audio segments are labeled in three classes: neutral, non-neutral female, and non-neutral male. The neutral class is the majority class (491 samples) and is used to label audio segments with no well-defined emotional state while the non-neutral classes label segments (89 non-neutral-female and 45 non-neutral-male) associated with one of the primary emotional states in the speaker's speech. In order to better understand the training dataset used in this study, seven annotators from our group pursued a qualitative analysis of the dataset. They labeled every audio in the training set with "yes" (meaning presence) or "no" (meaning absence) according to the following categories:</p><p>• Noise: any sort of noise not related with the primary voice(s) <ref type="foot" target="#foot_2">3</ref> , e.g., background chatting, microphone hissing noise, music, children voices, etc.; • Voice overlapping: periods in which there were two primary voices speaking at the exact same moment; • Different gender: the presence of more than one perceived gender in the primary voices of the same audio; and • Voices in sequence: the presence of more than one primary voice in the same audio, but without direct overlapping between them.</p><p>Our evaluation is summarized in Figures <ref type="figure" target="#fig_1">1a and 1b</ref>. As we can see, there is a lot of noisy audio. Although noise is not a problem for the auxiliary task (Audio Gender Classification) <ref type="bibr" target="#b19">[20]</ref> of the neural architectures, only an error analysis can identify possible problems for the SER task as a whole. 
Also, two complex problems were found: high overlapping rate of voices and audios with different genders, which we believe may have an impact on the classification of the 2 non-neutral classes (male and female). Of the 26 non-neutral audios that have different gender,   <ref type="figure" target="#fig_1">1a and 1b</ref> show a characterization of the training dataset, presenting the number of audios with noise, primary overlapping voices, primary voices with different genders, primary voices in sequence, for both types of classes (neutral and non-neutral) audios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2">Auxiliary</head><p>Task Dataset: CETUC. The task of classifying gender based on voice identifies automatically a voice as male or female, based on the audio features. The gender identification of a given speaker was implemented in an undergrad project of one of the authors <ref type="bibr" target="#b19">[20]</ref>, to evaluate machine learning methods, such as decision trees, random forest, gradient boosting, support vector machine, multi-layer perceptron and logistic regression, and to compare the use of distinct features and models applied on different datasets. In addition, the study also assessed whether the models generalize to other contexts, such as other languages (English) or noisy environments, when trained on CETUC dataset <ref type="bibr" target="#b20">[21]</ref> that was recorded in a controlled environment.</p><p>The best performance method (gradient boosting) was trained using the large dataset CETUC, with almost 145 hours of speech signals spoken by 50 male and 50 female speakers <ref type="foot" target="#foot_3">4</ref> , each one pronouncing 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus <ref type="foot" target="#foot_4">5</ref> . The best performance model used three sets of features from audio signals, totalling 44 features: (i) 12 statistics extracted from the highest frequency value, after applying the Fourier transform on the audios, divided into time windows of 0.2 seconds, (ii) the fundamental frequency (F0) statistics ( <ref type="formula">12</ref>) and (iii) 20 MFCCs (Mel-Frequency Cepstral Coefficients), and reached an accuracy of 94,1%. This model was able to generalize well to audios with noise; it reached an accuracy of 90,8% on the testset MLS <ref type="bibr" target="#b21">[22]</ref> with noise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Augmentation Approaches: SMOTE and Praat's Change Gender</head><p>We used two approaches to balance the training dataset applied specifically on audios of non-neutral male and non-neutral female classes: SMOTE <ref type="bibr" target="#b17">[18]</ref> and Praat's Change gender command <ref type="bibr" target="#b18">[19]</ref>.</p><p>It is suggested by the authors of the original SMOTE paper that previously performing a random under-sampling of the majority class followed by over-sampling the minority class tends to yield good results. However, in this work, we have only over-sampled the minority classes, following the work by <ref type="bibr" target="#b22">[23]</ref>, and using the technique in its simplest implementation. Nonetheless, as the synthesis of new data with SMOTE uses a linear combination of randomly chosen neighbors of the underrepresented instances in the feature space rather than just replicating the given instances, we gave more focus to this augmentation approach in place of the simple oversampling (even though a number of such tests was performed). We have used the Python imbalanced-learn package <ref type="bibr" target="#b23">[24]</ref>; all the parameters were set as default.</p><p>Praat's Change gender command allow us to manipulate the acoustic features to create new synthetic data based on the preexisting ones. Through this method, we can change the perceived gender of a given voice into the opposite gender. The second method for data augmentation consists in the use of the algorithm for gender conversion available in the software for acoustic analysis Praat. A total of 133 files were used, 45 of them containing male voices, then converted to female ones, and 88 containing female voices, then converted to male ones<ref type="foot" target="#foot_5">6</ref> . 
The task was undertaken by five annotators and had two phases: attribution of parameters for conversion and quality evaluation of the generated voice. In the quality assessment phase, the annotators changed the previously established default values in order to obtain voices that they judged the most natural as possible. For the conversion process, we first defined the frequency range in which the algorithm parameters were applied, using the values already predefined by the program, with the minimum pitch value being 75 Hz, and the maximum 600 Hz. The algorithm contains four parameters, described below, that can be used for gender conversion, from which we have only used the first two:</p><p>• Formant shift ratio (default value is 1.0) determines the ratio for proportionally modifying the value of formants, i.e., the sound frequency values at which the highest peaks of intensity occur, resulting from the resonance of the sound wave in its path through the vocal tract, from its production in the vocal folds until the moment of emission. The factor valued 1.0 means there is no alteration. For the task, we established the factor value 1.1 as the standard for male-to-female conversion, used in 30 of 45 files, and 0.8 for female-to-male conversion, used in 72 of 88 files. As mentioned above, these values were altered in some files in order to maintain a perceived natural quality of the converted voice: for the other 15 male-to-female converted files, factors between 1.15 or 1.2 were used, and for the other 16 female-to-male converted files, values between 0.85 or 0.9. • New pitch median (default value is 0.0): a new median for the pitch values is established for each file, which, in turn, is used to compose a factor expressed by the ratio between this new median and the original median pitch. This factor is then used by the algorithm to multiply the original pitch values to obtain new values. 
In this metric, the value 0.0 represents the default setting, yielding the factor 1.0, which means no alteration. We established as standard values for this assignment the frequency measurement of 300 Hz for male-to-female conversion, for 35 of 45 files, and 140 Hz for female-to-male conversion, for 58 of 88 files. These values were also altered in some files to achieve a convincing result: for male-to-female conversion, values between 250 Hz and 380 Hz were used for the other 10 files, and for female-to-male conversion, values between 80 Hz and 260 Hz were used for the other 30 files. • Pitch range factor (default value: 1.0) provides for an additional modification in pitch by an extra scaling of the values around the new pitch median, obtained in the previous step.</p><p>A factor of 1.0 means that no additional pitch modification will occur, and a factor valued as 0.0 monotonizes the new sound to the new pitch median. Considering the essential goal of the project, the default value was kept and no modifications for the pitch range were provided. • Duration factor (default value: 1.0) establishes a factor used for lengthening the sound file. For a factor valued less than 1.0, the resulting sound will be shorter than the original, and a value higher than 3.0 will not work. The default value provided by the software was also maintained, as a change in the duration of the sound is deemed as unnecessary for the development of the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Selection of Prosodic Features for SER</head><p>We grouped the 56 prosodic audio features (one of the baseline feature sets) into six classes<ref type="foot" target="#foot_6">7</ref> in order to select those strongly related to the classes defined for SE&amp;R 2022 and evaluate them separately and conjoined: (1) related to voice quality (13 features), including local_jitter and local_shimmer, those from Harmonics-to-Noise Ratio (HNR) and those from Glottal-to-Noise Ratio (GNE); (2) related to intensity (9 features), for example, min_intensity, max_intensity; (3) related to F0 (pitch) (10 features), for example, mean_pitch, stddev_pitch; (4) related to spectrum (10 features), for example, skewness_spectrum, kurtosis_spectrum; (5) related to formants (10 features), for example, formant_dispersion, average_formant; (6) related to vocal tract length (VTL) (4 features), for example, fitch_vtl, vtl_delta_f.</p><p>The groups related to intensity (first 9 features), F0 (from 10 to 19), and spectrum (last 10 features), respectively shown in Table <ref type="table" target="#tab_0">1</ref>, were chosen for our feature selection experiment which included the training of 7 multi-task and 5 sequential classifiers, totalling 12 experiments, shown in Section 3.3. The classifiers used 10 (related to spectrum), 19 (intensity and F0) and 29 (spectrum, intensity, and F0) features and also a subset of 8 features, shown in bold in Table <ref type="table" target="#tab_0">1</ref>.</p><p>According to <ref type="bibr" target="#b24">[25]</ref>, energy, pitch, and time are the three perceptual dimensions on which most vocal indicators of various emotions are based. 
Therefore, the class of acoustic parameters related to F0, intensity, and spectrum were selected because they are reported in the literature as potential correlates of the vocal expression of emotions <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref>. F0 (fundamental frequency) is an acoustic correlate of the rate of vocal cords vibration, that is, the number of times a sound wave produced by the vocal cords is repeated during a given period of time. F0 is perceived as the pitch of the voice, and the range of values for this frequency varies according to sex and age <ref type="foot" target="#foot_7">8</ref> . In turn, sound intensity corresponds to the variations in the air pressure of a sound wave and is perceived as the loudness of a sound. Loudness and pitch are, in fact, elementary domains of the auditory signal and changes in sound intensity and F0 seem to be relevant to emotion analysis: higher and wider pitch ranges and higher sound intensity are typically associated with high arousal emotions (e.g., fear, anger, joy) compared to neutral speech, while lower and narrower pitch ranges and lower sound intensity are more associated with low arousal emotions (e.g., sadness, boredom, calmness) <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b29">30,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33]</ref>. Studies have also shown that emotion affects the distribution of spectral energy across the range of sound frequencies: for example, stronger energy in higher frequency bands is usually associated with high arousal emotions, while weaker energy in the same band is more associated with low arousal emotions [31] 9 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Neural Architectures: multi-task and sequential transfer learning</head><p>Transfer Learning is a machine learning approach that transfers weights trained in one task, domain, or language to a different one, with the aim of improving the learning generalization <ref type="bibr" target="#b16">[17]</ref>. In this work, two Transfer Learning techniques were used: Multi-task and Sequential Transfer Learning. In the first one, the training of the two tasks is performed simultaneously, sharing a layer of weights between the two tasks <ref type="bibr" target="#b15">[16]</ref>. In the second, the weights trained in the first task are transferred to the second, sequentially <ref type="bibr" target="#b33">[34]</ref>. Figure <ref type="figure" target="#fig_3">2</ref> presents the two architectures.</p><p>For the Multi-task architecture, two MultiLayer Perceptron (MLP) neural networks were used, with 4 layers each, sharing a common layer with 100 neurons. The first one focused on the binary gender prediction task, using the CETUC dataset, with 44 neurons in the input layer and one neuron in the output layer. The second (main task), focused on the prediction of the three 9 Many of these studies used speech audios recorded in sound-proof booths with controlled scenarios. Spontaneous speech recorded in natural contexts and noisy environments like SER shared-task dataset interferes with extracted features results, as the acoustic signal is affected by sound sources competing with the target signal, the performance of pitch detection algorithms degrades as the noise level increases, and even the speech signal energy depends on the distance and position between the speaker's mouth and microphone. Therefore, in future work, at least methods for noise incorporation/reduction will be explored to assess the impact of noise on data.  
SER classes, with the number of neurons in the input layer varying from 8 to 824 (according to the features used) and three neurons in the output layer. Both use a previous layer of 10 neurons before the common layer. For the Sequential architecture, two MLP's were also used, but they were trained sequentially. The first for the binary gender prediction task with 44 neurons in the input, a hidden layer of 30 neurons and one neuron in the output. The hidden layer was then frozen and transferred to the second MLP, whose input layer ranged from 73 to 868 (according to the features used) and with three neurons in the output layer (one for each class) of the second task. The frozen layer acted by predicting the gender of the samples (auxiliary task) and passing this prediction as a new internal feature to a layer of 5 neurons before the output (for models with more features this layer was changed to 10 neurons).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>All the 26 models described in Sections 3.1, 3.2 and 3.3 were trained using a batch size of 100 and 300 epochs. <ref type="table">2</ref> presents the results, in ascending order of F1-macro values, for the experiments with the sequential learning architecture. <ref type="table">3</ref> presents the results, in ascending order of F1-macro values, for the experiments with the multi-task learning architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Sequential Learning Results. Table</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Multi-task Learning Results. Table</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>24 have voices overlapping and only 2 have voices in sequence. Of the 56 neutral audios that have different gender, 53 have voices overlapping and only 3 have voices in sequence.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure1: A qualitative analysis of the SER dataset performed by our team. Figures1a and 1bshow a characterization of the training dataset, presenting the number of audios with noise, primary overlapping voices, primary voices with different genders, primary voices in sequence, for both types of classes (neutral and non-neutral) audios.</figDesc><graphic coords="4,96.72,133.21,198.42,122.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Transfer Learning architectures: a) Multi-task: 2 MLP's with 4 layers (1 shared); and b) Sequential: the second MLP with 5 layers uses a frozen layer from the first. Prosodic Features Set 1 is composed of 44 features described in the work developed by<ref type="bibr" target="#b19">[20]</ref> while Prosodic Features Set 2 is composed of 56 features provided by the SE&amp;R shared-task on SER and described in Section 2.3.</figDesc><graphic coords="8,282.09,94.15,221.09,179.97" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Features used in the classifiers of the feature selection experiment.</figDesc><table><row><cell>1</cell><cell>Min_intensity</cell><cell>16 Q1_pitch</cell></row><row><cell>2</cell><cell cols="2">Relative_min_intensity_time 17 Q3_pitch</cell></row><row><cell>3</cell><cell>Max_intensity</cell><cell>18 Mean_absolute_pitch_slope</cell></row><row><cell>4</cell><cell cols="2">Relative_max_intensity_time 19 Pitch_slope_without_octave_jumps</cell></row><row><cell>5</cell><cell>Mean_intensity</cell><cell>20 Center_of_gravity_spectrum</cell></row><row><cell>6</cell><cell>Stddev_intensity</cell><cell>21 Stddev_spectrum</cell></row><row><cell>7</cell><cell>Q1_intensity</cell><cell>22 Skewness_spectrum</cell></row><row><cell>8</cell><cell>Median_intensity</cell><cell>23 Kurtosis_spectrum</cell></row><row><cell>9</cell><cell>Q3_intensity</cell><cell>24 Central_moment_spectrum</cell></row><row><cell cols="2">10 Min_pitch</cell><cell>25 Voiced_fraction</cell></row><row><cell cols="2">11 Relative_min_pitch_time</cell><cell>26 Band_energy</cell></row><row><cell cols="2">12 Max_pitch</cell><cell>27 Band_density</cell></row><row><cell cols="2">13 Relative_max_pitch_time</cell><cell>28 Band_energy_difference</cell></row><row><cell cols="2">14 Mean_pitch</cell><cell>29 Band_density_difference</cell></row><row><cell cols="2">15 Stddev_pitch</cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">There are large lists of datasets used for emotion recognition in<ref type="bibr" target="#b0">[1]</ref> and<ref type="bibr" target="#b3">[4]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Project's github: https://github.com/BrunoGianesi/Speaker-Gender-Recognition.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">We consider primary voices to be the loudest, and secondary voices to be the least prominent in the audio.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://igormq.github.io/datasets/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://www.linguateca.pt/cetenfolha/index_info.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">For one of the audios, the algorithm could not produce a successful conversion.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">The feature voiced_fraction was allocated in the group of spectrum features, instead of with the pitch group.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">For instance, 80-200 Hz for adult males, 180-400 Hz for adult females<ref type="bibr" target="#b28">[29]</ref>, and higher ranges for children. The mean values change for older ages.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Sequential Learning results using 5-fold cross-validation. We indicate in the model's name which feature set was used and whether a data augmentation technique was used (+) or was not used (-). The last line indicates the value of F1-macro for the submitted model, using the full dataset.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Preliminary Evaluation of the Selected Models.</head><p>Table <ref type="table">5</ref> shows the confusion matrices for the first fold (20% of data), related to the three selected models. In the matrices, rows are termed as the actual/true class and columns are termed as the predicted class. For the three selected models, the neutral class had the worst performance. It seems that the auxiliary task (gender classification from speech) has helped in classifying non-neutral male and non-neutral female classes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions and Future Work</head><p>In this work, we evaluated 26 DNN models, using 5-fold cross-validation over the training dataset, and submitted our best models, i.e. those with higher F1-macro, for each group of experiments in Sections 3.1, 3.2, and 3.3. One of the submitted models surpassed the prosodic features baseline, reaching 0.5353 F1-macro. As a future work, we will perform an error analysis to  </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Akçay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Oğuz</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.specom.2019.12.001</idno>
		<idno>doi:</idno>
		<ptr target="https://doi.org/10.1016/j.specom.2019.12.001" />
	</analytic>
	<monogr>
		<title level="j">Speech Communication</title>
		<imprint>
			<biblScope unit="volume">116</biblScope>
			<biblScope unit="page" from="56" to="76" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Facial expressions of emotion</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ekman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Oster</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annual Review of Psychology</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="527" to="554" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">DEEP: Uma arquitetura para reconhecer emoção com base no espectro sonoro da voz de falantes da língua portuguesa</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Da</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Moutinho</surname></persName>
		</author>
		<ptr target="https://bdm.unb.br/handle/10483/27583" />
		<imprint>
			<date type="published" when="2020-01-18">2020. january 18, 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Deep cross-corpus speech emotion recognition: Recent advances and perspectives</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Neurorobotics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Evidence for a three-factor theory of emotions</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mehrabian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of research in Personality</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="273" to="294" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese</title>
		<author>
			<persName><forename type="first">T</forename><surname>Raso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Mittmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC&apos;12), European Language Resources Association (ELRA)</title>
				<meeting>the Eighth International Conference on Language Resources and Evaluation (LREC&apos;12), European Language Resources Association (ELRA)<address><addrLine>Istanbul, Turkey</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="106" to="113" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatic emotion recognition using prosodic parameters</title>
		<author>
			<persName><forename type="first">I</forename><surname>Luengo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Navas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Hernáez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sánchez</surname></persName>
		</author>
		<ptr target="http://www.isca-speech.org/archive/interspeech_2005/i05_0493.html" />
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH 2005 -Eurospeech, 9th European Conference on Speech Communication and Technology</title>
				<meeting><address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2005">September 4-8, 2005. 2005</date>
			<biblScope unit="page" from="493" to="496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Emotion recognition from speech using global and local prosodic features</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Koolagudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Vempada</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Speech Technol</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="143" to="160" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.11477</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Deep learning techniques for speech emotion recognition, from databases to models</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Abbaschian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sierra-Sosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elmaghraby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Ensemble learning of hybrid acoustic features for speech emotion recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Zvarevashe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Olugbara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Algorithms</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Speech Emotion Recognition with Multi-Task Learning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Church</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech 2021</title>
				<meeting>Interspeech 2021</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4508" to="4512" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Speech emotion recognition based on multi-task learning using a convolutional neural network</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">K</forename><surname>Ha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<idno type="DOI">10.1109/APSIPA.2017.8282123</idno>
	</analytic>
	<monogr>
		<title level="m">Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="704" to="707" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kawahara</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2019-2594</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech 2019</title>
				<meeting>Interspeech 2019</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2803" to="2807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Real-time speech emotion recognition using a pretrained image classification network: Effects of bandwidth reduction and companding</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stolar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Best</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bolia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Computer Science</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Multitask learning, Machine Learning -Special issue on inductive transfer</title>
		<author>
			<persName><forename type="first">R</forename><surname>Caruana</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="41" to="75" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Transfer learning in natural language processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-5004</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Association for Computational Linguistics<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="15" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Smote: Synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Artif. Int. Res</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="321" to="357" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Boersma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weenink</surname></persName>
		</author>
		<ptr target="http://www.praat.org/" />
		<title level="m">Praat: Doing phonetics by computer</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Classificação de gênero via análise de áudio utilizando métodos de aprendizado de máquina tradicionais</title>
		<author>
			<persName><forename type="first">B</forename><surname>Gianesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Aluisio</surname></persName>
		</author>
		<ptr target="https://eesc.usp.br/biblioteca/" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">LSF and LPC -Derived Features for Large Vocabulary Distributed Continuous Speech Recognition in Brazilian Portuguese</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">F S</forename><surname>Alencar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alcaim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Asilomar Conference on Signals, Systems and Computers</title>
				<imprint>
			<date type="published" when="2008">2008. 2008</date>
			<biblScope unit="page" from="1237" to="1241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">MLS: A Large-Scale Multilingual Dataset for Speech Research</title>
		<author>
			<persName><forename type="first">V</forename><surname>Pratap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sriram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Synnaeve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech 2020</title>
				<meeting>Interspeech 2020</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2757" to="2761" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Audio-based activities of daily living (adl) recognition with large-scale acoustic embeddings from online videos</title>
		<author>
			<persName><forename type="first">D</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Thomaz</surname></persName>
		</author>
		<idno type="DOI">10.1145/3314404</idno>
		<ptr target="https://doi.org/10.1145/3314404.doi:10.1145/3314404" />
	</analytic>
	<monogr>
		<title level="j">Proc. ACM Interact. Mob. Wearable Ubiquitous Technol</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lemaître</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Aridas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="559" to="563" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Vocal expression and communication of emotion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pittam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Scherer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of emotions</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Haviland</surname></persName>
		</editor>
		<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>The Guilford Press</publisher>
			<date type="published" when="1993">1993</date>
			<biblScope unit="page" from="185" to="198" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Vocal affect expression: a review and a model for future research</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Scherer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological Bulletin</title>
		<imprint>
			<biblScope unit="volume">99</biblScope>
			<biblScope unit="page" from="143" to="165" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Detecting changes in speech expressiveness in participants of a radio program</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Barbosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association</title>
				<meeting><address><addrLine>Brighton, United Kingdom</addrLine></address></meeting>
		<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="2155" to="2158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Survey on speech emotion recognition: Features, classification schemes, and databases</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">El</forename><surname>Ayadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Kamel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Karray</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="572" to="587" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">A Perceptual Study of Intonation: An Experimental-Phonetic Approach to Speech Melody</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Collier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="DOI">10.1017/CBO9780511627743</idno>
	</analytic>
	<monogr>
		<title level="m">Cambridge Studies in Speech Science and Communication</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Acoustic profiles in vocal emotion expression</title>
		<author>
			<persName><forename type="first">R</forename><surname>Banse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Scherer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of personality and social psychology</title>
		<imprint>
			<biblScope unit="volume">70</biblScope>
			<biblScope unit="page" from="614" to="636" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Vocal communication of emotion</title>
		<author>
			<persName><forename type="first">T</forename><surname>Johnstone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Scherer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of emotions</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Haviland-Jones</surname></persName>
		</editor>
		<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>The Guilford Press</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="220" to="235" />
		</imprint>
	</monogr>
	<note>2 ed</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Juslin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Laukka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Emotion</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="381" to="412" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Statistical analysis of acoustic characteristics of tibetan lhasa dialect speech emotion</title>
		<author>
			<persName><forename type="first">D</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SHS Web of Conferences</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">Neural Transfer Learning for Natural Language Processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<pubPlace>Galway</pubPlace>
		</imprint>
		<respStmt>
			<orgName>National University of Ireland</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
