<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Scaranti</surname></persName>
							<email>alexander.scaranti@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Douglas</forename><forename type="middle">Antonio</forename><surname>Rodrigues</surname></persName>
							<email>douglasarsilva@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Prof</roleName><surname>Fernando Meloni</surname></persName>
							<email>fernandomeloni@alumni.usp.br</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Prof</roleName><surname>Alessandra Alaniz Macedo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">09762218056F2C45F1E869BA2A3AB4B6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech Processing</term>
					<term>Emotion Recognition</term>
					<term>Portuguese Language</term>
					<term>Natural Language Processing</term>
					<term>Artificial Intelligence</term>
					<term>SofiaFala</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Emotion recognition through speech processing is increasingly in demand, driven by scientific advances and improvements in information technology. However, a gap exists when the demand concerns projects in the Portuguese language. Here, we propose a method for extracting and recognizing emotion in Portuguese. We evaluated response time, length, silence ratio, long silence ratio, and silence rate. According to the SER 2022 evaluation, our strategy reaches a macro-averaged F1 score of 55% on a highly imbalanced dataset. We have aligned our results with the SofiaFala project, which supports speech training for children with Down syndrome.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the last two years, the COVID-19 pandemic has swept the world, creating demands for new approaches to communication and interaction. In turn, 5G technology, which emerged in the second decade of the 21st century, supports new possibilities. In this context, modern voice processing tools built on machine learning algorithms have paved new ground for improving people's quality of life, assisting people with disabilities, and supporting long-distance interaction. These algorithms, the product of researchers' hard work, have opened up new opportunities such as the Speech Emotion Recognition (SER) task.</p><p>Portuguese-speaking countries suffer from a scarcity of tools to support speech and emotion recognition. For instance, pronunciation and language use vary across the many regions of Brazil, a country of continental dimensions. This situation demands research into speech processing that considers utterances with distinct prosody, since speaking style or speech disorders can interfere with speech emotion recognition.</p><p>The SofiaFala software <ref type="bibr" target="#b0">[1]</ref>, developed in the LIS laboratory at USP-Ribeirão Preto-SP, recognizes sounds and images produced during exercises and provides reports on assistive speech training for speech disorders of children with Down syndrome <ref type="bibr" target="#b1">[2]</ref>.</p><p>Expressing emotions through speech is part of oral communication. For voice analysis to generate knowledge, different data types (texts, images, and types of speech) must be manipulated through a coordinated analysis that considers the connections and particularities of sound. This manipulation is challenging and desirable.
For instance, SofiaFala can take advantage of emotion recognition during speech training.</p><p>Here, we propose a speech emotion recognition method that uses the corpus provided by the SER committee, namely CORAA version 1.1, which comprises approximately 50 minutes of audio segments. Our work focuses on identifying emotions in speech. We intend to incorporate SER as a module of the SofiaFala app.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Our Proposal: SER System</head><p>Considering the CORAA dataset made available for the shared task, and aiming at emotion recognition, we developed a computer system called SER to carry out natural language processing and other steps.</p><p>SER was built in Python and executed the experiments presented in Section 3. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the process and the computational modules. SER is composed of the following stages:</p><p>• Acquisition. All information acquired from the dataset CORAA-v1.1 falls into three classes: neutral, non-neutral male, and non-neutral female, amounting to 625 audio fragments that total 50 minutes of speech. The neutral class comprises audio segments without a well-defined emotional state. The non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech. The non-neutral data come from the C-ORAL-BRASIL I corpus, which contains informal spontaneous speech in Brazilian Portuguese (Raso and Mello, 2012). • Preprocessing. We processed all the acquired audios to clean them and improve the performance of the next step, feature extraction. We applied filters to remove noise from the audios <ref type="bibr" target="#b2">[3]</ref>. Moreover, we converted all the audios from stereo to mono and distributed them into the three classes: neutral, non-neutral female, and non-neutral male. • Prosody and Feature Extraction. Extraction is the step that analyzes the audio and brings out the information from which the learning model can be developed, as detailed next. For feature extraction, our system carried out the following steps:</p><p>-Prosody Extraction. Prosodic elements are properties of speech, such as rhythm, stress, and intonation, that carry linguistic function.
We extracted the following features from all the audios in the dataset: response time, response length, silence ratio, long silence ratio, silence rate, frequency, and intensity. -Feature Extraction with MFCC. Mel-Frequency Cepstral Coefficients (MFCC) constitute a feature extraction method for audio based on the Fourier transform <ref type="bibr" target="#b3">[4]</ref>. MFCC is one of the most widely used representations in speech processing because it compactly describes the spectral characteristics of a signal on a scale that approximates human auditory perception.</p></div>
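<div xmlns="http://www.tei-c.org/ns/1.0"><p>The silence-related prosodic features named above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the SER system's own code: the frame length and the -40 dB silence threshold are assumptions chosen for the example.</p><p>

```python
import numpy as np

def prosodic_features(signal, sr, frame_len=1024, silence_db=-40.0):
    """Sketch of a few silence-based prosodic features from a mono signal.

    frame_len and silence_db are illustrative choices, not the settings
    used by the SER system itself.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Per-frame RMS energy, expressed in dB relative to full scale
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    silent = db < silence_db
    length_s = len(signal) / sr                  # response length in seconds
    silence_ratio = float(silent.mean())         # fraction of silent frames
    # Silence rate: number of distinct silent stretches per second
    onsets = np.count_nonzero(np.diff(silent.astype(int)) == 1) + int(silent[0])
    return {"length_s": length_s,
            "silence_ratio": silence_ratio,
            "silence_rate": onsets / length_s}
```

A "long silence ratio" would follow the same pattern, counting only silent runs longer than some minimum duration.</p></div>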
<div xmlns="http://www.tei-c.org/ns/1.0"><head>-Transformation with the MEL Spectrogram</head><p>A logarithmic transformation of the frequency of an audio signal yields the mel scale, whose central idea is that sounds at equal distances on the mel scale are perceived as equally spaced by human listeners <ref type="bibr" target="#b4">[5]</ref>. The transformation from the Hertz scale to the mel scale is as follows:</p><formula xml:id="formula_0">m = 1127 ln(1 + f/700)</formula><p>-Aggregation of Chromagram. We used this strategy to increase the robustness of our logarithmic frequency spectrogram to variations in timbre and instrumentation.</p><p>The main idea of chroma features is to aggregate all spectral information related to a given pitch class into a single coefficient.</p><p>• Classification. We applied an MLP neural network <ref type="bibr" target="#b5">[6]</ref> with the following parameters: one hidden layer of 500 neurons and a maximum of 600 iterations, using the MLPClassifier implementation. • Analysis of Results. After the procedures described above, we divided the recognized emotions into neutral, non-neutral male, and non-neutral female.</p></div>
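<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Hertz-to-mel conversion above is simple enough to state directly in code. The helpers below are a hypothetical sketch of the quoted formula and its inverse (the 1127 coefficient corresponds to the natural logarithm), not part of the SER system itself.</p><p>

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Hertz to mel, using the 1127 * ln(1 + f/700) form quoted above."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, useful for placing mel filter-bank edges."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)
```

For example, 700 Hz maps to 1127 ln 2, roughly 781 mel, and the two functions invert each other.</p></div>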
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results and Discussion</head><p>The trained model reaches an F-score of 84% when 80% of the training base (500 audios; see Table <ref type="table">1</ref>) is used for training. The remaining 20% of the training base (125 audios) is reserved for testing. In Table <ref type="table">2</ref>, a confusion matrix shows the data from the experiments. After we applied the developed model to the available test base and submitted the output to the SER 2022 evaluation, we achieved a macro-averaged F1 score of 55%.</p></div>
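<div xmlns="http://www.tei-c.org/ns/1.0"><p>The macro-averaged F1 score used as the shared-task metric can be computed from a confusion matrix such as the one in Table 2. A minimal sketch, assuming NumPy and not taken from the paper's code:</p><p>

```python
import numpy as np

def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix
    (rows = true class, columns = predicted class).
    A class with zero precision and recall contributes an F1 of 0.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    pred_totals = cm.sum(axis=0)   # column sums: predicted counts per class
    true_totals = cm.sum(axis=1)   # row sums: true counts per class
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp),
                          where=pred_totals > 0)
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp),
                       where=true_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return float(f1.mean())
```

Because every class weighs equally in the average, macro F1 penalizes a model that neglects the minority classes of an imbalanced dataset such as CORAA-v1.1.</p></div>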
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1 - Distribution of Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2 - Confusion Matrix</head><p>Using the 308 audios available for testing, we generated the final results. For classification, we applied the trained MLPClassifier. As a result, 259, 27, and 22 audios were labelled as neutral, non-neutral female, and non-neutral male, respectively, as shown in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3 - Classification</head><p>Graph 1 depicts the classification distribution. Neutral audios (84%) were the majority in the dataset, followed by non-neutral female (9%) and non-neutral male (7%).</p></div>
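<div xmlns="http://www.tei-c.org/ns/1.0"><p>The percentages above follow directly from the counts in Table 3; a few lines suffice to check them:</p><p>

```python
# Sanity check on the reported test-set distribution (Table 3 / Graph 1)
counts = {"neutral": 259, "non-neutral female": 27, "non-neutral male": 22}
total = sum(counts.values())  # 259 + 27 + 22 = 308 test audios
shares = {label: round(100 * n / total) for label, n in counts.items()}
```

</p></div>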
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Final Remarks</head><p>We have proposed a method for extracting and recognizing emotion in the Portuguese language. We carried out a simple process based on preprocessing strategies, prosody extraction, MFCC, MEL, and chromagram features. We reached our goal by using the dataset CORAA-v1.1, which has 625 audios classified as neutral, non-neutral male, and non-neutral female. Our strategy does not rely on external models to manipulate the data and, according to the SER 2022 evaluation, reaches a macro-averaged F1 score of 55%. Owing to its simplicity, we were able to generate the results for the whole set of CORAA audios in 18 seconds.</p><p>Considering the SofiaFala project, we have looked for new possibilities for monitoring, understanding, and even treating speech and emotion. Here, we developed a SofiaFala module aimed at improving a person's functional capacity for speech and, hence, communication. Moreover, we contributed to the usability evaluation of SofiaFala <ref type="bibr" target="#b6">[7]</ref>.</p><p>As future work, we will integrate our SER module into the SofiaFala app. Moreover, we will evaluate the use of external models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 -</head><label>1</label><figDesc>Figure 1 -The SER System: Process and Computational Modules.</figDesc><graphic coords="2,100.20,319.93,396.86,90.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Graph 1 -</head><label>1</label><figDesc>Distribution of Results</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation.</p><p>The authors would like to thank the SofiaFala group, CNPq, C4AI-USP and SER 2022 organizers for their support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sistema de informação de apoio ao programa de educação para pais e famílias</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>De Paula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R G</forename><surname>Panico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Daneluzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">E S</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Felipe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of XI Congresso Brasileiro de Informática em Saúde</title>
				<meeting>XI Congresso Brasileiro de Informática em Saúde</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Sofiafala: Software inteligente de apoio à fala</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H D G</forename><surname>Rissato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Anais Estendidos do XXVII Simpósio Brasileiro de Sistemas Multimídia e Web</title>
				<imprint>
			<publisher>SBC</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="91" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Avaliação da influência da remoção de stopwords na abordagem estatística de extração automática de termos</title>
		<author>
			<persName><forename type="first">I</forename><surname>Braga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th Brazilian Symposium in Information and Human Language Technology (STIL 2009)</title>
				<meeting><address><addrLine>So Carlos, SP, Brazil</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page">18</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Speech recognition using mfcc</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ittichaichareon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suksri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yingthawornsuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on computer graphics, simulation and modeling</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="135" to="138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Venkataramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R</forename><surname>Rajamohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.10458</idno>
		<title level="m">Emotion recognition from speech</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Use of different features for emotion recognition using mlp network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Palo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Mohanty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chandra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Vision and Robotics</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="7" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A nonverbal recognition method to assist speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Meloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sicchieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mandrá</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bulcão-Neto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="360" to="365" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
