<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Scaranti</surname></persName>
							<email>alexander.scaranti@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Douglas</forename><forename type="middle">Antonio</forename><surname>Rodrigues</surname></persName>
							<email>douglasarsilva@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Prof</roleName><surname>Fernando Meloni</surname></persName>
							<email>fernandomeloni@alumni.usp.br</email>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Prof</roleName><surname>Alessandra Alaniz Macedo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of São Paulo (USP)</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">09762218056F2C45F1E869BA2A3AB4B6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech Processing</term>
					<term>Emotion Recognition</term>
					<term>Portuguese Language</term>
					<term>Natural Language Processing</term>
					<term>Artificial Intelligence</term>
					<term>SofiaFala</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Emotion recognition through speech processing is increasingly in demand, driven by scientific advances and improvements in information technology. However, a gap exists when the demand concerns projects in the Portuguese language. Here, we propose a method for extracting and recognizing emotion in Portuguese. We evaluated response time, length, silence ratio, long silence ratio, and silence rate. According to the SER 2022 evaluation, our strategy reaches a macro-averaged F1 score of 55% on a highly imbalanced dataset. We have aligned our results with the SofiaFala project, which supports speech training for children with Down syndrome.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the last two years, the COVID-19 pandemic has swept the world, creating demands for new approaches to communication and interaction. In turn, 5G technology, which emerged in the second decade of the 21st century, supports new possibilities. In this context, modern voice processing tools built on machine learning algorithms have paved new ground for improving people's quality of life, assisting people with disabilities, and supporting long-distance interaction. These algorithms, the product of researchers' hard work, have opened up new opportunities such as the Speech Emotion Recognition (SER) task.</p><p>Portuguese-speaking countries suffer from a scarcity of tools to support speech and emotion recognition. For instance, pronunciation and language use vary across the many regions of Brazil, a country of continental dimensions. This situation demands research into speech processing that considers utterances with distinct prosody, since speaking style or speech disorders can interfere with speech emotion recognition.</p><p>The SofiaFala software <ref type="bibr" target="#b0">[1]</ref>, developed in the LIS laboratory at USP-Ribeirão Preto-SP, recognizes sounds and images produced during exercises and provides reports on assistive speech training for speech disorders of children with Down syndrome <ref type="bibr" target="#b1">[2]</ref>.</p><p>Expressing emotions through speech is part of oral communication. For voice analysis to generate knowledge, different data types (texts, images, and types of speech) must be manipulated through a coordinated analysis that considers the connections and particularities of sound. This manipulation is challenging and desirable.
For instance, SofiaFala can take advantage of emotion recognition during speech training.</p><p>Here, we propose a speech emotion recognition method that uses the corpus provided by the SER committee, namely CORAA version 1.1, which comprises approximately 50 minutes of audio segments. Our work focuses on identifying emotions in speech. We intend to incorporate SER as a module of the SofiaFala app.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Our Proposal: SER System</head><p>Considering the CORAA dataset made available for the shared task, and aiming at emotion recognition, we developed a computer system called SER to carry out natural language processing and other steps.</p><p>SER was built in Python and executed the experiments presented in Section 3. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the process and the computational modules. SER is composed of the following stages:</p><p>• Acquisition. All information acquired from the dataset CORAA-v1.1 falls into three classes: neutral, non-neutral male, and non-neutral female, amounting to 625 audio fragments that total 50 minutes of speech. The neutral class comprises audio segments without a well-defined emotional state. The non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech. The non-neutral data come from the C-ORAL-BRASIL I corpus, which contains informal spontaneous speech in Brazilian Portuguese (Raso and Mello, 2012). • Preprocessing. We processed all the acquired audios to clean them and improve the performance of the next step, feature extraction. We applied filters to remove noise from the audios <ref type="bibr" target="#b2">[3]</ref>. Moreover, we converted all the audios from stereo to mono and distributed them into the three classes: neutral, non-neutral female, and non-neutral male. • Prosody and Feature Extraction. Extraction is the step that analyzes the audio and brings out the information from which the learning model can be developed, as detailed next. For feature extraction, our system carried out the following steps:</p><p>-Prosody Extraction. Prosodic elements are properties of speech, such as rhythm, stress, and intonation, that carry linguistic function.
We extracted the following features from all the audios in the dataset: response time, response length, silence ratio, long silence ratio, silence rate, frequency, and intensity. -Feature Extraction with MFCC. Mel-Frequency Cepstral Coefficients (MFCC) constitute a feature extraction method for audio based on the Fourier transform <ref type="bibr" target="#b3">[4]</ref>. MFCC is one of the most widely used representations in speech processing because it compactly describes the spectral characteristics of a signal on a scale that approximates human auditory perception.</p></div>
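<div xmlns="http://www.tei-c.org/ns/1.0"><p>The silence-related prosodic features named above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the SER system's own code: the frame length and the -40 dB silence threshold are assumptions chosen for the example.</p><p>

```python
import numpy as np

def prosodic_features(signal, sr, frame_len=1024, silence_db=-40.0):
    """Sketch of a few silence-based prosodic features from a mono signal.

    frame_len and silence_db are illustrative choices, not the settings
    used by the SER system itself.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Per-frame RMS energy, expressed in dB relative to full scale
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    silent = db < silence_db
    length_s = len(signal) / sr                  # response length in seconds
    silence_ratio = float(silent.mean())         # fraction of silent frames
    # Silence rate: number of distinct silent stretches per second
    onsets = np.count_nonzero(np.diff(silent.astype(int)) == 1) + int(silent[0])
    return {"length_s": length_s,
            "silence_ratio": silence_ratio,
            "silence_rate": onsets / length_s}
```

A "long silence ratio" would follow the same pattern, counting only silent runs longer than some minimum duration.</p></div>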
<div xmlns="http://www.tei-c.org/ns/1.0"><head>-Transformation with the MEL Spectrogram</head><p>A logarithmic transformation of the frequency of an audio signal yields the mel scale, whose central idea is that sounds at equal distances on the mel scale are perceived as equally spaced by human listeners <ref type="bibr" target="#b4">[5]</ref>. The transformation from the Hertz scale to the mel scale is as follows:</p><formula xml:id="formula_0">m = 1127 ln(1 + f/700)</formula><p>-Aggregation of Chromagram. We used this strategy to increase the robustness of our logarithmic frequency spectrogram to variations in timbre and instrumentation.</p><p>The main idea of chroma features is to aggregate all spectral information related to a given pitch class into a single coefficient.</p><p>• Classification. We applied an MLP neural network <ref type="bibr" target="#b5">[6]</ref> with the following parameters: one hidden layer of 500 neurons and a maximum of 600 iterations, using the MLPClassifier implementation. • Analysis of Results. After the procedures described above, we divided the recognized emotions into neutral, non-neutral male, and non-neutral female.</p></div>
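<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Hertz-to-mel conversion above is simple enough to state directly in code. The helpers below are a hypothetical sketch of the quoted formula and its inverse (the 1127 coefficient corresponds to the natural logarithm), not part of the SER system itself.</p><p>

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Hertz to mel, using the 1127 * ln(1 + f/700) form quoted above."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, useful for placing mel filter-bank edges."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)
```

For example, 700 Hz maps to 1127 ln 2, roughly 781 mel, and the two functions invert each other.</p></div>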
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results and Discussion</head><p>The trained model reaches an F-score of 84% when 80% of the training base (500 audios; see Table <ref type="table">1</ref>) is used for training. The remaining 20% of the training base (125 audios) is reserved for testing. In Table <ref type="table">2</ref>, a confusion matrix shows the data from the experiments. After we applied the developed model to the available test base and submitted the output to the SER 2022 evaluation, we achieved a macro-averaged F1 score of 55%.</p></div>
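<div xmlns="http://www.tei-c.org/ns/1.0"><p>The macro-averaged F1 score used as the shared-task metric can be computed from a confusion matrix such as the one in Table 2. A minimal sketch, assuming NumPy and not taken from the paper's code:</p><p>

```python
import numpy as np

def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix
    (rows = true class, columns = predicted class).
    A class with zero precision and recall contributes an F1 of 0.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    pred_totals = cm.sum(axis=0)   # column sums: predicted counts per class
    true_totals = cm.sum(axis=1)   # row sums: true counts per class
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp),
                          where=pred_totals > 0)
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp),
                       where=true_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return float(f1.mean())
```

Because every class weighs equally in the average, macro F1 penalizes a model that neglects the minority classes of an imbalanced dataset such as CORAA-v1.1.</p></div>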
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1 - Distribution of Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2 - Confusion Matrix</head><p>Using the 308 audios available for testing, we generated the final results. For classification, we applied the trained MLPClassifier. As a result, 259, 27, and 22 audios were labelled as neutral, non-neutral female, and non-neutral male, respectively, as shown in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3 - Classification</head><p>Graph 1 depicts the classification distribution. Neutral audios (84%) were the majority in the dataset, followed by non-neutral female (9%) and non-neutral male (7%).</p></div>
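<div xmlns="http://www.tei-c.org/ns/1.0"><p>The percentages above follow directly from the counts in Table 3; a few lines suffice to check them:</p><p>

```python
# Sanity check on the reported test-set distribution (Table 3 / Graph 1)
counts = {"neutral": 259, "non-neutral female": 27, "non-neutral male": 22}
total = sum(counts.values())  # 259 + 27 + 22 = 308 test audios
shares = {label: round(100 * n / total) for label, n in counts.items()}
```

</p></div>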
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Final Remarks</head><p>We have proposed a method for extracting and recognizing emotion in the Portuguese language. We carried out a simple process based on preprocessing strategies, prosody extraction, MFCC, MEL, and chromagram features. We reached our goal by using the dataset CORAA-v1.1, which has 625 audios classified as neutral, non-neutral male, and non-neutral female. Our strategy does not rely on external models to manipulate the data and, according to the SER 2022 evaluation, reaches a macro-averaged F1 score of 55%. Owing to its simplicity, we were able to generate the results for the whole set of CORAA audios in 18 seconds.</p><p>Considering the SofiaFala project, we have looked for new possibilities for monitoring, understanding, and even treating speech and emotion. Here, we developed a SofiaFala module aimed at improving a person's functional capacity for speech and, hence, communication. Moreover, we contributed to the usability evaluation of SofiaFala <ref type="bibr" target="#b6">[7]</ref>.</p><p>As future work, we will integrate our SER module into the SofiaFala app. Moreover, we will evaluate the use of external models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 -</head><label>1</label><figDesc>Figure 1 -The SER System: Process and Computational Modules.</figDesc><graphic coords="2,100.20,319.93,396.86,90.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Graph 1 -</head><label>1</label><figDesc>Distribution of Results</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation.</p><p>The authors would like to thank the SofiaFala group, CNPq, C4AI-USP and SER 2022 organizers for their support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sistema de informação de apoio ao programa de educação para pais e famílias</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>De Paula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R G</forename><surname>Panico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Daneluzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">E S</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Felipe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of XI Congresso Brasileiro de Informática em Saúde</title>
				<meeting>XI Congresso Brasileiro de Informática em Saúde</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Sofiafala: Software inteligente de apoio à fala</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H D G</forename><surname>Rissato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Anais Estendidos do XXVII Simpósio Brasileiro de Sistemas Multimídia e Web</title>
				<imprint>
			<publisher>SBC</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="91" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Avaliação da influência da remoção de stopwords na abordagem estatística de extração automática de termos</title>
		<author>
			<persName><forename type="first">I</forename><surname>Braga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th Brazilian Symposium in Information and Human Language Technology (STIL 2009)</title>
				<meeting><address><addrLine>So Carlos, SP, Brazil</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page">18</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Speech recognition using mfcc</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ittichaichareon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suksri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yingthawornsuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on computer graphics, simulation and modeling</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="135" to="138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Venkataramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R</forename><surname>Rajamohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.10458</idno>
		<title level="m">Emotion recognition from speech</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Use of different features for emotion recognition using mlp network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Palo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Mohanty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chandra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Vision and Robotics</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="7" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A nonverbal recognition method to assist speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Meloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sicchieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mandrá</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bulcão-Neto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Macedo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="360" to="365" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
