<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Scaranti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas Antonio Rodrigues Silva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Meloni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Alaniz Macedo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of São Paulo (USP)</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Demand for emotion recognition through speech processing has grown alongside scientific advances and improvements in information technology. However, a gap remains when the demand concerns projects in the Portuguese language. Here, we propose a method for extracting and recognizing emotion in Portuguese speech. We have evaluated response time, length, silence ratio, long silence ratio, and silence rate. According to the SER 2022 evaluation, our strategy reaches a macro-averaged F1 score of 55% on a highly imbalanced dataset. We have aligned our results with the SofiaFala project, which supports speech training for children with Down syndrome.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech Processing</kwd>
        <kwd>Emotion Recognition</kwd>
        <kwd>Portuguese Language</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>SofiaFala</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last two years, the COVID-19 pandemic has swept the world, creating new demands
for different approaches to communication and interaction. In turn, 5G technology, which
emerged in the second decade of the 21st century, supports new possibilities. In this context,
voice processing tools built on modern algorithms have paved new ground for improving people’s
quality of life, assisting people with disabilities, and supporting long-distance interaction. These
algorithms, the product of researchers’ hard work, have opened up new opportunities such as the
Speech Emotion Recognition task.</p>
      <p>Portuguese-speaking countries suffer from a scarcity of tools to support speech and emotion
recognition. For instance, speech sounds and language vary across the many regions of Brazil, a
country of continental dimensions. This situation demands research into speech manipulation
that considers utterances that sound prosodically distinct. Speaking manner or speech disorders
can also interfere with speech emotion recognition.</p>
      <p>
        The SofiaFala software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], developed in the LIS laboratory at USP-Ribeirão Preto-SP,
recognizes sounds and images produced during exercises and provides reports on assistive speech
training for speech disorders of children with Down syndrome [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Expressing emotions is an integral part of oral communication through the voice. For
voice analysis to generate knowledge, different data types (texts, images, and types of
speech) must be manipulated through a coordinated analysis that considers the connections and
particularities of sound. This manipulation is both challenging and desirable. For instance, SofiaFala
can take advantage of emotion recognition during speech training.</p>
      <p>Here, we propose a speech emotion recognition method that uses the corpus provided by the
SER committee, namely CORAA version 1.1, which is composed of approximately 50 minutes
of audio segments. Our work focuses on the clipping of emotions in speech. We intend to
incorporate SER as a module of the SofiaFala app.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Proposal: SER System</title>
      <p>Considering the dataset CORAA available for the shared task and aiming at recognizing emotion,
we have developed a computer system called SER to carry out natural language processing and
other steps.</p>
      <p>
        SER was built in Python, and it executed the experiments presented in Section 3. Figure 1
illustrates the process and the computational modules.
• Acquisition. All information acquired from the dataset CORAA-v1.1 falls into three classes:
neutral, non-neutral male, and non-neutral female, amounting to 625 audio fragments that total 50
minutes of speech. The neutral class comprises audio segments without a well-defined
emotional state. The non-neutral classes represent segments associated with one of the
primary emotional states in the speaker’s speech. The non-neutral segments come from the
C-ORAL-BRASIL I corpus, which contains informal spontaneous speech in Brazilian Portuguese
(Raso and Mello, 2012).
• Preprocessing. We processed all the acquired audios to clean them and improve the
performance of the next step, feature extraction. We also applied filters to remove noise
from the audios [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Moreover, we converted all the audios from stereo to mono and
distributed them into three classes: neutral, non-neutral female, and non-neutral male.
• Prosody and Feature Extraction. Extraction is the step that analyzes the audio and brings out
the information from which the learning model can be developed, as detailed next.
In terms of feature extraction, our system carried out the following steps:
– Prosody Extraction. Prosodic elements are properties of speech that carry
linguistic function. We extracted the following features from all the audios in the
base: response time, response length, silence ratio, long silence ratio, silence rate,
frequency, and intensity.
– Feature extraction with MFCC. MFCC is a feature extraction method for audio that
uses the Fourier transform [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. MFCC is among the most used methods in speech processing
because it is well suited to representing audio and signal characteristics: it
captures sound in a way that approximates human auditory perception.
– Transformation with Spectrogram (MEL). A logarithmic transformation of the audio
signal’s frequency axis yields the Mel scale, whose central idea is that equal
distances on the scale correspond to equal perceived pitch differences [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Transformation from
the Hertz scale to the Mel scale is as follows:
      </p>
      <p>
        mel(f) = 1127 · ln(1 + f/700)
– Aggregation of Chromagram. We used this strategy to increase the robustness of
our logarithmic frequency spectrogram to variations in timbre and instrumentation.
The main idea of chroma features is to aggregate all the spectral information related to
a given pitch class into a single coefficient.
• Classification. We applied an MLP Neural Network [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with the following parameters:
      </p>
      <p>Hidden Layer = 500, iterations = 600, using MLPClassifier.
• Analysis of Results. After the procedures described above, we divided the recognized
emotions into neutral, non-neutral male, and non-neutral female.</p>
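      <p>Two of the steps above can be sketched in plain Python/NumPy: the Hertz-to-Mel mapping from the formula in this section, and the silence-based prosody features. This is a minimal sketch, not the SER implementation; the frame length, energy threshold, and long-silence cutoff are illustrative assumptions.</p>
      <preformat>
```python
import math
import numpy as np

def hz_to_mel(f_hz):
    """Hertz to Mel, per the formula in Section 2: mel = 1127 * ln(1 + f/700)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def silence_features(signal, sr, frame_ms=25, threshold=0.02):
    """Silence-based prosody features for a mono signal scaled to [-1, 1].

    Returns (silence ratio, long silence ratio, silence rate). The frame
    length and energy threshold are illustrative choices, not the values
    used by the SER system.
    """
    frame_len = max(1, int(sr * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    silent = threshold > rms          # frames whose energy falls under the threshold
    silence_ratio = float(np.mean(silent))

    # group consecutive silent frames into runs
    runs, run = [], 0
    for s in silent:
        if s:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)

    # "long" silence: runs of at least 8 frames (about 200 ms here), an assumed cutoff
    long_frames = sum(r for r in runs if r >= 8)
    long_silence_ratio = long_frames * frame_len / max(1, len(signal))
    silence_rate = len(runs) / (len(signal) / sr)   # silent stretches per second
    return silence_ratio, long_silence_ratio, silence_rate

# toy check: a 1-second 220 Hz tone with a quarter-second gap in the middle
sr = 16000
t = np.arange(sr) / sr
sig = 0.5 * np.sin(2 * np.pi * 220 * t)
sig[6000:10000] = 0.0
print(hz_to_mel(1000.0))          # close to 1000 mel by construction of the scale
print(silence_features(sig, sr))
```
      </preformat>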
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>The trained model has an F-score of 84% when 80% of the training base (550 audios; see Table
1) is used. The other 20% of the training base (125 audios in total) is reserved for the tests. In Table 2, a
confusion matrix shows data from the experiments. After we applied the developed model to
the available test base and submitted it to SER, we achieved a macro-averaged F1 score
of 55%.</p>
      <p>Using the 308 audios available for testing, we generated the final results. For
classification, we used the MLPClassifier. As a result, 259, 27, and 22 audios were labelled as
neutral, non-neutral female, and non-neutral male, respectively, as shown in Table 3.</p>
      <p>Graph 1 depicts the classification distribution. Neutral audios (84%) were the majority in the
dataset, followed by non-neutral female (9%), and non-neutral male (7%).</p>
      <p>Graph 1 - Distribution of Results</p>
    </sec>
    <sec id="sec-4">
      <title>4. Final Remarks</title>
      <p>We have proposed a method for extracting and recognizing emotion in the Portuguese language.
We have carried out a simple process based on preprocessing strategies, prosody extraction,
MFCC, MEL, and Chromagram. We have reached our goal by using the dataset CORAA-v1.1,
which has 625 audios classified as neutral, non-neutral male, and non-neutral female. Our strategy does
not take advantage of external models to manipulate the data, and, according to the SER 2022
evaluation, it reaches a macro-averaged F1 score of 55%. Owing to this simplicity, we were able to
generate the results in 18 seconds for the whole set of CORAA audios.</p>
      <p>
        By considering the SofiaFala project, we have looked for new possibilities for monitoring,
understanding, and even treating speech and emotion. Here, we have developed a SofiaFala
module aiming at improving a person’s functional capacity of speech, and hence, communication.
Moreover, we have contributed to the usability evaluation of SofiaFala [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>As future work, we will integrate our SER module into the SofiaFala app. Moreover, we will
evaluate the use of external models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was carried out at the Center for Artificial Intelligence (C4AI- USP), with support
by the São Paulo Research Foundation (FAPESP grant 2019/07665-4) and by the IBM Corporation.</p>
      <p>The authors would like to thank the SofiaFala group, CNPq, C4AI- USP and SER 2022
organizers for their support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>de Paula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R. G.</given-names>
            <surname>Panico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Daneluzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. E. S.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Felipe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Macedo</surname>
          </string-name>
          , Sistema de informação de apoio ao programa de educação para pais e famílias,
          <source>in: Proceedings of XI Congresso Brasileiro de Informática em Saúde</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. H. D. G.</given-names>
            <surname>Rissato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Macedo</surname>
          </string-name>
          , Sofiafala: Software inteligente de apoio à fala,
          <source>in: Anais Estendidos do XXVII Simpósio Brasileiro de Sistemas Multimídia e Web</source>
          , SBC,
          <year>2021</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Braga</surname>
          </string-name>
          ,
          <article-title>Avaliação da influência da remoção de stopwords na abordagem estatística de extração automática de termos</article-title>
          ,
          <source>in: 7th Brazilian Symposium in Information and Human Language Technology (STIL 2009)</source>
          , São Carlos, SP, Brazil,
          <year>2009</year>
          , p.
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ittichaichareon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suksri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yingthawornsuk</surname>
          </string-name>
          ,
          <article-title>Speech recognition using MFCC</article-title>
          ,
          <source>in: International conference on computer graphics, simulation and modeling</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Venkataramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Rajamohan</surname>
          </string-name>
          ,
          <article-title>Emotion recognition from speech</article-title>
          , arXiv preprint arXiv:1912.10458 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Palo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Mohanty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <article-title>Use of different features for emotion recognition using MLP network</article-title>
          ,
          <source>in: Computational Vision and Robotics</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Meloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sicchieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mandrá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bulcão-Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Macedo</surname>
          </string-name>
          ,
          <article-title>A nonverbal recognition method to assist speech</article-title>
          ,
          <source>in: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>360</fpage>
          -
          <lpage>365</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>