<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A computational framework for speech emotion recognition in case of multisource data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandra Grossi</string-name>
          <email>alessandra.grossi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Fratti</string-name>
          <email>giorgio.fratti@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Gasparini</string-name>
          <email>francesca.gasparini@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Building U14, Viale Sarca 336</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NeuroMI, Milan Center for Neuroscience</institution>
          ,
          <addr-line>Piazza dell'Ateneo Nuovo 1, 20126 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although much research has been carried out in the field of Speech Emotion Recognition (SER), only a few studies consider people of different ages or languages. In particular, most of the SER datasets reported in the literature are collected from young adults or take into account a single language, such as English or Chinese. These datasets tend to be poorly heterogeneous and dependent on the context in which they are collected. In general, they are composed of acted utterances or are recorded in situations specifically designed to evoke certain emotions. This paper proposes a framework that exploits complementary information coming from multisource data to train a general SER model. To merge different sources, we describe preprocessing steps that normalize the data with respect to the data source, the type of recorded speech, and the subjects considered. Furthermore, we present a domain adaptation strategy that adapts the general model to a certain language and/or a certain age group. In particular, we are interested in developing SER models for Italian older adults. Preliminary results, obtained using several sources for training and a different language as test set, confirm the validity of the proposal.</p>
      </abstract>
      <kwd-group>
        <kwd>speech emotion recognition</kwd>
        <kwd>multisource</kwd>
        <kwd>older adults</kwd>
        <kwd>domain adaptation</kwd>
        <kwd>XGBoost</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        With increasing life expectancy, promoting the psychological well-being of
older adults is becoming a primary need. Many older adults live alone in their own homes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
usually isolated because of health problems or major life events that threaten to limit their social
interaction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The negative impact of this isolation on mental and physical health leads to the
need to develop systems that can monitor [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and interact naturally with older adults during
their daily lives. In particular, Social Robots, such as Companion Type Robots, are being developed
specifically to provide companionship and cognitive support to frail people [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to ensure their
health and psychological well-being [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such robots must be able to interact with people in a
natural and realistic way, inferring their emotions and adapting their behaviour accordingly.
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
Similarly, conversational agents have been proposed in the healthcare domain as a means to help people
who live alone or suffer from mental illnesses such as depression or anxiety disorders. Examples include
voice chatbots like Charlie [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These bots aim to provide empathetic support to elderly people
and encourage conversations that simulate human interaction. Furthermore, a system that can
detect emotions from speech could be incorporated in automated call centres [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or toll-free
helplines for older adults (like the Silver Line) to identify emergency situations, vulnerabilities
or social isolation, and take appropriate action. Language and speech are among the most natural
means of communication between humans, and a variety of emotional information can be drawn
from the acoustic characteristics of the speaker’s voice. Speech Emotion Recognition (SER) is the
task of recognizing the speaker’s emotion through the processing and classification of his/her
speech signal [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Several datasets that address the problem of SER exist in the literature, as previously described
in detail in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where 48 different datasets have been analyzed and summarized in the pie charts
depicted in Figure 1. An Excel file that synthesizes the whole analysis of these datasets is provided
as supplementary material at the following link: https://mmsp.unimib.it/download-1/.
These datasets have been acquired in different ways, which can be grouped as: i) acted, ii) evoked,
and iii) natural conversation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Despite the large number of datasets considered, only a few of
them are available; moreover, their characteristics differ too much for them to be merged
directly into a single large dataset for training a SER model. These differences concern,
among others, the emotional space, language, age, type of collection, and devices adopted. We
here propose a framework intended to normalize these datasets with respect to their different
characteristics, in order to benefit from a consistent amount of speech data with which to train a general
SER model. A subsequent adaptation step then tailors this general model
to a specific SER application, for instance a particular population in
terms of language and age.
      </p>
      <p>The remainder of the manuscript is organised as follows. Section 2 provides a brief description
of the main stages involved in creating a SER computational system that combines different
datasets into a heterogeneous, unified multisource dataset. Section 3 presents our proposed
normalisation framework, mainly focused on standardising data from different acted datasets.
The effectiveness of the proposed framework is evaluated in Section 4 by comparing different
classification models. Finally, next steps and conclusions are outlined in the last section.</p>
    </sec>
    <sec id="sec-3">
      <title>2. A multisource data SER computational framework</title>
      <p>To benefit from complementary information coming from different datasets acquired under different
experimental conditions and using different devices, it is necessary to define a framework that
addresses all the related issues.</p>
      <p>The multisource framework for Speech Emotion Recognition proposed here is depicted in
Figure 2. It is composed of four main modules, devoted respectively to defining the datasets to be used,
defining the signal processing steps that align and normalize the different data sources,
selecting and training an appropriate machine learning model, and finally defining a strategy to
adapt the general model to a specific population with respect to age and language.
Each module is described in what follows.</p>
      <sec id="sec-3-1">
        <title>2.1. Source selection</title>
        <p>The first step in defining a unique multisource heterogeneous dataset for training a general SER
model is to select the individual sources to be included in it. In this phase the characteristics of
each dataset should be taken into account.</p>
        <p>
          As already reported in the introduction, SER datasets vary according to the way in which the
emotions are elicited. In particular, they could be divided into acted, evoked, and spontaneous
or natural conversation datasets. Furthermore, different data could be labelled with different
emotions depending on the emotional model chosen. Two types of model are usually involved in
SER analysis: the discrete or categorical emotional models, which include the 6-basic emotions
defined by Ekman and Friesen [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] or Plutchik’s Wheel of Emotions [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], and the dimensional
or continuous emotional models, such as the 3D Valence-Arousal-Dominance space [13]. When
merging different datasets, the use of similar types of emotional models
simplifies the merging of the data, thus avoiding the imbalance problems due to
the definition of a mapping between categorical and continuous spaces [14].
The characteristics of the speakers can also influence the choice of the datasets to include in the
general model. Most of the available data are collected from young adults and in a
single language, usually English. However, the human voice changes with the age and gender
of the subject [15]. In addition, personality traits or cultural aspects, such as language, may influence
the way a person expresses his/her emotions. In particular, the presence of dialects has to
be taken into account in the definition of a model that can be applied to different contexts or
populations.
        </p>
        <p>The choice of which datasets to use to define the multisource heterogeneous corpus is therefore
relevant and must consider all these aspects. In this context, the use of similar datasets may
simplify the integration process, but may result in reduced data variability.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Signal Processing</title>
        <p>A signal processing step should then be applied to the selected signals to minimize discrepancies
resulting from the diverse nature of the data. The choice of which processing techniques to apply
to each dataset depends on the type of dataset considered, as well as on the population and the device
used to record the audio signals.</p>
        <p>The characteristics of the recording devices can affect some of the audio features, degrading
their quality or reliability [16]. Audio signals may be heterogeneous in terms of number
of channels (e.g. mono or stereo), volume, and frequency resolution when recorded using
different devices, such as high-quality microphones or consumer-grade ones. According to the
literature [17], the volume can be adjusted by standardizing the signals through z-score
normalization. This normalizes the audio in terms of volume, but also with respect
to the characteristics of the subject. In addition, a Sample Rate Conversion can be applied to
the raw signals of each dataset in order to define a single temporal resolution suitable for all
the data collected. This operation could be performed by adopting the minimum sampling rate
found in the original datasets. Finally, to standardize the data according to the number of channels,
stereo audio can be converted into a mono signal by selecting only one of the channels or
by averaging, for each sample, the data of the two channels.</p>
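        <p>As an illustration of two of the operations above, the following minimal Python sketch converts a stereo signal to mono by per-sample averaging and then applies z-score standardization. It is not the implementation used in this work: the function names and the toy sample values are assumptions introduced here, and signals are represented as plain lists of floats.</p>
        <preformat>
```python
from statistics import fmean, pstdev


def stereo_to_mono(left, right):
    """Average the two channels sample by sample."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def zscore(signal):
    """Standardize a signal to zero mean and unit population standard deviation."""
    mean = fmean(signal)
    std = pstdev(signal)
    if std == 0.0:
        # A constant signal has no amplitude variation to rescale: return zeros.
        return [0.0 for _ in signal]
    return [(x - mean) / std for x in signal]


# Toy stereo recording (hypothetical values).
mono = stereo_to_mono([0.2, 0.4, -0.2, -0.4], [0.0, 0.2, -0.4, -0.2])
normalized = zscore(mono)
```
        </preformat>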
        <p>
          The recording environment and the method used to elicit the emotions can also create a mismatch
between the signals of different sources, thus affecting the preprocessing step. For instance,
Fahad et al. [18] report some issues due to the use of speech audio collected during natural
environmental conversations, such as the presence of background noise, multiple voices, long
periods of silence, or utterances of different lengths. Various denoising techniques have been
proposed in the literature to minimize the data mismatch due to these issues. In particular, noise
reduction methods based on filters, estimators or spectral subtraction are usually applied
to reduce background noise [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Moreover, as further speech enhancement techniques,
the audio signal can be filtered using band-pass filters or first-order FIR high-pass digital filters
[19] in order to select the frequency range related to the human voice. Finally, three strategies
have been proposed to overcome the problem of utterances of different lengths [18]:
i) computing global features as summary statistics of local features extracted from the frames
into which the audio signal is split by applying a fixed-size sliding window; ii) using a padding
strategy to standardize the signals in length; iii) dividing the audio signal into segments of fixed
length. Concerning the last of these, short utterances of 0.5 to 1 second are preferred [20] to longer
utterances, as they allow significant features to be extracted while maintaining the quasi-stationary
state of the speech signal.
        </p>
        <p>The final factor to consider in pre-processing is speaker heterogeneity. Speech audio signals
are subjective and vary according to personal characteristics, such as the age, gender or vocal tract
length of the speaker. Such differences make it necessary to apply subject-based normalization
to the audio signals, which becomes mandatory in the case of multisource datasets. Two strategies,
each applied to the data of a single subject, have been investigated in previous studies: i) standardization or
range normalization applied to the whole signal sequence; ii) neutral-based normalization,
where the parameters are estimated on baseline or neutral signals and then applied to the rest of
the audios of the same subject [17].</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Model definition</title>
        <p>Once pre-processed, the audio signals of the multisource dataset can be used for the definition
of the general model. Several algorithms for speech emotion recognition have been proposed
in the literature, including traditional machine learning techniques as well as deep learning
approaches. Based on the model chosen, the necessary input must be obtained from the audio
signals in the form of feature vectors, images or raw signals.</p>
        <p>In particular, in the case of traditional machine learning techniques, four types of acoustic
features can be extracted from speech audio signals: prosodic features, such as rhythm and intonation,
spectral features, voice quality features, and Teager Energy Operator (TEO) based features [21].
Some of these features are subject independent, such as the Ratio of Spectral Flatness to spectral
center (RSS) [22] or the features based on the weighted bi-spectrum [23], while others have to be
normalized to take into account differences between datasets.</p>
        <p>Several deep learning algorithms, such as Convolutional Neural Networks, need images as input.
Time-frequency representations of the audio signals, including spectrograms or scalograms,
are usually employed for this purpose. However, several issues have to be taken into account
during this conversion. In particular, the length of the signals and the image range have to be
homogeneous to make the data comparable.</p>
        <p>To train and validate the general model, several evaluation strategies have been proposed in
the literature. In the case of speech, traditional techniques such as hold-out validation or
k-fold cross validation can lead to biased results. In fact, audio signals from the same subjects
or related to similar utterances may appear in both training and test sets, biasing the model
towards this type of data. Evaluation strategies such as Leave One Subject Out (LOSO), Leave
One Utterance Out (LOUO) or subject-independent k-fold cross validation are mandatory in
the case of multisource dataset analysis. Finally, the emotional model selected can also affect the
evaluation metrics used to assess the classifier. Metrics such as the per-class F1-score or the macro
F1-score are to be preferred in multi-class classification, as they take into account the
issues due to imbalance among the different classes.</p>
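        <p>To make the role of the macro F1-score concrete, the following Python sketch computes per-class F1-scores and their unweighted mean from two label lists. The function names and the toy labels are assumptions introduced for illustration only. In the example, a classifier that never predicts the minority class "sad" still reaches an accuracy of 5/6, while the macro F1-score is pulled down by the zero score of that class.</p>
        <preformat>
```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class label to its F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores[label] = (2 * tp / denom) if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy, imbalanced example (hypothetical labels).
y_true = ["angry", "angry", "angry", "angry", "sad", "neutral"]
y_pred = ["angry", "angry", "angry", "angry", "angry", "neutral"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```
        </preformat>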
        <p>Concerning class imbalance, the use of different datasets can lead to classes that are not
balanced in number of instances. Data augmentation strategies or undersampling methods
have been proposed in the state of the art as solutions to this problem. However, the creation of
synthetic data, as well as the random selection of a subset, could affect the performance of the
classifier, reducing the generalizability of the model or adding bias due to the similarity between
the original and the new data.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Domain Adaptation</title>
        <p>The use of a multisource dataset in the definition of the SER classifier reduces the
generalization error, creating models able to capture meaningful patterns in the speech data. However,
these models are not always suitable for scenarios characterized by few, unlabeled
data, as occurs with certain languages or age groups. Recent studies [24, 14] have presented
multiple transfer-learning methodologies that reuse knowledge acquired from different but
related tasks (source domain) to enhance recognition accuracy for a novel task (target domain).
Several approaches presented in the literature aim to enhance the performance of deep learning SER models
by fine-tuning pre-trained networks, primarily trained on images, using speech data
gathered from specific domains [25]. Furthermore, pre-trained networks can also be employed
as feature extraction methods, as outlined in [26, 27]. Finally, feature-based domain adaptation
strategies have also been tested to adapt pre-trained machine learning classifiers
to new labelled data, as reported in [14, 28].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments on the proposed framework</title>
      <p>In the following analysis, two aspects of the proposed framework are considered: i) the benefits
obtained from a multisource approach to building a general model for SER, addressing all the
normalization aspects required, and ii) the adoption of a domain adaptation strategy to
fine-tune a general model for a different specific scenario.</p>
      <p>Due to the lack of available datasets, especially ones containing a significant number of Italian
older adults, some assumptions have been made for this preliminary study:
• Only acted datasets have been taken into account for the experiments.
• Ekman’s universal emotions, including angry, fear, disgust, surprise, happy, sad, and
neutral, have been selected as emotional model.
• Only domain adaptation with respect to language is here considered.</p>
      <p>Based on these assumptions, four acted datasets have been used to test the performance
of the general model: CREMA_D, RAVDESS, SAVEE, and EMOVO. The Crowd-sourced
Emotional Multimodal Actors Dataset (CREMA_D) [29] and the Ryerson Audio-Visual Database
of Emotional Speech and Song (RAVDESS) [30] are two multi-modal acted datasets collected
in a controlled environment. In the CREMA_D dataset, 91 professional actors (48 male and
43 female) performed 12 semantically neutral phrases while simulating six distinct emotions
(Anger, Disgust, Fear, Happiness, Neutral, and Sadness). Participants of different ethnicities and
ages were involved in the dataset, including 6 older adults. Similarly, the RAVDESS dataset
consists of 7,356 recordings obtained from 24 young adult actors (12 males, 12 females) with a
mean age of 26 years. Each participant performed 60 spoken utterances and 44 sung utterances,
expressing eight emotions (happy, sad, angry, fearful, surprised, disgusted, calm, and neutral)
at two distinct levels of intensity (normal and strong). In both datasets, the utterances
are in English and the audio signals were recorded using a sampling frequency of 48 kHz.
The Surrey Audio-Visual Expressed Emotion (SAVEE) database [31] is an acted dataset in which
4 male actors, aged from 27 to 31 years, pronounce 15 English phonetically-balanced TIMIT
sentences in seven different emotions, including the six Ekman universal emotions as well as
the neutral state. Unlike CREMA_D and RAVDESS, in SAVEE only a subset of the utterances
is common to all the emotions, while most of them change according to the emotional state
expressed. The audios are recorded using a sampling rate of 44.1 kHz.</p>
      <p>Finally, an Italian SER dataset has been included in the analysis to evaluate the performance
of a general model when adapted to recognize emotions from a specific population or scenario.
The Italian Emotional Speech Database (EMOVO) is an acted dataset that includes 588 speech
audios collected from 6 professional actors (3 male and 3 female) while they were reciting 14
sentences mimicking 6 different emotions (disgust, anger, fear, surprise, joy, and sadness) plus
the neutral state. All the utterances are in Italian and they include both semantically correct and
“nonsense” phrases. The audios are recorded with a sampling frequency of 48 kHz and
a bit depth of 16 bits.</p>
      <p>In these preliminary experiments, CREMA_D and RAVDESS have been selected as training
and/or test sets, while SAVEE has been used only for testing. EMOVO has instead been employed
to test the performance of the model when a domain adaptation strategy is applied.</p>
      <p>Using these datasets, two different pipelines, both depicted in Figure 3, are here compared: a basic,
non-normalized framework (hereinafter referred to as Basic_framework) and a framework with normalization
and pre-processing (hereinafter Norm_framework).</p>
      <p>Basic_framework. Since the audios of the considered datasets are already mono signals, the
first step in the Basic_framework pipeline is the segmentation into frames. To limit the variance
in the length of the utterances, only the central 2-second frame of each audio is considered.
Zero padding has been applied to signals shorter than 2 seconds to standardize
them in number of samples. Furthermore, to take into account the temporal variation of the audio
signal, in all the analyses performed the 2-second audio signals have been divided into four
segments, using a non-overlapping window of 0.5 seconds. A total of 140 acoustic features were
used in the definition of the model: 35 acoustic features for each of the four segments into which the audio signal
has been divided.</p>
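      <p>The cropping, padding and segmentation steps just described can be sketched in Python as follows. This is a sketch under stated assumptions rather than the code used in the experiments: the function names are hypothetical, and the text does not say where the zeros are placed, so padding at the end of the signal is an assumption. A toy sampling rate of 8 Hz keeps the example readable.</p>
      <preformat>
```python
def central_crop_or_pad(signal, sample_rate, duration_s=2.0):
    """Keep the central `duration_s` seconds, or zero pad shorter signals."""
    target = int(round(duration_s * sample_rate))
    n = len(signal)
    if n >= target:
        start = (n - target) // 2
        return list(signal[start:start + target])
    # Assumption: zeros are appended at the end of the signal.
    return list(signal) + [0.0] * (target - n)


def split_segments(signal, sample_rate, segment_s=0.5):
    """Split a signal into consecutive, non-overlapping fixed-length segments."""
    size = int(round(segment_s * sample_rate))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


# Toy example: 20 samples at 8 Hz (2.5 s) become four 0.5 s segments.
toy_rate = 8
long_signal = [float(i) for i in range(20)]
fixed = central_crop_or_pad(long_signal, toy_rate)
segments = split_segments(fixed, toy_rate)
```
      </preformat>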
      <p>In particular, both temporal and spectral features have been considered for this analysis:
• MFCC: the first 20 Mel-Frequency Cepstral Coefficients, representing the envelope of
the short-time power spectrum, were evaluated on each 0.5-second frame as descriptors
of the shape of the vocal tract. A window size of 2048 samples and a hop length of 512
samples were selected for the computation of the Fast Fourier Transform (FFT).
• Chroma features: 12 chroma features (or pitch class profile) were evaluated for each frame
as indicators of the harmonic and melodic characteristics of the voice. As for MFCC, the
window size and the hop length of the FFT used to compute the chromagram were set
to 2048 samples and 512 samples, respectively. In our analysis, the chroma spectrogram is
wrapped by averaging, for each frame and along the time axis, the values of all the pitches
belonging to the same pitch class.
• RMS: the global energy of each 0.5-second frame of the signal has been calculated by
taking the Root Mean Square Energy of its amplitude. This prosodic feature is often
evaluated in SER and describes the loudness of the audio.
• Spectral centroid: the speech brightness is evaluated using the Spectral Centroid feature,
which is the average of the signal frequency values weighted by the magnitude of each
frequency. For the evaluation of the frequency content, a Fast Fourier Transform using a
moving window of 2048 samples and an overlap of 25% has been employed.
• ZCR: the Zero-Crossing Rate (ZCR) feature represents the number of times the signal crosses
from positive to negative and vice versa, and is thus a good measure of the frequency
content of the signal. In the current analysis, a single ZCR value is evaluated for each frame.</p>
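      <p>Three of the simpler descriptors listed above (RMS, ZCR and spectral centroid) can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, a naive discrete Fourier transform over the whole frame replaces the windowed FFT with 2048-sample windows described above, and MFCC and chroma features are omitted because they require a filter-bank implementation. The ZCR is expressed here as the fraction of consecutive sample pairs whose sign differs, which is one common convention.</p>
      <preformat>
```python
import cmath
import math


def rms(frame):
    """Root mean square amplitude of a frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency, computed with a naive O(N^2) DFT."""
    n = len(frame)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


# Toy frame: a 100 Hz tone sampled at 800 Hz (hypothetical values).
tone = [math.sin(2 * math.pi * 100 * i / 800) for i in range(64)]
features = [rms(tone), spectral_centroid(tone, 800), zero_crossing_rate(tone)]
```
      </preformat>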
      <p>Norm_framework. In addition to the standard operations applied to the audio signals by
the Basic_framework, the Norm_framework incorporates further pre-processing steps to
standardize the data with respect to variations in data sources or subjects. In particular, before
the segmentation, a sixth-order Butterworth band-pass filter is applied to the signals to
select the frequency band related to the human voice (300 Hz - 4.5 kHz).</p>
      <p>In the Norm_framework, the segmentation task is followed by a subject normalization step
performed using neutral-based normalization. The z-score parameters were estimated
for each subject from his/her signals labeled as neutral, and then used to
standardise his/her remaining audios. Finally, a down-sampling step is applied to all the
signals of both training and test sets in order to standardize the data to a single sampling
frequency. A fixed 16 kHz sampling frequency has been chosen in accordance with [32] and [33].</p>
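      <p>The neutral-based normalization step can be sketched in Python as follows. The data layout (a list of subject, emotion and values tuples), the function name and the toy numbers are assumptions made for illustration. In this sketch the neutral recordings are standardized with the same parameters as the others, and a zero standard deviation is replaced by 1 to avoid a division by zero.</p>
      <preformat>
```python
from statistics import fmean, pstdev


def neutral_based_zscore(recordings):
    """Standardize each recording with its subject's neutral mean and std.

    `recordings` is a list of (subject_id, emotion, values) tuples.
    """
    # Pool the neutral values of every subject.
    neutral = {}
    for subject, emotion, values in recordings:
        if emotion == "neutral":
            neutral.setdefault(subject, []).extend(values)
    params = {s: (fmean(v), pstdev(v)) for s, v in neutral.items()}

    normalized = []
    for subject, emotion, values in recordings:
        if subject not in params:
            raise KeyError(f"no neutral recording for subject {subject!r}")
        mean, std = params[subject]
        scale = std if std else 1.0  # guard against constant neutral signals
        normalized.append((subject, emotion, [(x - mean) / scale for x in values]))
    return normalized


# Toy data (hypothetical subjects and values).
data = [
    ("s1", "neutral", [1.0, 3.0]),
    ("s1", "angry", [4.0, 6.0]),
    ("s2", "neutral", [10.0, 10.0]),
    ("s2", "sad", [12.0]),
]
result = neutral_based_zscore(data)
```
      </preformat>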
      <p>Notice that only a subset of the processing tasks outlined in Section 2 has been taken into
account in the two proposed pipelines. In particular, we focused on the steps necessary to
standardise data in acted datasets, the only type considered in this analysis.</p>
      <p>For each experiment performed, the features extracted from the audio signals have then been
used to train a multi-class gradient boosted decision trees algorithm implemented as XGBoost
[34]. All six universal emotions (angry, fear, disgust, surprise, happy, sad), plus the neutral
state, have been considered in the classification task. To assess and compare the performance
of the different classifiers, a 5-fold subject-independent cross validation strategy has been
applied. In this method, the dataset is randomly partitioned into five subsets so that the data of
the same subject never occur in two different folds. At each iteration, all observations from
one of these groups of subjects are used to test the model, while the remaining observations
are used as training set. Several well-known evaluation metrics have been computed from
the resulting confusion matrix. In particular, the overall performance of the classifier was
measured using the Accuracy and Macro-F1 scores.</p>
      <p>The last experiment performed concerns
the use of a domain adaptation module to analyze the ability of the general model to adapt
to a specific scenario represented by a small dataset. In this analysis, domain adaptation
is used to adapt the general model, trained on English utterances, to recognize data collected
in Italian from young adults. The Transfer AdaBoost for Classification (TrAdaBoost)
supervised domain adaptation strategy has been tested for this purpose. In accordance with [14],
the split of the data into Target and Test sets has been performed using a Leave One
Subject Out cross validation strategy.</p>
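      <p>A subject-independent k-fold partition of the kind described above can be sketched in Python as follows. The function name, the round-robin assignment of shuffled subjects to folds and the toy speaker identifiers are assumptions introduced here, not details taken from the experiments. Setting the number of folds equal to the number of subjects yields a Leave One Subject Out split.</p>
      <preformat>
```python
import random


def subject_independent_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject spans two folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Assign whole subjects, not single recordings, to folds.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for fold in range(n_folds):
        test_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] != fold]
        yield train_idx, test_idx


# Toy data: 10 hypothetical speakers with 5 recordings each.
ids = [f"spk{i % 10}" for i in range(50)]
folds = list(subject_independent_folds(ids, n_folds=5, seed=0))
```
      </preformat>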
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>The initial experiments aim to verify the benefits obtained from a multisource approach to build
a general model for SER.</p>
      <p>To this end, we consider as training sets CREMA_D, RAVDESS, and a multisource dataset
defined as the union of CREMA_D and RAVDESS. The same CREMA_D and RAVDESS, together
with the SAVEE dataset, are then considered as test sets.</p>
      <p>In Table 2 the classification results achieved in the different trials are summarized. The
application of the proposed normalized framework significantly increases the performance
of the classification model in almost all the analyses carried out. In particular, Macro F1-Score
values between 55% and 58% are obtained when the same dataset is employed as training and
test set using the subject-independent 5-fold cross validation. These values outperform the
results achieved by the same model in the Basic_framework benchmark case (about 44%). When
different datasets are used as training and test sets, as expected, the general performance of
the classification models decreases. However, in these cases too, the use of the
Norm_framework improves the performance of the models, especially when the SAVEE dataset is
considered as test set. Here, the use of the multisource dataset as training set achieves
the best result when the normalization framework is applied, with a Macro F1-Score value of
30%. It is worth noting that the use of multisource training datasets enhances performance
compared to using a single dataset for training. This emphasises the significance of adopting a
more diversified and heterogeneous set of data when training SER models.</p>
      <p>The second set of experiments focuses on the domain adaptation (DA) step. The TrAdaBoost
DA module considered here has been applied to fine-tune the general model to the Italian language.
The performance of this module, compared with the performance obtained without domain
adaptation, is reported in Table 2. Although the performance is not high, the results obtained
show an increase both when including the normalisation procedure and when applying DA, suggesting
that this approach is noteworthy and should be further investigated. It is not easy to compare
the results of our framework with others in the state of the art, mainly because, even when several
datasets are considered, they are different from the ones adopted here. Moreover, it is also
difficult to find models that are validated using subject-independent cross validation. Finally,
to the best of our knowledge, there are no other works that deal with the domain adaptation approach with
respect to language or age.</p>
      <p>[Table residue: rows for the training sets CREMA_D, RAVDESS, and Multisource (RAVDESS+CREMA_D), each evaluated on the test sets with the Basic and Norm frameworks.]</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>
        The lack of large emotionally labelled speech datasets makes it necessary to define strategies
to merge individual and heterogeneous data into a single multisource dataset, to benefit from
complementary information. The increase in performance obtained in this work
highlights the potential of defining a general computational framework capable of identifying
emotions from unseen data acquired in different experimental conditions. An important
future development in this direction is to include natural conversations in the multisource
training set, as they better reflect the real scenarios of applicability. A second interesting
outcome of this work is the increase in performance obtained by applying a domain adaptation
module that fine-tunes the general model to a specific scenario. In this work we have considered
only DA on a different language, but we plan in the future to test our proposal also on a
dataset of audios recorded from Italian older adults, composed of acted utterances and natural
conversations, which we have already collected but not completely labelled, and which is described
in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research is partially supported by the FONDAZIONE CARIPLO “AMPEL: Artificial
intelligence facing Multidimensional Poverty in ELderly” (CUP H45F20000840007 Ref. 2020-0232)
and co-funded by the European Union – Next Generation EU, in the context of the
National Recovery and Resilience Plan, Investment Partenariato Esteso PE8 “Conseguenze e sfide
dell’invecchiamento”, Project Age-It (Ageing Well in an Ageing Society) PE00000015 − CUP:
H43C22000840006.
</p>
      <p>[12] R. Plutchik, The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice, American scientist 89 (2001) 344–350.</p>
      <p>[13] A. Mehrabian, J. A. Russell, An approach to environmental psychology, The MIT Press, 1974.</p>
      <p>[14] F. Gasparini, A. Grossi, Sentiment recognition of italian elderly through domain adaptation on cross-corpus speech dataset, in: Proceedings of the Italian Workshop on Artificial Intelligence for an Ageing Society 2022, volume 3367 of AI*IA, CEUR, 2022, pp. 12–28.</p>
      <p>[15] A. Dehqan, R. C. Scherer, G. Dashti, A. Ansari-Moghaddam, S. Fanaie, The effects of aging on acoustic parameters of voice, Folia Phoniatrica et Logopaedica 64 (2013) 265–270.</p>
      <p>[16] F. Busquet, F. Efthymiou, C. Hildebrand, Voice analytics in the wild: Validity and predictive accuracy of common audio-recording devices, Behavior Research Methods (2023) 1–21.</p>
      <p>[17] R. Böck, O. Egorow, I. Siegert, A. Wendemuth, Comparative study on normalisation in emotion recognition from speech, in: Intelligent Human Computer Interaction: 9th International Conference, IHCI 2017, Evry, France, December 11-13, 2017, Proceedings 9, Springer, 2017, pp. 189–201.</p>
      <p>[18] M. S. Fahad, A. Ranjan, J. Yadav, A. Deepak, A survey of speech emotion recognition in natural environment, Digital signal processing 110 (2021) 102951.</p>
      <p>[19] X. Wu, Q. Zhang, Design of aging smart home products based on radial basis function speech emotion recognition, Frontiers in Psychology 13 (2022) 882709.</p>
      <p>[20] J. Chang, X. Zhang, Q. Zhang, Y. Sun, Investigating duration effects of emotional speech stimuli in a tonal language by using event-related potentials, IEEE Access 6 (2018) 13541–13554.</p>
      <p>[21] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern recognition 44 (2011) 572–587.</p>
      <p>[22] E. H. Kim, K. H. Hyun, S. H. Kim, Y. K. Kwak, Improved emotion recognition with a novel speaker-independent feature, IEEE/ASME transactions on mechatronics 14 (2009) 317–325.</p>
      <p>[23] C. Yogesh, M. Hariharan, R. Yuvaraj, R. Ngadiran, S. Yaacob, K. Polat, et al., Bispectral features and mean shift clustering for stress and emotion recognition from natural speech, Computers &amp; Electrical Engineering 62 (2017) 676–691.</p>
      <p>[24] S. Sahoo, P. Kumar, B. Raman, P. P. Roy, A segment level approach to speech emotion recognition using transfer learning, in: Asian Conference on Pattern Recognition, Springer, 2019, pp. 435–448.</p>
      <p>[25] M. N. Stolar, M. Lech, R. S. Bolia, M. Skinner, Real time speech emotion recognition using rgb image classification and transfer learning, in: 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS), IEEE, 2017, pp. 1–8.</p>
      <p>[26] G. Boateng, T. Kowatsch, Speech emotion recognition among elderly individuals using multimodal fusion and transfer learning, in: Companion Publication of the 2020 International Conference on Multimodal Interaction, 2020, pp. 12–16.</p>
      <p>[27] S. Akinpelu, S. Viriri, Robust feature selection-based speech emotion classification using deep transfer learning, Applied Sciences 12 (2022) 8265.</p>
      <p>[28] A. Hassan, R. Damper, M. Niranjan, On acoustic emotion recognition: compensating for covariate shift, IEEE Transactions on Audio, Speech, and Language Processing 21 (2013) 1458–1468.</p>
      <p>[29] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, R. Verma, Crema-d: Crowdsourced emotional multimodal actors dataset, IEEE transactions on affective computing 5 (2014) 377–390.</p>
      <p>[30] S. R. Livingstone, F. A. Russo, The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english, PloS one 13 (2018) e0196391.</p>
      <p>[31] S. Haq, P. J. Jackson, J. Edge, Audio-visual feature selection and reduction for emotion classification, in: Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP’08), Tangalooma, Australia, 2008.</p>
      <p>[32] S. Sarker, K. Akter, N. Mamun, A text independent speech emotion recognition based on convolutional neural network, in: 2023 International Conference on Electrical, Computer and Communication Engineering (ECCE), IEEE, 2023, pp. 1–4.</p>
      <p>[33] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, et al., A database of german emotional speech, in: Interspeech, volume 5, 2005, pp. 1517–1520.</p>
      <p>[34] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mozelius</surname>
          </string-name>
          ,
          <article-title>Human-computer interaction for older adults: a literature review on technology acceptance of ehealth systems</article-title>
          ,
          <source>Journal of Engineering Research and Sciences (JENRS)</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>119</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hutson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Bentley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bianchi-Berthouze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bowling</surname>
          </string-name>
          ,
          <article-title>Investigating the suitability of social robots for the wellbeing of the elderly</article-title>
          ,
          <source>in: Affective Computing and Intelligent Interaction: 4th International Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011, Proceedings, Part I 4</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Esteban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Senft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Billing</surname>
          </string-name>
          ,
          <article-title>Social robots in therapy and care</article-title>
          ,
          <source>in: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>669</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Broekens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heerink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rosendal</surname>
          </string-name>
          , et al.,
          <article-title>Assistive social robots in elderly care: a review</article-title>
          ,
          <source>Gerontechnology</source>
          <volume>8</volume>
          (
          <year>2009</year>
          )
          <fpage>94</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ragno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borboni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vannetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Amici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cusano</surname>
          </string-name>
          ,
          <article-title>Application of social robots in healthcare: Review on characteristics, requirements, technical solutions</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>6820</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Valtolina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Charlie: A chatbot to improve the elderly quality of life and to make them more active to fight their sense of loneliness</article-title>
          ,
          <source>in: CHItaly 2021: 14th Biannual Conference of the Italian SIGCHI Chapter</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bojanić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Delić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <article-title>Call redistribution for a call center based on speech emotion recognition</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>4653</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Akçay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</article-title>
          ,
          <source>Speech Communication</source>
          <volume>116</volume>
          (
          <year>2020</year>
          )
          <fpage>56</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <article-title>Ser_ampel: a multi-source dataset for speech emotion recognition of italian older adults</article-title>
          ,
          <source>in: Proceedings of the 12th Italian Forum Ambient Assisted Living</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Wani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Gunawan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. A.</given-names>
            <surname>Qadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kartiwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ambikairajah</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of speech emotion recognition systems</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>47795</fpage>
          -
          <lpage>47814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. V.</given-names>
            <surname>Friesen</surname>
          </string-name>
          ,
          <article-title>Constants across cultures in the face and emotion</article-title>
          ,
          <source>Journal of Personality and Social Psychology</source>
          <volume>17</volume>
          (
          <year>1971</year>
          )
          <fpage>124</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Plutchik</surname>
          </string-name>
          ,
          <article-title>The nature of emotions: Human emotions have deep evolutionary roots, a</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>