=Paper=
{{Paper
|id=Vol-3367/paper2
|storemode=property
|title=Sentiment Recognition of Italian Elderly through Domain Adaptation on Cross-corpus Speech Dataset
|pdfUrl=https://ceur-ws.org/Vol-3367/paper2.pdf
|volume=Vol-3367
|authors=Francesca Gasparini,Alessandra Grossi
|dblpUrl=https://dblp.org/rec/conf/aiia/GaspariniG22
}}
==Sentiment Recognition of Italian Elderly through Domain Adaptation on Cross-corpus Speech Dataset==
Francesca Gasparini and Alessandra Grossi, Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Italy

Abstract
The aim of this work is to define a speech emotion recognition (SER) model able to recognize positive, neutral and negative emotions in natural conversations of Italian elderly people. Several datasets for SER are available in the literature. However, most of them are in English or Chinese, were recorded while actors and actresses pronounced short phrases, and are thus not related to natural conversation. Moreover, only few of the recordings across these databases come from elderly people. Therefore, in this work, a multi-language and multi-age corpus is considered, merging an English dataset that also includes elderly people with an Italian dataset. A general model based on XGBoost, trained on young and adult English actors and actresses, is proposed. Two domain adaptation strategies are then applied to adapt the model both to elderly people and to Italian speakers. The results suggest that this approach increases the classification performance, while also underlining that new datasets should be collected.

Keywords: Speech emotion recognition, Sentiment recognition, Domain adaptation, cross-corpus SER, cross-language SER

AIXAS2022: Italian Workshop on Artificial Intelligence for an Ageing Society, Udine, Italy, November 29, 2022. Contact: francesca.gasparini@unimib.it (F. Gasparini); alessandra.grossi@unimib.it (A. Grossi). ORCID: 0000-0002-6279-6660 (F. Gasparini); 0000-0003-1308-8497 (A. Grossi).

1. Introduction

Emotions play a relevant role in defining individuals' behaviours and coordination in human-human interactions [1]. In particular, humans find spoken conversation a more natural and effective way to express themselves than its written form [2]. During conversations, people try to convey their thoughts not only through words but also through bodily, vocal or facial expressions [1, 3]. In vocal expressions specifically, the affective state of individuals is conveyed by both the linguistic and the acoustic information carried by the speech [4]. For instance, the same sentence said with different intonations can express different emotions of the speaker and can thus lead to a different response from the listener [5]. Therefore, in order to create a natural interaction between humans and computers, the machine must be able to understand emotions from the speaker's voice and adapt accordingly. Speech Emotion Recognition (SER) is the task of processing and classifying speech signals in order to recognize the emotional state of the speaker [5, 6]. Systems based on SER have different fields of application, such as health care [7], e-learning tutoring [8], automotive [9] or entertainment [10, 11]. In particular, these kinds of systems can be employed to build diagnostic tools that help therapists detect psychological disorders [12] or to automatically recognize mental state alterations in drivers [13].
Automatic emotion detection systems can also be used in call centers or mobile communications to detect the emotions of callers and help agents improve the quality of service [14, 15], or in human-robot interaction to support a more natural and social communication between human and machine [16, 17].

A considerable amount of research has been carried out in the field of Speech Emotion Recognition over the last three decades [18]. Many of these analyses consider only one of the two sources of information, linguistic or acoustic, while more recent work examines multi-modal approaches [19]. In our study, we focus only on acoustic information. In this field, both traditional machine learning and deep learning approaches have been investigated in previous literature. In general, the traditional pipeline of a SER system consists of three steps: signal preprocessing, feature extraction and classification [20]. Concerning feature extraction, different sets of features have been tested: traditional features extracted from the audio signal [2], including prosodic (such as pitch, energy and duration), spectral (such as fundamental frequency, Mel Frequency Cepstral Coefficients or Linear Prediction Cepstral Coefficients) and voice quality features (such as jitter or shimmer), as well as deep features extracted by pre-trained networks. In the latter case, the audio signals are usually represented as spectrograms or scalograms and fed to a pre-trained network to extract features [21, 22]. With reference to classifiers, traditional models have been employed in several studies such as [23, 24]. In particular, according to [25], the classical classification techniques preferred in SER systems are Gaussian Mixture Models, Hidden Markov Models, Artificial Neural Networks, Decision Trees and Support Vector Machines. In a few analyses [26, 27], ensemble techniques combining several classifiers have also been tested. Deep approaches have also been considered in recent years. In particular, frameworks using Convolutional Neural Networks (CNN) [28], Recurrent Neural Networks (RNN) [29] and Long Short-Term Memory networks (LSTM) [30] have been evaluated, using both traditional features [31] and raw audio signals. In some cases, attention mechanisms [32, 33] or autoencoders [34] have been added to the classifiers in order to increase performance. The main SER approaches are summarized in review manuscripts such as [6] or [25]. Despite the large number of analyses carried out, there are still numerous issues that make it difficult to recognize emotions in speech. In [18] some of these challenges, and the approaches tested so far to address them, are summarized. In particular, speech emotion recognition algorithms struggle to recognize emotions when people of different languages or ages are considered.

Many datasets have been collected in the literature for SER purposes. These corpora can be classified into three groups according to how the emotional speech is generated [35]: i) Acted datasets, where the data are collected from actors/actresses who try to simulate emotions; ii) Evoked or Elicited datasets, where the subjects are involved in situations especially created to evoke or induce certain emotions; and iii) Spontaneous or Natural datasets, which contain more authentic emotions collected from real-world situations like call centers or public places [18].
Most of the datasets available in the literature are composed of recited speech [36], while only a few of them consider natural conversations [37, 38, 39]. Moreover, the languages considered are mainly English and Chinese. It has been demonstrated that language has a strong influence on how emotions are expressed [24], and multi-language datasets have thus been proposed [40, 41]. Age is another factor that influences the acoustic characteristics of the voice, especially in the case of the elderly [42, 43]. However, this is still an open field of research: few works face the problem of SER for elderly people or across ages [22, 33, 44, 45], and old subjects are rarely present in the available datasets [46, 47, 48, 38]. In this work we consider the problem of SER for elderly Italian people, focusing on positive, neutral and negative emotions. We propose a multi-language, multi-age approach based on a cross-corpus dataset, described in Section 2. We start from a general model trained on an English dataset of young and adult subjects, and we refine this model to adapt it both to elderly subjects and to the Italian language, as described in Section 3, adopting two different domain adaptation techniques. Section 4 presents the preprocessing of the raw data, the feature extraction and the data augmentation needed to apply the proposed solutions. The results, discussed in Section 5, underline the potential and the limits of the proposed approaches, while future perspectives are drawn in the Conclusions.

2. Cross-corpus dataset

In this work, we consider two datasets available in the literature, labeled with emotions and characterized either by the presence of elderly subjects or by the presence of Italian sentences: the CRowd-sourced Emotional Multimodal Actors Dataset CREMA-D [47] and EMOVO [49]. CREMA-D [47] is a free audio-visual dataset collected to investigate facial and vocal expressions and the perception of acted emotions. It consists of 7442 audio and video recordings of professional actors performing 12 utterances, each expressed in six emotional states (happy, sad, anger, fear, disgust and neutral) at different intensity levels. For the first utterance, the actors were directed to simulate each emotion at three levels of intensity (low, medium and high), while for the other eleven sentences they were free to express the emotion at their preferred intensity. The sentences selected for the experiment are in English and have neutral semantic content. In total, 48 actors and 43 actresses of different ages and ethnicities were involved in the experiments, including 6 elderly subjects older than 60 years and 85 adults aged between 20 and 59 years. For the purpose of our analysis, the two groups of subjects are considered separately, with a total of 492 signals for the elderly, hereinafter named CREMA-D-ELD, and 6950 signals for the adults (CREMA-D-ADULT). For further details on the CREMA-D dataset, please refer to the reference manuscript [47]. EMOVO [49] is a freely available acted emotional speech dataset in Italian. The corpus was collected from six young Italian actors (3 male and 3 female) with a mean age of 27.1 years (no elderly actors were involved). Similarly to CREMA-D, in the experimental protocol 14 utterances had to be performed by the actors while simulating different emotional states. In particular, for each utterance, 7 affective states were considered: neutral, disgust, fear, anger, joy, surprise and sadness.
The total number of utterances collected in the dataset is 588, with a mean of 98 signals per actor. More details about EMOVO can be found in [49]. The main information about these two datasets is summarized in Table 1.

Table 1: Summary of the main CREMA-D and EMOVO characteristics
Dataset Name | Type | Emotions Considered | Language | No. of Utterances | Tot. No. of Subjects | No. of Males | No. of Elderly | Mode
CREMA-D | Acted | happy, sad, anger, fear, disgust and neutral | English | 12 | 91 | 48 | 6 | Audio/visual
EMOVO | Acted | joy, surprise, sad, anger, fear, disgust and neutral | Italian | 14 | 6 | 3 | 0 | Audio

In both the selected datasets, the signals are labeled using the six basic emotions defined by Ekman. In order to use these datasets in our analysis, each emotion has been converted into its respective sentiment according to the mapping defined in [50]. In particular, we have considered anger, fear, disgust and sadness as negative sentiment, happy (or joy) as positive sentiment and neutral as neutral sentiment. All the EMOVO signals labeled as “surprise” have instead been excluded from the analysis, as they are difficult to map into a single sentiment class [50]. The distribution of the utterances over the three sentiment classes is shown in Figure 1 for the two datasets considered.

Figure 1: Number of utterances mapped as negative, positive or neutral in the two datasets CREMA-D (left) and EMOVO (right).

Concerning sentiment analysis, two other datasets are usually adopted in speech sentiment recognition research: the Multimodal EmotionLines Dataset (MELD) [50] and the CMU Multimodal Opinion Sentiment and Emotion Intensity dataset (CMU-MOSEI) [51]. The first [50] is a corpus composed of more than 13000 utterances from 1433 dialogues of the TV series Friends, labeled with three sentiment classes: negative, positive and neutral. CMU-MOSEI [51], instead, contains 23453 annotated video clips on 250 different topics, gathered from online video sharing websites and labeled with sentiment on a Likert scale. Although both datasets are directly labeled with sentiment, they were excluded from our analysis. MELD was discarded due to the presence, in several audio clips, of laugh tracks or multiple voices overlapping the main actor's speech; this makes the audio signal very noisy and makes it difficult to identify which part of the audio relates to the labelled sentiment. CMU-MOSEI, instead, was excluded because the subjects' ages are not available, which makes it impossible to separate the signals collected from elderly speakers from those collected from young or adult ones.

3. Data Adaptation strategies

The proposed analysis considers two research hypotheses:

• Domain adaptation based on age: training a general speech sentiment recognition model on speech data collected from English-speaking young and adult subjects and adapting this model to new data collected from English-speaking elderly subjects.
• Domain adaptation based on language: refining a speech sentiment recognition model pre-trained on English-speaking young and adult subjects so that it recognizes new data collected from Italian young and adult speakers.
In all the experiments performed, the gradient boosted decision trees algorithm implemented in XGBoost [52] has been selected as the classification model, while two different instance-weighting domain adaptation strategies have been tested:

• the Kullback-Leibler Importance Estimation Procedure (KLIEP) [53], which assigns a weight to the training instances during the classifier learning task in order to minimize the Kullback-Leibler divergence between the training and target distributions. In our analysis we have considered the supervised implementation of this algorithm with an RBF kernel and two different gamma values: 0.1 and 1.
• the Transfer AdaBoost for Classification (TrAdaBoost) algorithm [54], a supervised domain adaptation strategy that extends boosting-based learning to the field of transfer learning. At each iteration, the algorithm trains a new weak classifier, giving less importance to the training instances poorly predicted in previous iterations while emphasising the target samples correctly recognized. The final model is the combination of the last half of the computed estimators, weighted according to their relevance. The number of iterations selected in our experiments is 10.

The application of these two strategies requires splitting the data into three distinct sets: i) a Training (or Source) set, made up of a large amount of labeled data used to train the general model; ii) a Target set, consisting of few samples belonging to a new but related domain, used to adapt the general model to this new data distribution; and iii) a Test set, composed of data similar to the Target set and used to evaluate the model performance.

Figure 2: Pipeline used in the analysis for extracting features from the signals of the Training and Target datasets. For the Test set, the data augmentation step is not applied.

In our experiments, the definition of these three sets changes according to the research hypothesis considered. In particular, in the multi-age analysis, the data of CREMA-D-ADULT have been used as Training set, while Target and Test sets have been defined as subsets of CREMA-D-ELD. In the multi-language analysis, instead, the general model is trained on CREMA-D-ADULT data, while Target and Test sets are both defined as partitions of the EMOVO data. Different validation strategies have been tested to partition the data of CREMA-D-ELD and EMOVO into Target and Test sets:

• Leave One Subject Out (LOSO) cross-validation, where the folds are partitioned by subject: at each iteration, all the data of a single subject are used as Test set, while the data of the remaining subjects are used as Target set.
• Leave One Utterance Out (LOUO) cross-validation, where the folds are defined by the pronounced utterance: at each iteration, all the data related to a single utterance are used to test the model, while the data of the remaining utterances are used as Target set.

To assess the performance of our classification models, several well-known evaluation metrics are computed [55], including accuracy, the single-class F1-scores, evaluated as the harmonic mean of single-class precision and recall, and the macro F1-score [56], computed as the unweighted mean of the single-class F1-scores.
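The paper does not include code, so the following minimal Python sketch is only an illustration of the TrAdaBoost-style instance reweighting and of the weighted vote over the last half of the estimators described above, using XGBoost weak learners. It is a simplified reconstruction under stated assumptions (a multiclass voting extension of the original binary algorithm; hypothetical variable names such as X_adult, X_target, X_test), not the authors' implementation; the KLIEP variant would instead estimate importance weights that minimize the KL divergence between source and target feature distributions before fitting a single classifier.

```python
# Illustrative sketch (not the authors' code): a TrAdaBoost-style instance-weighting
# loop with XGBoost weak learners, evaluated with macro F1 on a held-out test fold.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score  # used in the usage example below


def tradaboost_fit(X_src, y_src, X_tgt, y_tgt, n_iter=10):
    """Train weak XGBoost learners, down-weighting poorly predicted source
    instances and up-weighting misclassified target instances (Dai et al., 2007)."""
    n_src = len(X_src)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.ones(len(X))                                  # instance weights
    beta_src = 1.0 / (1.0 + np.sqrt(2 * np.log(n_src) / n_iter))
    learners, betas = [], []
    for _ in range(n_iter):
        p = w / w.sum()                                  # normalized weights
        clf = XGBClassifier(n_estimators=50, max_depth=3)
        clf.fit(X, y, sample_weight=p)
        miss = (clf.predict(X) != y).astype(float)
        # weighted error computed on the target portion only
        eps = np.sum(p[n_src:] * miss[n_src:]) / np.sum(p[n_src:])
        eps = min(max(eps, 1e-10), 0.49)                 # keep beta_t in (0, 1)
        beta_t = eps / (1.0 - eps)
        # decrease weights of misclassified source samples,
        # increase weights of misclassified target samples
        w[:n_src] *= beta_src ** miss[:n_src]
        w[n_src:] *= beta_t ** (-miss[n_src:])
        learners.append(clf)
        betas.append(beta_t)
    return learners, betas


def tradaboost_predict(learners, betas, X, classes):
    """Weighted vote of the last half of the learners, as described in the text."""
    half = len(learners) // 2
    votes = np.zeros((len(X), len(classes)))
    for clf, beta_t in zip(learners[half:], betas[half:]):
        pred = clf.predict(X)
        for c_idx, c in enumerate(classes):
            votes[pred == c, c_idx] += np.log(1.0 / beta_t)
    return np.array(classes)[votes.argmax(axis=1)]


# Hypothetical usage: X_adult/y_adult are the source features (CREMA-D-ADULT),
# X_target/y_target and X_test/y_test come from one LOSO or LOUO fold of the
# elderly (or EMOVO) data; labels are 0=negative, 1=neutral, 2=positive.
# learners, betas = tradaboost_fit(X_adult, y_adult, X_target, y_target, n_iter=10)
# y_pred = tradaboost_predict(learners, betas, X_test, classes=[0, 1, 2])
# print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```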
4. Model input data

To apply the domain adaptation strategies described in the previous section, preprocessing, feature extraction and data augmentation to balance the classes have been performed on the raw data. The whole process is depicted in Figure 2.

4.1. Preprocessing

The audio signals of each dataset are preprocessed to extract only the information concerning the target speaker's voice. In particular, the audio clips were first converted from stereo to mono by averaging the samples across the two channels. Then, each signal was filtered using a band-pass Butterworth filter with a lower cutoff frequency of 300 Hz and an upper cutoff frequency of 3000 Hz to remove the spectral components outside the voice frequency range [57].

4.2. Feature extraction

From the pre-processed signals, the eGeMAPS acoustic feature set was extracted using the Python library implementation of the openSMILE toolkit [58]. The eGeMAPS feature set (extended Geneva Minimalistic Acoustic Parameter Set) [59] is a set of audio features proposed for affective analysis of voice signals. It consists of 25 Low Level Descriptors (LLD), including energy, frequency, cepstral, spectral and dynamic parameters. In order to summarize the variation of these parameters over the time windows, high-level functional features are extracted using statistical functions such as the arithmetic mean, standard deviation or percentiles. Applying these statistics, a total of 88 features have been extracted for each considered signal. The extracted features have been normalized by z-scoring in order to reduce inter-signal differences.

4.3. Data Augmentation

Only for the Training (or Source) and Target sets, the feature extraction step has been followed by data augmentation. In both sets, the cardinality of the negative sentiment class is four times greater than that of the positive and neutral ones. This is due to the imbalance between the number of emotions mapped as negative (anger, fear, sadness, disgust) and the number of emotions mapped as positive (happy) and neutral (neutral) in the selected emotion-sentiment transformation. In order to obtain more balanced classes, a two-step procedure has been applied to training and target data according to the experiment considered. First, the majority class has been under-sampled, randomly discarding half of the negative instances; the discarded elements have been selected so as to keep the number of elements of each negative emotion balanced. Then, an oversampling strategy based on the SMOTE algorithm [60] has been applied to increase the number of samples in the two minority classes (positive and neutral). SMOTE (Synthetic Minority Oversampling TEchnique) [60] is an oversampling method that randomly generates new synthetic data for the minority classes starting from the original data points. In particular, at each iteration, the algorithm selects one of the k-nearest neighbors of a random minority class element and creates a new artificial element by linearly interpolating the two instances using a random number between zero and one. The procedure is repeated until the cardinality of the classes is balanced.
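As a concrete illustration of this pipeline, the sketch below shows how the preprocessing, eGeMAPS extraction and class balancing steps could be implemented in Python. It is not the authors' code: the filter order, the eGeMAPSv02 configuration (which yields the 88 functionals mentioned above), per-feature z-scoring and the purely random under-sampling are all assumptions, and in practice the normalization statistics should be fitted on the training data only.

```python
# Illustrative sketch (not the authors' code) of the preprocessing and feature
# extraction pipeline: stereo-to-mono averaging, 300-3000 Hz Butterworth band-pass,
# eGeMAPS functionals via the openSMILE Python package, and class balancing.
import numpy as np
import soundfile as sf
from scipy.signal import butter, filtfilt
import opensmile
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE


def preprocess(path):
    """Load an audio clip, convert it to mono and band-pass filter it to 300-3000 Hz."""
    signal, sr = sf.read(path)
    if signal.ndim == 2:                                  # stereo -> mono by averaging
        signal = signal.mean(axis=1)
    b, a = butter(4, [300, 3000], btype="bandpass", fs=sr)  # 4th order is an assumption
    return filtfilt(b, a, signal), sr


smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,          # 88 functional features
    feature_level=opensmile.FeatureLevel.Functionals,
)


def extract_features(paths):
    """Return an (n_clips, 88) matrix of z-scored eGeMAPS functionals."""
    rows = [smile.process_signal(*preprocess(p)).values[0] for p in paths]
    X = np.array(rows)
    return (X - X.mean(axis=0)) / X.std(axis=0)            # per-feature z-scoring


def balance(X, y, negative_label=0):
    """Halve the negative class, then oversample the minority classes with SMOTE."""
    n_neg = int(np.sum(y == negative_label))
    under = RandomUnderSampler(sampling_strategy={negative_label: n_neg // 2})
    X, y = under.fit_resample(X, y)
    return SMOTE().fit_resample(X, y)                      # grow positive/neutral to parity
```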
Table 2: Experiments carried out in the cross-age analysis, varying the domain adaptation strategy and the validation strategy. In all the experiments, the general model (an XGBoost classifier) has been trained on the CREMA-D-ADULT data, while different subsets of the CREMA-D-ELD dataset have been selected as Target and Test sets.

DA Strategy | Training Set (CREMA-D-ADULT) | Target Set (CREMA-D-ELD) | Test Set (CREMA-D-ELD) | Validation Strategy
No Domain Adaptation | all 85 young and adult subjects | no target set | 6 elderly x 12 utterances | Training/Test independent
KLIEP | all 85 young and adult subjects | 5 elderly x 12 utterances | 1 elderly x 12 utterances | LOSO
KLIEP | all 85 young and adult subjects | 11 utterances x 6 elderly | 1 utterance x 6 elderly | LOUO
TrAdaBoost | all 85 young and adult subjects | 5 elderly x 12 utterances | 1 elderly x 12 utterances | LOSO
TrAdaBoost | all 85 young and adult subjects | 11 utterances x 6 elderly | 1 utterance x 6 elderly | LOUO

5. Results and discussion

The aim of this work is to define a classification model able to automatically recognize three sentiment states (positive, neutral and negative) from acoustic features extracted from speech when different ages and languages are considered. In particular, two different experiments have been carried out to evaluate the research hypotheses described in Section 3: domain adaptation on elderly subjects and domain adaptation on language.

5.1. Domain adaptation on elderly

In the first analysis, a multi-age cross-corpus sentiment classification is considered. As described in Section 3, the two parts of the CREMA-D dataset have been used respectively as Training set (CREMA-D-ADULT) and as Target and Test sets (CREMA-D-ELD). For each domain adaptation strategy, two different evaluation methods are tested: LOSO and LOUO. The results achieved in these experiments are compared with the performance reached by the XGBoost model when no domain adaptation strategy is applied; in this case, the classifier is trained on CREMA-D-ADULT data and tested on the independent dataset CREMA-D-ELD. The classification settings considered in the analysis are summarized in Table 2. For each of these analyses, Table 3 reports the classification performance achieved by the XGBoost classifiers in terms of accuracy, macro F1-score and single-class F1-scores.

Table 3: Cross-age performance comparison using CREMA-D-ADULT as training set and CREMA-D-ELD as target and test set. The analyses are performed varying the domain adaptation strategy and the validation strategy. Three evaluation metrics are considered: macro F1-score, accuracy and single-class F1-scores.

Classifier | DA Strategy | Validation Strategy | Macro F1-score | Negative F1-score | Neutral F1-score | Positive F1-score | Accuracy
XGBoost | No Domain Adaptation | Training/Test independent | 62% | 0.79 | 0.55 | 0.51 | 70%
XGBoost | KLIEP | LOSO | 60% | 0.76 | 0.57 | 0.49 | 67%
XGBoost | KLIEP | LOUO | 60% | 0.76 | 0.56 | 0.47 | 67%
XGBoost | TrAdaBoost | LOSO | 62% | 0.79 | 0.56 | 0.52 | 70%
XGBoost | TrAdaBoost | LOUO | 62% | 0.78 | 0.55 | 0.52 | 69%

The results show that, in the case of the elderly, the use of domain adaptation techniques does not significantly increase the performance of the classification model with respect to the benchmark case without adaptation. A macro F1-score of 62%, in fact, is achieved both when TrAdaBoost and when no domain adaptation is applied. Lower performance is instead obtained using the KLIEP domain adaptation algorithm, with a macro F1-score close to 60%. Similar results are reached using both the LOSO and LOUO evaluation strategies.
From the per-class F1-score values it emerges that, in all the experiments performed, the Negative class is easier to recognize than the Neutral and Positive ones. This difference can be ascribed to the higher number of distinct instances in the negative class than in the other two classes, where several instances were artificially created by the SMOTE data augmentation strategy. From these preliminary results, it seems that domain adaptation does not increase the performance of the proposed SER model. This is probably related to several aspects. The elderly subjects considered here are actors and actresses, and are thus not significantly different from a population of young and adult persons. Moreover, there are only 6 elderly subjects, of whom only one is female. A more realistic dataset should be considered to properly verify this research question.

5.2. Domain adaptation on language

The second part of our study focused on speech sentiment recognition when multi-language corpora are taken into account. The trials carried out for this analysis are summarized in Table 4. Two different datasets were used: the English dataset CREMA-D-ADULT, used to train the model, and the Italian dataset EMOVO, used as Target and Test sets. Furthermore, as in the elderly analysis, the results obtained varying the domain adaptation technique (KLIEP and TrAdaBoost) and the validation strategy (LOSO, LOUO) were compared with the performance reached by the classification model trained without domain adaptation. The values of accuracy, macro F1-score and per-class F1-scores achieved in the different experiments are reported in Table 5.

Table 4: Experiments carried out in the cross-language analysis, varying the domain adaptation strategy and the validation strategy. In all the experiments the general model (an XGBoost classifier) has been trained on the CREMA-D-ADULT data, while different subsets of the EMOVO dataset have been selected as Target and Test sets.

DA Strategy | Training Set (CREMA-D-ADULT) | Target Set (EMOVO) | Test Set (EMOVO) | Validation Strategy
No Domain Adaptation | all 85 young and adult subjects | no target set | 6 subjects x 14 utterances | Training/Test independent
KLIEP | all 85 young and adult subjects | 5 subjects x 14 utterances | 1 subject x 14 utterances | LOSO
KLIEP | all 85 young and adult subjects | 13 utterances x 6 subjects | 1 utterance x 6 subjects | LOUO
TrAdaBoost | all 85 young and adult subjects | 5 subjects x 14 utterances | 1 subject x 14 utterances | LOSO
TrAdaBoost | all 85 young and adult subjects | 13 utterances x 6 subjects | 1 utterance x 6 subjects | LOUO

Table 5: Cross-language performance comparison using CREMA-D-ADULT as training set and EMOVO as target and test set. The analyses are performed varying the domain adaptation strategy and the validation strategy. Three evaluation metrics are considered: macro F1-score, accuracy and single-class F1-scores.

Classifier | DA Strategy | Validation Strategy | Macro F1-score | Negative F1-score | Neutral F1-score | Positive F1-score | Accuracy
XGBoost | No Domain Adaptation | Training/Test independent | 35% | 0.71 | 0.28 | 0.07 | 57%
XGBoost | KLIEP | LOSO | 33% | 0.64 | 0.19 | 0.16 | 48%
XGBoost | KLIEP | LOUO | 32% | 0.68 | 0.22 | 0.06 | 51%
XGBoost | TrAdaBoost | LOSO | 44% | 0.68 | 0.25 | 0.39 | 56%
XGBoost | TrAdaBoost | LOUO | 85% | 0.91 | 0.90 | 0.74 | 88%

From the analysis of the results, it emerges that the best performance with both validation strategies was obtained by applying the TrAdaBoost domain adaptation method.
In particular, the macro F1-score values of 44% and 85%, obtained using the LOSO and LOUO validation strategies respectively, outperform the value of 35% reached when no domain adaptation is considered. As in the elderly analysis, the lowest overall performance was instead reached applying the KLIEP domain adaptation strategy, with macro F1-score values close to 32% in both experiments. Another general consideration regards the recognition of the single classes. In almost all the trials, the use of domain adaptation techniques allowed the instances of the Positive class to be recognized better, often reaching more balanced classification performance across the three sentiments. Nevertheless, the Negative sentiment is still the class best recognized by all the classification models examined, confirming what was already observed in the elderly analysis. A final remark concerns the performance difference between the two validation strategies. In particular, partitioning Target and Test sets by utterance yields better results than partitioning by subject. This can be explained by the fact that, in addition to language, the division by utterance also accounts for differences between people in terms of personal vocal characteristics and of how they express their emotions. With this method, data from each of the analyzed subjects appear in each of the generated folds, allowing the classification model to better learn differences in vocal timbre or in the individuals' personalities. However, it is worth underlining that both the analyzed datasets are acted, which probably makes the way the same subject expresses the same emotion more similar across different sentences. For this reason, in future analyses, it may be necessary to validate the hypotheses proposed here on new natural datasets collected in real situations.

All the analyses were run on a computer with an Intel Core™ i7-7700HQ processor (2.80 GHz) and 16 GB of RAM. In both experiments, the proposed techniques take approximately 110 ms to extract features from and classify a new instance lasting about 2 seconds. No relevant variation in computational time was detected when different domain adaptation strategies were applied. The high execution speed of the algorithms could allow their integration into near real-time systems: streams of audio collected directly from the speaker might be divided into segments of about two seconds and sent to the algorithm for processing and classification, thus generating a response to the user with a delay of about two seconds. The implementation of such systems implies, however, that the classifier used in the process has already been trained and adapted to the new data. These operations are highly time-consuming and often require minutes of execution time. For this reason, the proposed domain adaptation techniques do not seem suitable for classifiers that continuously adapt to newly acquired data, appearing instead useful for near real-time systems based on cyclically updated pre-trained classifiers or on batch processing systems. Future analyses will be carried out in this regard.
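To make the near real-time scenario concrete, the following fragment (not part of the paper) sketches how an audio stream could be buffered into two-second segments and passed to an already trained and adapted classifier; `extract_window_features` and `model` are hypothetical placeholders for the eGeMAPS extraction step and the adapted XGBoost model.

```python
# Illustrative sketch (not from the paper): classify a live audio stream in
# near real time by buffering it into ~2 s segments, as discussed above.
import numpy as np

SR = 16_000                 # assumed sampling rate (Hz)
WINDOW = 2 * SR             # ~2-second analysis segment


def stream_sentiment(chunks, model, extract_window_features):
    """Consume an iterable of audio chunks and yield one sentiment label
    per complete two-second segment."""
    buffer = np.empty(0, dtype=np.float32)
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk])
        while len(buffer) >= WINDOW:
            segment, buffer = buffer[:WINDOW], buffer[WINDOW:]
            features = extract_window_features(segment, SR)    # 88 eGeMAPS functionals
            yield model.predict(features.reshape(1, -1))[0]    # negative/neutral/positive
```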
6. Conclusion

The speech sentiment recognition task is still an open field of research, especially when different languages and ages are considered. In particular, for the case of interest here, Italian elderly people, no datasets are available in the literature. Domain adaptation techniques could partially compensate for this lack of data. However, our preliminary results indicate the urgent need for a more realistic data collection, one that also addresses the need to consider different ages. Domain adaptation techniques seem to perform better in the case of cross-language datasets, paving the way for further research in this direction. In particular, after a proper data collection, future experiments could consider both language and local dialects, which are particularly widespread among the elderly population. Concerning the lack of performance increase when applying domain adaptation models in the multi-age corpus case, no conclusions can be drawn, due to the peculiarity of the available datasets (where the collected speech was recorded by professional actors) and the low number of elderly people. Finally, only audio signals have been taken into account in the present study. In recent years, the use of acoustic or textual features extracted from speech has often been paired with other data collected from the speakers. In particular, in several literature datasets, visual signals such as facial expressions or body movements, physiological signals or behavioural biometric data have been collected together with audio to enable a multimodal approach to Speech Emotion Recognition [6]. In future work, similar strategies could also be applied to elderly Italian people in order to create more robust and accurate cross-corpus emotional classifiers.

Acknowledgments

This research is supported by the FONDAZIONE CARIPLO “AMPEL: Artificial intelligence facing Multidimensional Poverty in ELderly” (Ref. 2020-0232).

References

[1] J. Lange, M. W. Heerdink, G. A. Van Kleef, Reading emotions, reading people: Emotion perception and inferences drawn from perceived emotions, Current Opinion in Psychology 43 (2022) 85–90. [2] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern recognition 44 (2011) 572–587. [3] M. Swain, A. Routray, P. Kabisatpathy, Databases, features and classifiers for speech emotion recognition: a review, International Journal of Speech Technology 21 (2018) 93–120. [4] C.-N. Anagnostopoulos, T. Iliou, I. Giannoukos, Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artificial Intelligence Review 43 (2015) 155–177. [5] X. Wu, Q. Zhang, Design of aging smart home products based on radial basis function speech emotion recognition, Frontiers in Psychology 13 (2022) 882709–882709. [6] M. B. Akçay, K. Oğuz, Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers, Speech Communication 116 (2020) 56–76. [7] D. J. France, R. G. Shiavi, S. Silverman, M. Silverman, M. Wilkes, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE transactions on Biomedical Engineering 47 (2000) 829–837. [8] A. Hua, D. J. Litman, K. Forbes-Riley, M. Rotaru, J. Tetreault, A. Purandare, Using system and user performance features to improve emotion detection in spoken tutoring dialogs, in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, volume 2, 2006, pp. 797–800. [9] D. Cevher, S. Zepf, R. Klinger, Towards multimodal emotion recognition in german speech events in cars using transfer learning, arXiv preprint arXiv:1909.02764 (2019). [10] R. Nakatsu, J.
Nicholson, N. Tosa, Emotion recognition and its application to computer agents with spontaneous interactive capabilities, in: Proceedings of the seventh ACM international conference on Multimedia (Part 1), 1999, pp. 343–351. [11] A. Alhargan, N. Cooke, T. Binjammaz, Multimodal affect recognition in an interactive gaming environment using eye tracking and speech signals, in: Proceedings of the 19th ACM international conference on multimodal interaction, 2017, pp. 479–486. [12] L.-S. A. Low, N. C. Maddage, M. Lech, L. B. Sheeber, N. B. Allen, Detection of clinical depression in adolescents’ speech during family interactions, IEEE Transactions on Biomedical Engineering 58 (2010) 574–586. [13] F. Al Machot, A. H. Mosa, K. Dabbour, A. Fasih, C. Schwarzlmüller, M. Ali, K. Kyamakya, A novel real-time emotion detection system from audio streams based on bayesian quadratic discriminate classifier for adas, in: Proceedings of the Joint INDS’11 & ISTET’11, IEEE, 2011, pp. 1–5. [14] P. Gupta, N. Rajput, Two-stream emotion recognition for call center monitoring, in: Eighth Annual Conference of the International Speech Communication Association, Citeseer, 2007. [15] C. Vaudable, L. Devillers, Negative emotions detection as an indicator of dialogs quality in call centers, in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2012, pp. 5109–5112. [16] F. Hegel, T. Spexard, B. Wrede, G. Horstmann, T. Vogt, Playing a different imitation game: Interaction with an empathic android robot, in: 2006 6th IEEE-RAS International Conference on Humanoid Robots, IEEE, 2006, pp. 56–61. [17] C. Jones, A. Deeming, Affective human-robotic interaction, in: Affect and emotion in human-computer interaction, Springer, 2008, pp. 175–185. [18] M. S. Fahad, A. Ranjan, J. Yadav, A. Deepak, A survey of speech emotion recognition in natural environment, Digital Signal Processing 110 (2021) 102951. [19] B. T. Atmaja, A. Sasou, M. Akagi, Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion, Speech Communication (2022). [20] A. Thakur, S. Dhull, Speech emotion recognition: A review, Advances in Communication and Computational Technology (2021) 815–827. [21] M. N. Stolar, M. Lech, R. S. Bolia, M. Skinner, Real time speech emotion recognition using rgb image classification and transfer learning, in: 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS), IEEE, 2017, pp. 1–8. [22] G. Boateng, T. Kowatsch, Speech emotion recognition among elderly individuals using mul- timodal fusion and transfer learning, in: Companion Publication of the 2020 International Conference on Multimodal Interaction, 2020, pp. 12–16. [23] M. Swain, S. Sahoo, A. Routray, P. Kabisatpathy, J. N. Kundu, Study of feature combination using hmm and svm for multilingual odiya speech emotion recognition, International Journal of Speech Technology 18 (2015) 387–393. [24] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross lingual speech emotion recognition: Urdu vs. western languages, in: 2018 International Conference on Frontiers of Information Technology (FIT), IEEE, 2018, pp. 88–93. [25] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive review of speech emotion recognition systems, IEEE Access 9 (2021) 47795–47814. [26] M. Lugger, M.-E. Janoir, B. 
Yang, Combining classifiers with diverse feature sets for robust speaker independent emotion recognition, in: 2009 17th European Signal Processing Conference, IEEE, 2009, pp. 1225–1229. [27] B. Schuller, M. Lang, G. Rigoll, Robust acoustic speech emotion recognition by ensembles of classifiers, in: Tagungsband Fortschritte der Akustik-DAGA# 05, München, 2005. [28] A. M. Badshah, J. Ahmad, N. Rahim, S. W. Baik, Speech emotion recognition from spec- trograms with deep convolutional neural network, in: 2017 international conference on platform technology and service (PlatCon), IEEE, 2017, pp. 1–5. [29] K. Aghajani, I. Esmaili Paeen Afrakoti, Speech emotion recognition using scalogram based deep structure, International Journal of Engineering 33 (2020) 285–292. [30] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, N. Dehak, Deep neural networks for emotion recognition combining audio and transcripts, arXiv preprint arXiv:1911.00432 (2019). [31] H. S. Kumbhar, S. U. Bhandari, Speech emotion recognition using mfcc features and lstm network, in: 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA), IEEE, 2019, pp. 1–3. [32] B. T. Atmaja, M. Akagi, Speech emotion recognition based on speech segment using lstm with attention model, in: 2019 IEEE International Conference on Signals and Systems (ICSigSys), IEEE, 2019, pp. 40–44. [33] Q. Jian, M. Xiang, W. Huang, A speech emotion recognition method for the elderly based on feature fusion and attention mechanism, in: Third International Conference on Electronics and Communication; Network and Computer Technology (ECNCT 2021), volume 12167, SPIE, 2022, pp. 398–403. [34] M. Neumann, N. T. Vu, Improving speech emotion recognition with unsupervised represen- tation learning on unlabeled speech, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7390–7394. [35] S. G. Koolagudi, K. S. Rao, Emotion recognition from speech: a review, International journal of speech technology 15 (2012) 99–117. [36] F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the recola multimodal corpus of remote collaborative and affective interactions, in: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), IEEE, 2013, pp. 1–8. [37] S. Steidl, Automatic classification of emotion related user states in spontaneous children’s speech, Logos-Verlag Berlin, Germany, 2009. [38] W. Fan, X. Xu, X. Xing, W. Chen, D. Huang, Lssed: a large-scale dataset and benchmark for speech emotion recognition, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 641–645. [39] D. Morrison, R. Wang, L. C. De Silva, Ensemble methods for spoken emotion recognition in call-centres, Speech communication 49 (2007) 98–112. [40] E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, B. Schuller, Categorical vs dimensional perception of italian emotional speech (2018). [41] V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, A. Nogueiras, Interface databases: Design and collection of a multilingual emotional speech database., in: LREC, 2002. [42] D. Deliyski, Steve An Xue, Effects of aging on selected acoustic voice parameters: Pre- liminary normative data and educational implications, Educational gerontology 27 (2001) 159–168. [43] J. Sundberg, M. N. Thörnvik, A. M. 
Söderström, Age and voice quality in professional singers, Logopedics Phoniatrics Vocology 23 (1998) 169–176. [44] D. Verma, D. Mukhopadhyay, Age driven automatic speech emotion recognition system, in: 2016 International Conference on Computing, Communication and Automation (ICCCA), IEEE, 2016, pp. 1005–1010. [45] G. Soğancıoğlu, O. Verkholyak, H. Kaya, D. Fedotov, T. Cadée, A. A. Salah, A. Karpov, Is everything fine, grandma? acoustic and linguistic modeling for robust elderly speech emotion recognition, arXiv preprint arXiv:2009.03432 (2020). [46] B. W. Schuller, A. Batliner, C. Bergler, E.-M. Messner, A. Hamilton, S. Amiriparian, A. Baird, G. Rizos, M. Schmitt, L. Stappen, et al., The interspeech 2020 computational paralinguistics challenge: Elderly emotion, breathing & masks (2020). [47] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, R. Verma, Crema-d: Crowd- sourced emotional multimodal actors dataset, IEEE transactions on affective computing 5 (2014) 377–390. [48] M. K. Pichora-Fuller, K. Dupuis, Toronto emotional speech set (TESS), 2020. URL: https: //doi.org/10.5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF . [49] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco, Emovo corpus: an italian emotional speech database, in: International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), 2014, pp. 3501–3504. [50] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, Meld: A mul- timodal multi-party dataset for emotion recognition in conversations, arXiv preprint arXiv:1810.02508 (2018). [51] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, L.-P. Morency, Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246. [52] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. [53] M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, M. Kawanabe, Direct importance estimation with model selection and its application to covariate shift adaptation, Advances in neural information processing systems 20 (2007). [54] W. Dai, Q. Yang, G.-R. Xue, Y. Yu, Boosting for transfer learning, volume 227, 2007, pp. 193–200. doi:10.1145/1273496.1273521 . [55] M. Grandini, E. Bagli, G. Visani, Metrics for multi-class classification: an overview, arXiv preprint arXiv:2008.05756 (2020). [56] Z. C. Lipton, C. Elkan, B. Naryanaswamy, Optimal thresholding of classifiers to maximize f1 measure, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 225–239. [57] B. Birch, C. Griffiths, A. Morgan, Environmental effects on reliability and accuracy of mfcc based voice recognition for industrial human-robot-interaction, Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture 235 (2021) 1939–1948. [58] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462. [59] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. 
Narayanan, et al., The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing, IEEE transactions on affective computing 7 (2015) 190–202. [60] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.