<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment recognition of Italian elderly through domain adaptation on cross-corpus speech dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Gasparini</string-name>
          <email>francesca.gasparini@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Grossi</string-name>
          <email>alessandra.grossi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>Systems and Communications</addr-line>
          ,
          <institution>University of Milano - Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The aim of this work is to define a speech emotion recognition (SER) model able to recognize positive, neutral and negative emotions in natural conversations of Italian elderly people. Several datasets for SER are available in the literature. However, most of them are in English or Chinese and were recorded while actors and actresses pronounced short phrases, so they do not reflect natural conversation. Moreover, only a few recordings across all the databases come from elderly people. Therefore, in this work, a multi-language and multi-age corpus is considered, merging a dataset in English, which also includes elderly people, with a dataset in Italian. A general model based on XGBoost, trained on young and adult English actors and actresses, is proposed. Then two domain adaptation strategies are proposed to adapt the model both to elderly people and to Italian speakers. The results suggest that this approach increases the classification performance, while also underlining that new datasets should be collected.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech emotion recognition</kwd>
        <kwd>Sentiment recognition</kwd>
        <kwd>Domain adaptation</kwd>
        <kwd>cross-corpus SER</kwd>
        <kwd>cross-</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Emotions play a relevant role in defining individuals’ behaviours and coordination in
human-human interactions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, humans find spoken conversation a more natural and
effective way to express themselves than its written form [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. During conversations, people
try to convey their thoughts not only by words but also by bodily, vocal or facial expressions
[
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]. Specifically, in vocal expressions the affective state of individuals is expressed both by the
linguistic and acoustic information carried by the speech [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For instance, the same sentence
said with different intonations can express different emotions of the speaker and, thus, can lead
to a different response from the listener [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Therefore, in order to create a natural interaction
between humans and computers, the machine must be able to understand emotions from the
speaker’s voice and adapt accordingly. Speech Emotion Recognition (SER) is the
task of processing and classifying speech signals in order to recognize the emotional state of
the speaker [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Systems based on SER have different fields of application, such as health
care [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], e-learning tutoring [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], automotive [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or entertainment [
        <xref ref-type="bibr" rid="ref10">10, 11</xref>
        ]. In particular, these
kinds of systems can be employed to build diagnostic tools that help therapists
detect psychological disorders [12] or to automatically recognise alterations of mental state
in drivers [13]. Automatic emotion detection systems can also be used in call centers or
mobile communications to detect the emotions of callers and to help agents improve the
quality of service [14, 15], or in human-robot interactions to support a more natural and social
communication between human and machine [16, 17].
      </p>
      <p>Several studies have been carried out in the field of Speech Emotion Recognition during the
last three decades [18]. In particular, many of these analyses consider only
one of the linguistic or acoustic information of speech, while more recent analyses examine a multi-modal
approach [19].</p>
      <p>
        In our study, we focus only on acoustic information. In this field, both traditional machine
learning and deep learning approaches have been taken into account in previous literature. In
general, the traditional pipeline of a SER system consists of three steps: signal preprocessing,
feature extraction and classification [20]. Concerning feature extraction, different sets of
features have been tested: traditional features extracted from audio signals [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], including prosodic
(such as pitch, energy and duration), spectral (such as fundamental frequency, Mel Frequency
Cepstral Coefficients or Linear Prediction Cepstral Coefficients) and voice quality features (such
as jitter or shimmer), as well as deep features extracted by pre-trained networks. In the latter case,
the audio signals are usually represented as spectrograms or scalograms and used as input to a
pre-trained network to extract features [21, 22]. With reference to classifiers, several studies
such as [23, 24] have employed traditional classifiers. In particular, according to [25], the
classical classification techniques preferred in SER systems are Gaussian Mixture Models, Hidden
Markov Models, Artificial Neural Networks, Decision Trees and Support Vector Machines. In
a few analyses [26, 27] ensemble techniques combining several classifiers have also been tested.
Deep approaches have also been considered in recent years. In particular, frameworks using
Convolutional Neural Networks (CNN) [28], Recurrent Neural Networks (RNN) [29] and Long
Short-Term Memory networks (LSTM) [30] have been evaluated, using both traditional features
[31] and raw audio signals. In some cases, attention mechanisms [32, 33] or autoencoders
[34] have also been added to the classifiers in order to increase performance. The main SER approaches
have been summarized in review manuscripts such as [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or [25].
      </p>
      <p>
        Despite the large number of analyses carried out, there are still numerous issues that make
it difficult to recognize emotions in speech. In [18] some of these challenges and the approaches
tested so far to solve them are summarized. In particular, speech emotion recognition algorithms
struggle to recognize emotions when people of different languages or ages are considered.
In the literature there are many datasets collected for SER purposes. These corpora can be classified
into three groups with reference to how emotional speech is generated [35]: i) Acted datasets,
where the data are collected from actors/actresses who try to simulate emotions; ii) Evoked or
Elicited datasets, where the subjects are involved in situations specifically created to evoke or
induce certain emotions; and iii) Spontaneous or Natural datasets, which contain more authentic
emotions as they are collected from real-world situations like call centers or public places [18]. Most of
the datasets available in the literature are composed of recited speeches [36], while only a few
of them consider natural conversations [37, 38, 39]. Moreover, the considered languages are
mainly English and Chinese. It has been demonstrated that language has a strong influence on
how emotions are expressed [24], and thus multi-language datasets have been proposed [40, 41].
Age is another factor that influences the acoustic characteristics of the voice, especially in the
case of the elderly [
        <xref ref-type="bibr" rid="ref11">42, 43</xref>
        ]. However, this is still an open field of research: few works face the
problem of SER for the elderly, or across varying ages [
        <xref ref-type="bibr" rid="ref12 ref13">22, 33, 44, 45</xref>
        ], and old subjects are rarely
present in available datasets [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">46, 47, 48, 38</xref>
        ].
      </p>
      <p>In this work we address the problem of SER for elderly Italian people, focusing
on positive, neutral and negative emotions. We propose a multi-language,
multi-age approach based on a cross-corpus dataset, described in Section 2. We start from a
general model trained on an English dataset of young and adult subjects, and we refine this model
to adapt it both to elderly speakers and to the Italian language, as described in Section 3, adopting two different
domain adaptation techniques. Section 4 presents the preprocessing of raw data, the feature extraction and
the data augmentation needed to apply the proposed solutions. The results, discussed
in Section 5, underline the potential and the limits of the proposed approaches, while future
perspectives are drawn in the Conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Cross-corpus dataset</title>
      <p>
        In this work, we consider two datasets available in the literature, labeled with emotions, and
characterized by the presence of elderly subjects or by the presence of Italian sentences: the
CRowd-sourced Emotional Multimodal Actors dataset CREMA-D [
        <xref ref-type="bibr" rid="ref15">47</xref>
        ] and EMOVO [
        <xref ref-type="bibr" rid="ref17">49</xref>
        ].
CREMA-D [
        <xref ref-type="bibr" rid="ref15">47</xref>
        ] is a free audio-visual dataset collected to investigate facial and vocal expressions
and perception of acted emotions. It consists of 7442 audio and video recordings of professional
actors playing 12 utterances, each expressed in six emotional states (happy, sad, anger, fear,
disgust and neutral) at different intensity levels. In the first utterance, the actors were directed
to simulate each emotion at three levels of intensity (low, medium and high) while, for the
other eleven sentences, they were free to express the emotion at their preferred intensity. The
sentences selected for the experiment are in English and have a neutral semantic content. In
total, 48 actors and 43 actresses of different ages and ethnicities were involved in the experiments,
including 6 elderly subjects over 60 years of age and 85 adults aged between 20 and 59 years. For
the purpose of our analysis, the two groups of subjects are considered separately, with a total of
492 signals for the elderly, named hereinafter CREMA-D-ELD, and 6950 signals for adults
(CREMA-D-ADULT). For further details of the CREMA-D dataset, please refer to the reference manuscript [
        <xref ref-type="bibr" rid="ref15">47</xref>
        ].
EMOVO [
        <xref ref-type="bibr" rid="ref17">49</xref>
        ] is a free, acted emotional speech dataset in the Italian language. The
corpus was collected from six young Italian actors (3 male and 3 female) with a mean age of
27.1 (no elderly actors were involved). Similarly to CREMA-D, in the experimental protocol 14
utterances had to be performed by the actors simulating different emotional states. In particular,
for each utterance, 7 affective states were considered: neutral, disgust, fear, anger, joy, surprise
and sadness. The total number of utterances collected in the dataset is 588, with a mean of 98
signals per actor. More details about EMOVO can be found in [
        <xref ref-type="bibr" rid="ref17">49</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Table 1 summarizes the main information about these two datasets.</title>
        <p>
          In both the selected datasets, the signals are labeled using the six basic emotions defined by
Ekman. In order to use these datasets in our analysis, each emotion has been converted into
its respective sentiment according to the mapping defined in [
          <xref ref-type="bibr" rid="ref18">50</xref>
          ]. In particular, we have
considered anger, fear, disgust and sadness as negative sentiments, happy (or joy) as positive
sentiment and neutral as neutral sentiment. All the EMOVO signals labeled as “surprise” have
instead been excluded from the analysis, as they are difficult to map into a single sentiment class
[
          <xref ref-type="bibr" rid="ref18">50</xref>
          ]. The distribution of the utterances in the three sentiment classes is shown in Figure 1 for
the two datasets considered.
        </p>
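        <p>The emotion-to-sentiment conversion described above can be sketched as a simple lookup. This is only an illustrative sketch: the label spellings and the helper name to_sentiment are assumptions, not identifiers taken from the datasets or from the authors' code.</p>
        <preformat><![CDATA[
```python
# Emotion-to-sentiment mapping as described in the text.
# Label spellings are assumptions made for this sketch.
EMOTION_TO_SENTIMENT = {
    "anger": "negative",
    "fear": "negative",
    "disgust": "negative",
    "sadness": "negative",
    "sad": "negative",      # alternative spelling of the same emotion
    "happy": "positive",
    "joy": "positive",      # "happy (or joy)" in the text
    "neutral": "neutral",
}


def to_sentiment(labelled_signals):
    """Map (signal_id, emotion) pairs to (signal_id, sentiment) pairs.

    Emotions without an entry in the mapping (e.g. "surprise") are dropped.
    """
    mapped = []
    for signal_id, emotion in labelled_signals:
        sentiment = EMOTION_TO_SENTIMENT.get(emotion.lower())
        if sentiment is not None:
            mapped.append((signal_id, sentiment))
    return mapped


example = [("a", "anger"), ("b", "joy"), ("c", "surprise"), ("d", "Neutral")]
print(to_sentiment(example))
```
]]></preformat>
        <p>Because four emotions map to the negative class and only one each to the positive and neutral classes, a mapping of this kind naturally produces an imbalanced class distribution.</p>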
        <p>
          Concerning sentiment analysis, two other datasets are usually adopted in Speech Sentiment
Recognition research: the Multimodal EmotionLines Dataset (MELD) [
          <xref ref-type="bibr" rid="ref18">50</xref>
          ] and CMU Multimodal
Opinion Sentiment and Emotion Intensity (CMU-MOSEI) [
          <xref ref-type="bibr" rid="ref19">51</xref>
          ]. The first [
          <xref ref-type="bibr" rid="ref18">50</xref>
          ] is a data corpus
composed of more than 13000 utterances from 1433 dialogues from the TV series Friends,
labeled with three sentiment classes: negative, positive and neutral. CMU-MOSEI [
          <xref ref-type="bibr" rid="ref19">51</xref>
          ], instead,
contains 23453 annotated video clips on 250 different topics, gathered from online video
sharing websites and labeled with sentiment on a Likert scale. Although both datasets are
directly labeled with sentiment, they were excluded from our analysis. In particular,
MELD has been discarded due to the presence, in several audio clips, of laugh tracks or
multiple voices overlapping the main actor’s speech. This makes the audio signal very noisy
and makes it difficult to identify which part of the audio is related to the labelled sentiment.
CMU-MOSEI, instead, has been excluded from the study because
the subjects’ ages are not provided, which makes it impossible to separate signals collected from elderly
speakers from those collected from young or adult speakers.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data Adaptation strategies</title>
      <sec id="sec-3-1">
        <title>The proposed analysis considers two research hypotheses:</title>
        <p>• Domain adaptation based on age, training a general Speech Sentiment Recognition model
using speech data collected from young and adult English subjects and adapting this
model to new data collected from elderly English subjects.
• Domain adaptation based on language, refining a Speech Sentiment
Recognition model pre-trained on young and adult English subjects to recognize new data collected
from young and adult Italian subjects.</p>
        <p>
          In all the experiments performed, the gradient boosted decision trees algorithm implemented as
XGBoost [
          <xref ref-type="bibr" rid="ref20">52</xref>
          ] has been selected as the classification model, while two different instance-weighting
domain adaptation strategies have been tested:
• the Kullback-Leibler Importance Estimation Procedure (KLIEP) [
          <xref ref-type="bibr" rid="ref21">53</xref>
          ], which assigns a
weight to the training instances during the classifier learning task in order to minimize
the Kullback-Leibler divergence between the training and target distributions. In our analysis
we have considered the supervised implementation of this algorithm using an “rbf” kernel
with two different gamma values: 0.1 and 1.
• the Transfer AdaBoost for Classification (TrAdaBoost) [
          <xref ref-type="bibr" rid="ref22">54</xref>
          ], a supervised domain
adaptation strategy that extends boosting-based learning algorithms to the field of transfer
learning. In particular, at each iteration, the algorithm trains a new weak classifier giving
less importance to the training instances poorly predicted in previous iterations while
emphasising the target samples that were misclassified. The final model is the combination
of the last half of the computed estimators, weighted according to their relevance. The number
of iterations selected in our experiments is 10.
        </p>
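        <p>To make the instance-weighting idea concrete, the following is a minimal sketch of a single reweighting step in the style of the original binary TrAdaBoost formulation, in which misclassified source (training) instances lose weight and misclassified target instances gain weight. It is not the implementation used in the experiments: the function name, the 0/1 error encoding and the clipping of the error rate are assumptions, and the per-iteration weight normalization and the multi-class handling are omitted.</p>
        <preformat><![CDATA[
```python
import math


def tradaboost_reweight(w_src, w_tgt, miss_src, miss_tgt, n_iterations):
    """One reweighting step in the style of binary TrAdaBoost (sketch).

    w_src, w_tgt       : current weights of source and target instances
    miss_src, miss_tgt : 1 if the current weak learner misclassified the
                         instance, 0 otherwise
    n_iterations       : total number of boosting iterations (10 in the text)
    """
    # Weighted error of the weak learner, measured on the target instances only.
    eps = sum(w * m for w, m in zip(w_tgt, miss_tgt)) / sum(w_tgt)
    eps = min(max(eps, 1e-10), 0.499)  # clipping is an assumption of this sketch
    beta_tgt = eps / (1.0 - eps)
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(w_src)) / n_iterations))

    # Misclassified source instances are down-weighted,
    # misclassified target instances are up-weighted.
    new_src = [w * beta_src ** m for w, m in zip(w_src, miss_src)]
    new_tgt = [w * beta_tgt ** (-m) for w, m in zip(w_tgt, miss_tgt)]
    return new_src, new_tgt


src, tgt = tradaboost_reweight(
    w_src=[1.0, 1.0, 1.0, 1.0],
    w_tgt=[1.0, 1.0, 1.0, 1.0],
    miss_src=[1, 0, 0, 0],
    miss_tgt=[0, 1, 0, 0],
    n_iterations=10,
)
print(src, tgt)
```
]]></preformat>
        <p>In this toy call, the first source instance ends with a weight below 1 and the second target instance with a weight above 1, while correctly classified instances keep their weights.</p>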
        <p>The application of these two strategies requires splitting the data into three distinct sets: i)
the Training (or Source) set, made up of a large amount of labeled data used to train the general
model; ii) the Target set, consisting of a few samples belonging to a new but related domain that are
used to adapt the general model to this new data distribution; and iii) the Test set, composed of
data similar to the Target set and used to evaluate the model performance. In our experiments,
the definition of these three sets changes according to the research hypothesis considered. In
particular, in the multi-age analysis, the data of CREMA-D-ADULT have been used as the Training set,
while the Target and Test sets have been defined as subsets of CREMA-D-ELD. In the
multi-language analysis, instead, the general model is trained on CREMA-D-ADULT data,
while the Target and Test sets are both defined as partitions of EMOVO data.</p>
        <p>Different validation strategies have been tested to partition the data of CREMA-D-ELD and
EMOVO into Target and Test sets:
• Leave One Subject Out (LOSO) Cross Validation strategy, where the folds are partitioned
according to subject and thus, at each iteration, all the data of a single subject are used as
Test set while the data of the remaining subjects are used as Target set.
• Leave One Utterance Out (LOUO) Cross Validation strategy, where the folds are defined
according to the pronounced utterances and thus, at each iteration, all the data related to a
single utterance are used to test the model while the data of the remaining utterances are
used as Target set.</p>
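        <p>Both validation strategies are instances of the same leave-one-group-out scheme and differ only in the grouping key. The sketch below illustrates this on a handful of hypothetical records; the record layout (subject, utterance, sentiment) and the function name are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
def leave_one_group_out(samples, group_key):
    """Yield (held_out_group, target_set, test_set) for each distinct group."""
    groups = sorted({group_key(s) for s in samples})
    for held_out in groups:
        test_set = [s for s in samples if group_key(s) == held_out]
        target_set = [s for s in samples if group_key(s) != held_out]
        yield held_out, target_set, test_set


# Hypothetical records: (subject_id, utterance_id, sentiment)
records = [
    ("s1", "u1", "negative"), ("s1", "u2", "positive"),
    ("s2", "u1", "neutral"), ("s2", "u2", "negative"),
    ("s3", "u1", "positive"), ("s3", "u2", "neutral"),
]

# LOSO: group by subject. LOUO: group by utterance.
loso_folds = list(leave_one_group_out(records, group_key=lambda r: r[0]))
louo_folds = list(leave_one_group_out(records, group_key=lambda r: r[1]))
print(len(loso_folds), len(louo_folds))
```
]]></preformat>
        <p>Note that with the utterance-based grouping every subject contributes data to the Target set of each fold, whereas with the subject-based grouping the test subject is never seen during adaptation.</p>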
        <p>
          To test the performances of our classification models, several well-known evaluation metrics are
computed [
          <xref ref-type="bibr" rid="ref23">55</xref>
          ] including the accuracy, single class F1-score, evaluated as the harmonic mean of
single class precision and recall, and macro F1-score [
          <xref ref-type="bibr" rid="ref24">56</xref>
          ] computed as the unweighted mean of
the single class F1-score.
        </p>
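        <p>As a small worked example of these metrics, the following sketch computes accuracy, per-class F1-score and macro F1-score on a toy set of predictions. The labels and predictions are invented for illustration and are not results from the experiments.</p>
        <preformat><![CDATA[
```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1: harmonic mean of per-class precision and recall."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


labels = ["negative", "neutral", "positive"]
y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "neutral", "neutral", "positive", "negative"]
print(accuracy(y_true, y_pred), f1_per_class(y_true, y_pred, labels),
      macro_f1(y_true, y_pred, labels))
```
]]></preformat>
        <p>Because the macro F1-score averages the classes without weighting them by size, it is not dominated by the majority class in the way accuracy can be.</p>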
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Model input data</title>
      <p>To apply the strategies of domain adaptation described in the previous section, preprocessing,
feature extraction, and data augmentation to balance the classes have been performed on raw
data. The whole process is depicted in Figure 2.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>
          The audio signals of each dataset are preprocessed to extract only the information concerning
the target speaker’s voice. In particular, the audio clips were first converted from stereo to
mono by averaging samples across the two channels. Then, each signal was filtered using a
band-pass Butterworth filter with lower cutoff frequency at 300 Hz and upper cutoff frequency
at 3000 Hz to remove the spectral components outside the voice frequency range [
          <xref ref-type="bibr" rid="ref25">57</xref>
          ].
        </p>
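        <p>A possible realization of these two preprocessing steps with NumPy and SciPy is sketched below. The filter order and the use of zero-phase filtering are not stated in the text and are assumptions of this sketch, as is the function name.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def preprocess(audio, sample_rate, low_hz=300.0, high_hz=3000.0, order=4):
    """Convert a clip to mono and band-pass it to the 300-3000 Hz band.

    audio : array of shape (n_samples,) or (n_samples, n_channels)
    order : Butterworth order; not given in the text, 4 is an assumption
    """
    audio = np.asarray(audio, dtype=float)
    if audio.ndim == 2:
        # Stereo to mono: average the samples across channels.
        audio = audio.mean(axis=1)
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    # Zero-phase (forward-backward) filtering; also an assumption.
    return sosfiltfilt(sos, audio)


# Toy check: a 1 kHz tone lies inside the pass band, a 50 Hz hum does not.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
stereo_tone = np.column_stack([tone, tone])
hum = np.sin(2 * np.pi * 50 * t)

filtered_tone = preprocess(stereo_tone, sr)
filtered_hum = preprocess(hum, sr)
```
]]></preformat>
        <p>Second-order sections are used here because they tend to be numerically more stable than transfer-function coefficients for band-pass designs.</p>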
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature extraction</title>
        <p>From the pre-processed signals, the eGeMAPS acoustic feature set was extracted using the
Python library implementation of the openSMILE toolkit [58]. The eGeMAPS feature set (extended
Geneva Minimalistic Acoustic Parameter Set) [59] is a set of audio features proposed for
affective analysis of voice signals. It consists of 25 Low Level Descriptor (LLD) features including
energy, frequency, cepstral, spectral and dynamic parameters. In order to summarize the
variation of these parameters over the time windows, high-level functional features are
extracted using statistical functions such as arithmetic mean, standard deviation or percentiles.
Applying these statistics, a total of 88 features have been extracted for each considered signal. The
extracted features have been normalized by z-scoring in order to reduce inter-signal differences.</p>
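        <p>The z-scoring step can be sketched as follows. The text does not specify over which axis the statistics are computed; this sketch assumes that each feature (column) is standardized across the signals (rows) using the population standard deviation, and the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
from statistics import mean, pstdev


def zscore_columns(feature_matrix):
    """Standardize each feature (column) across signals (rows).

    Assumption of this sketch: statistics are computed per feature, using the
    population standard deviation. Constant features are mapped to 0.
    """
    columns = list(zip(*feature_matrix))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in feature_matrix
    ]


# Three toy signals with three features each (real eGeMAPS vectors have 88).
features = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 30.0, 5.0],
]
normalized = zscore_columns(features)
print(normalized)
```
]]></preformat>
        <p>In a train/adapt/test setting, a common design choice is to estimate the means and standard deviations on the training data only and reuse them for the other sets; whether this was done in the experiments is not stated.</p>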
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Data Augmentation</title>
        <p>Only for the Training (or Source) and Target datasets, the feature extraction step has been followed
by data augmentation. In both datasets, the cardinality of the negative sentiment class is four
times greater than that of the positive or neutral ones. This is due to an imbalance between the number of
emotions mapped as negative (anger, fear, sadness, disgust) and the number of emotions mapped
as positive (happy) and neutral (neutral) in the selected emotion-sentiment transformation. In
order to create more balanced classes, a two-step procedure has been applied to the training
and target data according to the experiment considered. First, the majority class has been
under-sampled, randomly discarding half of the negative instances. In this process, the discarded
elements have been selected so as to keep the number of elements balanced across the negative
emotions. Then an oversampling strategy based on the SMOTE algorithm [60] has been applied
to increase the number of samples in the two minority classes (positive and neutral). SMOTE
(Synthetic Minority Oversampling TEchnique) [60] is an oversampling method that randomly
generates new synthetic data for the minority class starting from the original data points. In
particular, at each iteration, the algorithm selects one of the k-nearest neighbors of a random
minority class element and creates a new artificial element by linearly interpolating the two instances
using a random number between zero and one. The procedure is repeated until the cardinality
of the classes is balanced.</p>
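        <p>The interpolation step of SMOTE described above can be sketched in a few lines. This is an illustrative re-implementation rather than the library code used in the experiments; the value of k, the Euclidean distance and the function names are assumptions, and at least two minority instances are required.</p>
        <preformat><![CDATA[
```python
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point from a random minority element and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # random number between zero and one
    # Linear interpolation between the two instances.
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


def oversample(minority, target_size, k=5, seed=0):
    """Add synthetic points until the class reaches `target_size` elements."""
    rng = random.Random(seed)
    augmented = list(minority)
    while len(augmented) < target_size:
        augmented.append(smote_sample(minority, k, rng))
    return augmented


# Toy minority class in a two-dimensional feature space.
minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = oversample(minority_class, target_size=10, k=2, seed=0)
print(len(balanced))
```
]]></preformat>
        <p>Since every synthetic point lies on a segment between two real instances, the augmented class stays inside the region already covered by the original minority data, which is consistent with the lower variety of instances noted for the oversampled classes.</p>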
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <p>The aim of this work is to define a classification model able to automatically recognize three
sentiment states (positive, neutral and negative) using acoustic features extracted from speech
when different ages and languages are considered. In particular, two different experiments have
been carried out to evaluate the research hypotheses described in Section 3: domain adaptation
on elderly and domain adaptation on language.</p>
      <sec id="sec-5-1">
        <title>5.1. Domain adaptation on elderly</title>
        <p>In the first analysis, a multi-age corpus sentiment classification is considered. As described
in Section 3, the two parts of the CREMA-D dataset have been used respectively as Training set
(CREMA-D-ADULT) and as Target and Test sets (CREMA-D-ELD). For each domain adaptation
strategy, two different evaluation methods are tested: LOSO and LOUO. The results achieved in
these experiments are compared with the performance reached by the XGBoost model when no
domain adaptation strategy is applied. In this case, the classifier is trained on
CREMA-D-ADULT data and tested on the independent dataset CREMA-D-ELD. The classification settings
considered in the analysis are summarized in Table 2. For each of these analyses, Table 3 reports
the classification performance achieved by the XGBoost classifiers in terms of accuracy, macro
F1-score and single class F1-score. The results show that, in the case of the elderly, the use of domain
adaptation techniques does not significantly increase the performance of the classification
model with respect to the benchmark case without adaptation. A macro F1-score value of
62%, in fact, is achieved both with TrAdaBoost and with no domain adaptation. Lower
performance is instead obtained using the KLIEP domain adaptation algorithm, with an F1-score
value close to 60%. Similar results are reached using both LOSO and LOUO evaluation strategies.
The per-class F1-scores show that, in all the experiments
performed, the Negative class appears easier to recognize than the Neutral and Positive ones.
This difference may be due to the presence of a higher number of different instances in the
negative class than in the other two classes, where several instances were artificially created
using the SMOTE data augmentation strategy.</p>
        <p>From these preliminary results, it seems that domain adaptation does not increase the performance
of the proposed SER model on elderly speakers. This is probably related to several aspects. The elderly here
considered are actors or actresses, and thus they are not so significantly different from a
population of young and adult persons. Moreover, the elderly subjects are only 6, of whom only one is
female. A more realistic dataset should be considered to properly verify this research question.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Domain adaptation on language</title>
        <p>The second part of our study focused on speech sentiment recognition when
multi-language corpus datasets are taken into account. The trials tested for this analysis are summarized in
Table 4. Two different datasets were used: the English dataset CREMA-D-ADULT, used to train
the model, and the Italian dataset EMOVO, as Target and Test sets. Furthermore, as in the
elderly analysis, the results obtained varying the domain adaptation technique (KLIEP and TrAdaBoost)
and the evaluation strategy (LOSO, LOUO) were compared with the performance reached by the
classification model trained without domain adaptation. The values of accuracy, macro F1-score
and per-class F1-scores achieved in the different experiments are reported in Table 5.
The analysis of the results shows that the best performance with both validation
strategies was obtained applying the TrAdaBoost domain adaptation method. In particular,
the two macro F1-score values of 44% and 85%, obtained using respectively the LOSO and LOUO
validation strategies, outperform the value of 35% reached when no domain adaptation is
considered. As in the elderly analysis, the lowest overall performance was instead reached applying
the KLIEP domain adaptation strategy, with macro F1-score values close to 32% in both the
analyses performed. Another general consideration regards the recognition of the single classes. In
almost all the trials, the use of domain adaptation techniques made it possible to better recognize the
instances of the Positive class, often reaching more balanced classification performance in identifying
the three sentiments. Nevertheless, the Negative sentiment is still the class best recognized
by all the classification models examined, thus confirming what has already been observed
in the elderly analysis.</p>
        <p>Finally, the last remark concerns the performance differences between the two validation
strategies applied. In particular, partitioning the Target and Test sets by utterance
achieves better results than partitioning by subject. This can be explained by the fact that, in
addition to language, the division by utterance also takes into account the differences between
people with regard to personal vocal characteristics or how they express their emotions. Using
this method, data from each of the analyzed subjects appear in each of the folds generated,
allowing the classification model to better learn about vocal timbre differences or differences in
the individuals’ personalities. However, it is worth underlining that both the datasets analyzed
are acted, which perhaps makes the way the same subject expresses the same emotion more similar, even
across different sentences. For this reason, in future analyses, it may be necessary to validate the
hypotheses proposed here on new natural datasets collected in real situations.</p>
        <p>All the analyses were run on a computer with an Intel Core™ i7-7700HQ processor (2.80 GHz) and 16 GB of
RAM. In both the experiments, the proposed techniques take approximately
110 ms to extract features and classify a new instance lasting about 2 seconds. No relevant
variations in computational time have been detected when different domain adaptation strategies
are applied. The high execution speed of the algorithms could allow their integration into
near real-time systems. In this case, streams of audio directly collected from the speaker
might be divided into segments of about two seconds and sent to the algorithm for processing
and classification, thus generating a response to the user with a delay of about two seconds. The
implementation of such systems implies, however, that the classifier used in the process has
already been trained and adapted to the new data. These operations are highly time-consuming
and often require an execution time of minutes. For this reason, the proposed
domain adaptation techniques do not seem suitable for defining classifiers that continuously
adapt to newly acquired data; they appear instead useful in the development of near real-time
systems based on periodically updated pre-trained classifiers or of batch processing systems. Future
analyses will be carried out in this regard.</p>
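        <p>The segmentation step of the near real-time scenario outlined above could look like the following sketch, which cuts an incoming stream of samples into consecutive two-second segments. The function name and the choice to drop a trailing partial segment are assumptions; feature extraction and classification of each segment are left out.</p>
        <preformat><![CDATA[
```python
def two_second_segments(samples, sample_rate, segment_seconds=2.0):
    """Yield consecutive, non-overlapping segments of `segment_seconds` each.

    `samples` may be any iterable, for example a live audio stream.
    """
    segment_len = int(round(sample_rate * segment_seconds))
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == segment_len:
            yield buffer
            buffer = []
    # A trailing partial segment is dropped in this sketch;
    # zero-padding it would be an alternative.


# Toy stream: 10 samples at 2 samples per second gives 4-sample segments.
segments = list(two_second_segments(range(10), sample_rate=2))
print(segments)
```
]]></preformat>
        <p>With non-overlapping segments, a segment can only be classified once it is complete, which is where the response delay of about two seconds comes from.</p>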
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Sentiment and emotion recognition from speech is still an open field of research, especially when
considering different languages and ages. In particular, for the case of our interest, Italian elderly people,
no datasets are available in the literature. Domain adaptation techniques could partially compensate for
this lack of data. However, our preliminary results indicate that there is an urgent need for a more
realistic collection of data that also covers different ages. Domain
adaptation techniques seem to perform better in the case of cross-language datasets, paving the
way for further research in this direction. In particular, after a proper data collection, future
experiments could be conducted considering both language and local dialects, which are particularly
widespread among the elderly population. As regards the lack of performance increase
when applying domain adaptation models in the case of the multi-age corpus, conclusions cannot be
drawn, due to the peculiarity of the datasets available (where the collected speeches were
recorded by professional actors) and given the low presence of elderly people. Finally, in the
presented study only audio signals have been taken into account. In recent years, the use of
acoustic or textual features extracted from speech has often been paired with the use of other
data collected from the speakers. In particular, in several datasets in the literature, visual signals such
as facial expressions or body movements, physiological signals or behavioural biometric data
have been collected together with audio to support a multimodal approach to Speech Emotion
Recognition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In future works, similar strategies could also be applied in the case of elderly Italian
people in order to create more robust and accurate cross-corpus emotional classifiers.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research is supported by the FONDAZIONE CARIPLO “AMPEL: Artificial intelligence
facing Multidimensional Poverty in ELderly” (Ref. 2020-0232).</p>
      <p>[10] R. Nakatsu, J. Nicholson, N. Tosa, Emotion recognition and its application to computer
agents with spontaneous interactive capabilities, in: Proceedings of the seventh ACM
international conference on Multimedia (Part 1), 1999, pp. 343–351.
[11] A. Alhargan, N. Cooke, T. Binjammaz, Multimodal affect recognition in an interactive
gaming environment using eye tracking and speech signals, in: Proceedings of the 19th
ACM international conference on multimodal interaction, 2017, pp. 479–486.
[12] L.-S. A. Low, N. C. Maddage, M. Lech, L. B. Sheeber, N. B. Allen, Detection of clinical
depression in adolescents’ speech during family interactions, IEEE Transactions on
Biomedical Engineering 58 (2010) 574–586.
[13] F. Al Machot, A. H. Mosa, K. Dabbour, A. Fasih, C. Schwarzlmüller, M. Ali, K. Kyamakya, A
novel real-time emotion detection system from audio streams based on bayesian quadratic
discriminate classifier for adas, in: Proceedings of the Joint INDS’11 &amp; ISTET’11, IEEE,
2011, pp. 1–5.
[14] P. Gupta, N. Rajput, Two-stream emotion recognition for call center monitoring, in: Eighth
Annual Conference of the International Speech Communication Association, Citeseer,
2007.
[15] C. Vaudable, L. Devillers, Negative emotions detection as an indicator of dialogs quality
in call centers, in: 2012 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, 2012, pp. 5109–5112.
[16] F. Hegel, T. Spexard, B. Wrede, G. Horstmann, T. Vogt, Playing a different imitation
game: Interaction with an empathic android robot, in: 2006 6th IEEE-RAS International
Conference on Humanoid Robots, IEEE, 2006, pp. 56–61.
[17] C. Jones, A. Deeming, Affective human-robotic interaction, in: Affect and emotion in
human-computer interaction, Springer, 2008, pp. 175–185.
[18] M. S. Fahad, A. Ranjan, J. Yadav, A. Deepak, A survey of speech emotion recognition in
natural environment, Digital Signal Processing 110 (2021) 102951.
[19] B. T. Atmaja, A. Sasou, M. Akagi, Survey on bimodal speech emotion recognition from
acoustic and linguistic information fusion, Speech Communication (2022).
[20] A. Thakur, S. Dhull, Speech emotion recognition: A review, Advances in Communication
and Computational Technology (2021) 815–827.
[21] M. N. Stolar, M. Lech, R. S. Bolia, M. Skinner, Real time speech emotion recognition using
rgb image classification and transfer learning, in: 2017 11th International Conference on
Signal Processing and Communication Systems (ICSPCS), IEEE, 2017, pp. 1–8.
[22] G. Boateng, T. Kowatsch, Speech emotion recognition among elderly individuals using
multimodal fusion and transfer learning, in: Companion Publication of the 2020 International
Conference on Multimodal Interaction, 2020, pp. 12–16.
[23] M. Swain, S. Sahoo, A. Routray, P. Kabisatpathy, J. N. Kundu, Study of feature combination
using hmm and svm for multilingual odiya speech emotion recognition, International
Journal of Speech Technology 18 (2015) 387–393.
[24] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross lingual speech emotion recognition: Urdu
vs. western languages, in: 2018 International Conference on Frontiers of Information
Technology (FIT), IEEE, 2018, pp. 88–93.
[25] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive
review of speech emotion recognition systems, IEEE Access 9 (2021) 47795–47814.
[26] M. Lugger, M.-E. Janoir, B. Yang, Combining classifiers with diverse feature sets for robust
speaker independent emotion recognition, in: 2009 17th European Signal Processing
Conference, IEEE, 2009, pp. 1225–1229.
[27] B. Schuller, M. Lang, G. Rigoll, Robust acoustic speech emotion recognition by ensembles
of classifiers, in: Tagungsband Fortschritte der Akustik-DAGA# 05, München, 2005.
[28] A. M. Badshah, J. Ahmad, N. Rahim, S. W. Baik, Speech emotion recognition from
spectrograms with deep convolutional neural network, in: 2017 international conference on
platform technology and service (PlatCon), IEEE, 2017, pp. 1–5.
[29] K. Aghajani, I. Esmaili Paeen Afrakoti, Speech emotion recognition using scalogram based
deep structure, International Journal of Engineering 33 (2020) 285–292.
[30] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, N. Dehak, Deep neural networks
for emotion recognition combining audio and transcripts, arXiv preprint arXiv:1911.00432
(2019).
[31] H. S. Kumbhar, S. U. Bhandari, Speech emotion recognition using mfcc features and lstm
network, in: 2019 5th International Conference On Computing, Communication, Control
And Automation (ICCUBEA), IEEE, 2019, pp. 1–3.
[32] B. T. Atmaja, M. Akagi, Speech emotion recognition based on speech segment using lstm
with attention model, in: 2019 IEEE International Conference on Signals and Systems
(ICSigSys), IEEE, 2019, pp. 40–44.
[33] Q. Jian, M. Xiang, W. Huang, A speech emotion recognition method for the elderly based on
feature fusion and attention mechanism, in: Third International Conference on Electronics
and Communication; Network and Computer Technology (ECNCT 2021), volume 12167,
SPIE, 2022, pp. 398–403.
[34] M. Neumann, N. T. Vu, Improving speech emotion recognition with unsupervised
representation learning on unlabeled speech, in: ICASSP 2019-2019 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7390–7394.
[35] S. G. Koolagudi, K. S. Rao, Emotion recognition from speech: a review, International
journal of speech technology 15 (2012) 99–117.
[36] F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the recola multimodal
corpus of remote collaborative and affective interactions, in: 2013 10th IEEE international
conference and workshops on automatic face and gesture recognition (FG), IEEE, 2013, pp.
1–8.
[37] S. Steidl, Automatic classification of emotion related user states in spontaneous children’s
speech, Logos-Verlag Berlin, Germany, 2009.
[38] W. Fan, X. Xu, X. Xing, W. Chen, D. Huang, Lssed: a large-scale dataset and benchmark
for speech emotion recognition, in: ICASSP 2021-2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 641–645.
[39] D. Morrison, R. Wang, L. C. De Silva, Ensemble methods for spoken emotion recognition
in call-centres, Speech communication 49 (2007) 98–112.
[40] E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, B. Schuller, Categorical vs
dimensional perception of italian emotional speech (2018).
[41] V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, A. Nogueiras, Interface databases: Design
and collection of a multilingual emotional speech database., in: LREC, 2002.
[42] D. Deliyski, Steve An Xue, Effects of aging on selected acoustic voice parameters:
Preliminary normative data and educational implications, Educational gerontology 27 (2001)
159–168.
[58] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source
audio feature extractor, in: Proceedings of the 18th ACM international conference on
Multimedia, 2010, pp. 1459–1462.
[59] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers,
J. Epps, P. Laukka, S. S. Narayanan, et al., The geneva minimalistic acoustic parameter
set (gemaps) for voice research and affective computing, IEEE transactions on affective
computing 7 (2015) 190–202.
[60] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority
over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Heerdink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Van Kleef</surname>
          </string-name>
          ,
          <article-title>Reading emotions, reading people: Emotion perception and inferences drawn from perceived emotions</article-title>
          ,
          <source>Current Opinion in Psychology</source>
          <volume>43</volume>
          (
          <year>2022</year>
          )
          <fpage>85</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>El Ayadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Karray</surname>
          </string-name>
          ,
          <article-title>Survey on speech emotion recognition: Features, classification schemes, and databases</article-title>
          ,
          <source>Pattern recognition</source>
          <volume>44</volume>
          (
          <year>2011</year>
          )
          <fpage>572</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Swain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Routray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kabisatpathy</surname>
          </string-name>
          ,
          <article-title>Databases, features and classifiers for speech emotion recognition: a review</article-title>
          ,
          <source>International Journal of Speech Technology</source>
          <volume>21</volume>
          (
          <year>2018</year>
          )
          <fpage>93</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iliou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Giannoukos</surname>
          </string-name>
          ,
          <article-title>Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>43</volume>
          (
          <year>2015</year>
          )
          <fpage>155</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Design of aging smart home products based on radial basis function speech emotion recognition</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>882709</fpage>
          -
          <lpage>882709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Akçay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</article-title>
          ,
          <source>Speech Communication</source>
          <volume>116</volume>
          (
          <year>2020</year>
          )
          <fpage>56</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
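          <string-name>
            <given-names>D. J.</given-names>
            <surname>France</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Shiavi</surname>
          </string-name>
          ,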
          <string-name>
            <given-names>S.</given-names>
            <surname>Silverman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silverman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilkes</surname>
          </string-name>
          ,
          <article-title>Acoustical properties of speech as indicators of depression and suicidal risk</article-title>
          ,
          <source>IEEE transactions on Biomedical Engineering</source>
          <volume>47</volume>
          (
          <year>2000</year>
          )
          <fpage>829</fpage>
          -
          <lpage>837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Forbes-Riley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rotaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purandare</surname>
          </string-name>
          ,
          <article-title>Using system and user performance features to improve emotion detection in spoken tutoring dialogs</article-title>
          ,
          <source>in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          , volume
          <volume>2</volume>
          ,
          <year>2006</year>
          , pp.
          <fpage>797</fpage>
          -
          <lpage>800</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cevher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zepf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Klinger</surname>
          </string-name>
          ,
          <article-title>Towards multimodal emotion recognition in german speech events in cars using transfer learning</article-title>
          , arXiv preprint arXiv:1909.02764 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nicholson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tosa</surname>
          </string-name>
          ,
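          <article-title>Emotion recognition and its application to computer agents with spontaneous interactive capabilities</article-title>
          ,
          <source>in: Proceedings of the seventh ACM international conference on Multimedia (Part 1)</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>351</lpage>
          .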
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Thörnvik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Söderström</surname>
          </string-name>
          ,
          <article-title>Age and voice quality in professional singers</article-title>
          ,
          <source>Logopedics Phoniatrics Vocology</source>
          <volume>23</volume>
          (
          <year>1998</year>
          )
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>D.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <article-title>Age driven automatic speech emotion recognition system</article-title>
          ,
          <source>in: 2016 International Conference on Computing, Communication and Automation (ICCCA)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>1005</fpage>
          -
          <lpage>1010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>G.</given-names>
            <surname>Soğancıoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Verkholyak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fedotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cadée</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <article-title>Is everything fine, grandma? acoustic and linguistic modeling for robust elderly speech emotion recognition</article-title>
          , arXiv preprint arXiv:2009.03432 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bergler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-M.</given-names>
            <surname>Messner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stappen</surname>
          </string-name>
          , et al.,
          <article-title>The interspeech 2020 computational paralinguistics challenge: Elderly emotion, breathing &amp; masks</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Keutmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Gur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <article-title>Crema-d: Crowdsourced emotional multimodal actors dataset</article-title>
          ,
          <source>IEEE transactions on affective computing</source>
          <volume>5</volume>
          (
          <year>2014</year>
          )
          <fpage>377</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Pichora-Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dupuis</surname>
          </string-name>
          ,
          <source>Toronto emotional speech set (TESS)</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>G.</given-names>
            <surname>Costantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Iaderola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paoloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <article-title>Emovo corpus: an italian emotional speech database</article-title>
          ,
          <source>in: International Conference on Language Resources and Evaluation (LREC 2014)</source>
          , European Language Resources Association (ELRA)
          ,
          <year>2014</year>
          , pp.
          <fpage>3501</fpage>
          -
          <lpage>3504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <article-title>Meld: A multimodal multi-party dataset for emotion recognition in conversations</article-title>
          , arXiv preprint arXiv:1810.02508 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <article-title>Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2236</fpage>
          -
          <lpage>2246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nakajima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kashima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buenau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kawanabe</surname>
          </string-name>
          ,
          <article-title>Direct importance estimation with model selection and its application to covariate shift adaptation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>20</volume>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-R.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Boosting for transfer learning</article-title>
          , volume
          <volume>227</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          . doi:10.1145/1273496.1273521.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grandini</surname>
          </string-name>
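          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bagli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Visani</surname>
          </string-name>
          ,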
          <article-title>Metrics for multi-class classification: an overview</article-title>
          , arXiv preprint arXiv:2008.05756 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Narayanaswamy</surname>
          </string-name>
          ,
          <article-title>Optimal thresholding of classifiers to maximize f1 measure</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>B.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <article-title>Environmental effects on reliability and accuracy of mfcc based voice recognition for industrial human-robot-interaction</article-title>
          ,
          <source>Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture</source>
          <volume>235</volume>
          (
          <year>2021</year>
          )
          <fpage>1939</fpage>
          -
          <lpage>1948</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>