Linguistic and Gender Variation in Speech Emotion Recognition Using Spectral Features*

Zachary Dair, Ryan Donovan, and Ruairi O'Reilly[0000−0001−7990−3461]

Munster Technological University, Cork, Ireland
zachary.dair@mycit.ie, brendan.donovan@mycit.ie, ruairi.orielly@mtu.ie
www.mtu.ie

* This publication has emanated from research supported in part by a Grant from Science Foundation Ireland under Grant number 18/CRT/6222. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This work explores the effect of gender-based and linguistic-based vocal variations on the accuracy of emotive expression classification. Emotive expressions are considered from the perspective of spectral features in speech (Mel-frequency Cepstral Coefficients, Melspectrogram, Spectral Contrast). Emotions are considered from the perspective of Basic Emotion Theory. A convolutional neural network is utilised to classify emotive expressions in emotive audio datasets in English, German, and Italian. Vocal variations for spectral features are assessed by (i) a comparative analysis identifying suitable spectral features, (ii) the classification performance for mono-, multi- and cross-lingual emotive data and (iii) an empirical evaluation of a machine learning model to assess the effects of gender and linguistic variation on classification accuracy. The results showed that spectral features provide a potential avenue for increasing the accuracy of emotive expression classification. Additionally, the accuracy of emotive expression classification was high within mono- and cross-lingual emotive data, but poor in multi-lingual data. Similarly, there were differences in classification accuracy between gender populations. These results demonstrate the importance of accounting for population differences to enable accurate speech emotion recognition.

Keywords: Affective Computing · Speech Emotion Recognition · Machine Learning · Prosody Analysis · Convolutional Neural Networks

1 Introduction

Speech emotion recognition (SER) is the classification of the emotional states of a speaker from speech. These emotional states can either be discrete experiences, such as the basic emotions (Anger, Disgust, Fear, Joy, Sadness, and Surprise), or attributes of emotional states (Arousal, Valence, Dominance). The ability to accurately classify these emotional states through SER enables the provision of tailored services that can adapt to the psychological needs of user groups.

SER attempts to classify these emotional states via verbal and non-verbal components of speech. Verbal components of speech are the specific words a speaker says. Verbal SER involves classifying emotional expressions from the word choice or associations, syntax, and use of colloquialisms or sarcasm concerning the intent of the speaker [13]. Non-verbal components of speech refer to how the speaker expresses their speech. Non-verbal SER involves classifying emotional expressions from the speaker's acoustic features (Pitch, Tone, Timbre, Number of pauses, Loudness, Speech Rate) while speaking [14].

Research has shown that non-verbal components provide valuable information, distinct from verbal content, in accurately classifying emotional expressions. For example, non-verbal features are capable of indicating emotional signals when the verbal content of the speech is neutral [29] [25].
Many non-verbal features have been shown to exist across multiple languages, enabling generalisable analysis [25] [27]. Additionally, non-verbal features have been shown to have a direct impact on emotion recognition processes in the brain, indicating they are vital to achieving accurate emotion classification [28]. Overall, the larger number of features available for non-verbal SER, in comparison to verbal SER, enables more accurate emotion classification.

Traditional non-verbal SER1 approaches recruit human observers, who are asked to classify emotions from non-verbal content collected from one or several speakers. While traditional approaches are useful for relatively small-scale datasets, as they demonstrate high levels of accuracy [26] [27] [4], they are impractical for dealing with the large and continuously growing datasets that characterise modern human-computer interactions. In an attempt to scale the benefits of SER, research and industrial applications have been developed to conduct SER automatically. These applications range from open-source research tools (for example, Audeering) to commercial tools (for example, Good Vibrations, Vokaturi, and deep affect API) [11] [9], all of which employ machine learning-based approaches. If SER can be accurately automated, it would facilitate increased efficacy in human-computer interactions, user modelling, and personalisation services.

1 Hereinafter, non-verbal SER is referred to as SER as shorthand.

Despite the potential benefits of automated SER, it has not been widely implemented in everyday life settings [1]. This is despite reviews of the SER literature demonstrating that automated SER approaches perform similarly to or even better than traditional approaches [1]. One reason for the limited uptake of automated SER approaches is that they are not as generalisable and adaptable as human classifiers, who can use contextual information (for example, the gender of the speaker and the language of the speech) when classifying speech. These two factors, gender and language, have been shown to significantly affect the accuracy of automated SER approaches [17] [23]. SER models that account for gender differences are more accurate than those that do not [31]. It is difficult, however, to accurately model these differences. While there exist many features of speech that are significantly influenced by gender, there is considerable overlap between males and females on these same features. This overlap makes it difficult to distinguish gender- and language-dependent features that reliably indicate emotions from those that do not. Concerning language, SER models trained on one language significantly decrease in performance when tested on another language. This contrasts with human observers in traditional SER, who are capable of detecting emotional expressions in speech in languages they do not speak [4] [26] [27].

One potential mechanism for improving the accuracy of SER across different gender and linguistic populations is spectral features. Spectral features of audio are computed by converting a time-based signal into the frequency domain. These features represent the distribution of energy across frequencies and the harmonic components of sound. Components such as pitch changes in the audio signal can assist in gender differentiation, and voice quality can be identified through voice level as either tense, harsh, breathy, or lax [12]. Aggregating spectral features' continuous, qualitative, and spectral data aids differentiation between genders and linguistic populations, thereby enabling more generalisable and accurate emotive expression classifications.
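As an illustration of how such features are obtained, the sketch below converts a time-domain signal into the three spectral representations that prove most useful later in this work (MFCC, Melspectrogram, Spectral Contrast) using librosa, the library employed in Section 3.2. The file path and parameter choices (20 MFCCs, 128 Mel bands, the default 7 contrast bands) are illustrative assumptions, although with these settings the concatenated per-frame means form a 155-point vector of the size reported in Section 4.

```python
import librosa
import numpy as np

# Load a speech sample as a time-domain signal (the path is a placeholder).
signal, sr = librosa.load("speech_sample.wav", sr=48000)

# Convert to the frequency domain and derive the spectral representations
# discussed above. Parameter values are assumptions, not the paper's settings.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)            # spectral envelope shape
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)  # energy on a Mel scale
contrast = librosa.feature.spectral_contrast(y=signal, sr=sr)      # peak vs. valley energy per band

# Each feature is a (bands x frames) matrix; averaging over frames yields
# one fixed-length descriptor per recording.
features = np.concatenate([m.mean(axis=1) for m in (mfcc, mel, contrast)])
print(features.shape)  # (155,) = 20 + 128 + 7 under the assumed settings
```

Averaging each representation over frames is one common way to obtain a fixed-length descriptor per recording; the paper does not state which pooling it applies.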
This study assesses the effect of spectral features on the performance of automatic SER across different gender and linguistic populations by (i) conducting a comparative analysis of spectral features to identify a suitable feature set, (ii) comparing the classification performance of SER approaches when using mono-lingual, multi-lingual, and cross-lingual emotive data, and (iii) empirically evaluating the performance of a machine learning (ML) model to assess the effect of gender and linguistic variations on SER classification accuracy.

2 Related Work

Previous research has examined the differences between gender populations in acoustic features and their effect on SER performance. Several acoustic features have been described as "gender-dependent". Gender-dependent in this case means that gender significantly influences the expression of that feature [30]. Pitch is an example of a gender-dependent feature: females score higher, on average, than males on pitch [16]. This information, however, is not sufficient to enable generalisable SER classification, because there is considerable overlap between males and females on this feature [16]. This is also the case for several other acoustic features described as "gender-dependent".

Similarly, previous research has examined the impact of language in developing accurate SER models. Human observers can accurately classify emotions in their native language and in non-native languages [4]. This result has been replicated across various cultures, including indigenous tribes, demonstrating the generalisability of traditional SER [26] [27]. Similarly, automated SER models achieve a high level of accuracy in mono-lingual settings, where the models are trained and evaluated in the same language. The performance of these models, however, drops significantly when they are tested across languages [10] [21].

If ML models can reliably differentiate between acoustic features that indicate emotional expressions for each gender and linguistic population, then one should expect generalisability of SER performance between both populations [31]. To achieve this, spectral features that reliably indicate emotional experiences are first required.

A survey [8] analysed features across 17 distinctive datasets. These datasets comprise 9 languages, both male and female speakers, professional and non-professional acting sessions, recorded call centre conversations, and speech recorded under simulated and naturally occurring stress situations. A pool of acoustic features for SER was analysed in [8] and [1]; these features can be divided into several categories: Continuous, Qualitative, Spectral, and non-linear Teager Energy Operator (TEO) based features. The results of both reviews indicated that Continuous and Spectral features were related to emotional expressions in speech. A weaker relationship was found between emotional expressions and Qualitative and TEO features.

To automatically classify relevant spectral features, an ML methodology is proposed in [15]. The methodology uses a convolutional neural network (CNN) for SER analysis.
The approach consists of using spectral features such as Mel-frequency Cepstral Coefficients (MFCCs), Mel-scaled spectrogram, Chromagram, Spectral contrast and Tonnetz representation, extracted from emotive audio belonging to three distinct datasets. That work, however, focused on a combined-gender classification of emotive expression, which inhibits the analysis of vocal variations between genders. This study extends that work by analysing the effect of gender and linguistic populations on the emotive expression classification of spectral features.

3 Methodology

To assess the effect of spectral features for emotive expression classification across different gender and linguistic populations, a series of experiments was conducted. This required the identification of a suitable set of gender- and/or language-dependent spectral features. Classification was enabled through the CNN described in Section 3.5; the performance of this CNN is compared across experiments that account for linguistic and gender differences and experiments that do not account for these differences. The gender populations are male and female, and the linguistic populations are English, German, and Italian.

Discrete emotional models identify several distinct emotions to be classified. This study uses Basic Emotion Theory (BET) as the discrete emotional model for SER. In BET the distinct emotions are Anger, Disgust, Fear, Joy, Sadness, and Surprise. These six emotions are considered basic as (i) they consistently correlate with psychological, behavioural, and neurophysiological activity [7], which makes their objective measurement possible, and (ii) they interact to form more cognitively and culturally mediated emotions, such as shame or guilt [7]. BET contrasts with dimensional models of emotion that focus on attributes of emotions (for example, arousal, valence). BET models have been extensively used in SER research as they capture a wider range of emotions and are intuitive to label in comparison to dimensional models [1].

3.1 Emotive Speech Data

Three distinct emotive audio datasets were used (see Table 1) to enable a comparative analysis of variations between the data. These variations can originate from many factors such as the gathering method, emotions exhibited, language, speaker gender, or sampling rate.

Table 1. Details of the emotive expression datasets utilised.
Datasets: RAVDESS [18] | EMO-DB [3] | EMOVO [5]
Population: 12M/12F | 5M/5F | 3M/3F
Actor Ages: 21-33 | 21-35 | 23-30
Professional Actors: Yes | Yes | Yes
Language: English | German | Italian
Emotions: A, F, J, S, D, Sur, N | A, F, J, B, S, D, N | A, F, J, S, D, Sur, N
Total Instances: 1440 | 535 | 588
No. Unique Utterances: 2 | 10 | 14
Sampling Rate: 48 kHz | 48 kHz (16 kHz) | 48 kHz
Human Reported Acc.: 0.62 | 0.86 | 0.80
Reported Acc.: 0.71 [15], 0.74 [6] | 0.82 [15] | 0.73 [2]

RAVDESS contains emotional speech data constituting statements and songs in English. For the proposed work, only speech samples are considered. Participants uttered two neutral statements across multiple trials. For each trial, participants were asked to utter the statement in a manner that conveyed one of the six basic emotions. The statements were controlled to ensure equal levels of syllables, word frequency, and familiarity to the speaker.
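The emotion and speaker-gender labels required by the later experiments are encoded in RAVDESS file names, which consist of seven hyphen-separated identifier fields; the third field is the emotion code and the last is the actor ID, with odd IDs denoting male and even IDs female actors [18]. The paper does not describe its labelling step, so the following parser is an illustrative sketch only (EMO-DB and EMOVO use different naming schemes and would need their own parsers):

```python
from pathlib import Path

# RAVDESS emotion codes as defined in [18]. "03" is 'happy' in RAVDESS terms,
# mapped here to the BET label Joy; how the paper handles code "02" ('calm'),
# which is not a BET emotion, is not specified.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "joy", "04": "sadness",
            "05": "anger", "06": "fear", "07": "disgust", "08": "surprise"}

def parse_ravdess(path: Path) -> dict:
    """Recover emotion and speaker gender from a RAVDESS file name,
    e.g. '03-01-05-01-02-01-12.wav' -> anger, female (actor 12)."""
    fields = path.stem.split("-")
    emotion = EMOTIONS[fields[2]]
    gender = "male" if int(fields[6]) % 2 == 1 else "female"
    return {"file": path.name,
            "emotion": emotion,                       # label for Sections 3.2/3.3
            "gender_emotion": f"{gender}-{emotion}"}  # label for Section 3.4

# Example: build both label sets for every speech file in a local copy
# of the corpus (directory name is a placeholder).
labels = [parse_ravdess(p) for p in Path("RAVDESS").rglob("*.wav")]
```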
EMO-DB contains emotional speech data constituting statements in German. Participants uttered ten statements, comprised of everyday language and syntactic structure, of various lengths to simulate natural speech. Each utterance was evaluated with regard to the recognisability and naturalness of the emotions exhibited. The emotive reaction of surprise is not considered.

EMOVO contains emotional speech data constituting statements in Italian. The participants uttered fourteen distinct semantically neutral statements. Each conveys a basic emotion and is spoken naturally. An important consideration in the creation of this dataset was the presence of all phonemes of the Italian language and a balanced presence of voiced and unvoiced consonants.

3.2 Feature Selection

In order to identify spectral features indicative of emotive expression, feature selection is performed; this identifies information such as pitch, energy, voice levels and energy distribution for classification. Spectral features were extracted using the audio analysis library Librosa [19]. These features are as follows:

Mel-Frequency Cepstral Coefficients (MFCC): represents the shape of a signal spectrum, achieved by collectively representing a Mel-frequency cepstrum per frame.
Chroma Energy Normalized (CENS): represents statistics indicative of normalised values quantifying tempo, articulation, and pitch deviations.
Zero-Crossing Rate (ZCR): represents the rate of change in the audio signal from positive to negative or negative to positive, through zero.
Chromagram: represents a transformed signal spectrum built using the chromatic scale, capturing the harmonic and melodic characteristics.
Melspectrogram: represents a spectrogram where the frequency is converted from a linear scale to a Mel scale, resulting in a spectrogram reflective of how humans perceive frequencies, capturing the amplitude of the signal.
Spectral Contrast: represents the energy contrast computed by comparing the peak energy and valley energy in each band converted from spectrogram frames.
Tonnetz: represents the tonal features in a 6-dimensional format, capturing traditional harmonic relationships (perfect fifth, minor third and major third).

In order to identify a set of spectral features that captures sufficient emotive data, a classifier is trained on these features individually before being trained on permutations of the features across the datasets to identify the best performing feature set.

3.3 Linguistic Variation

To evaluate the accuracy of a SER model in relation to linguistic variation, three experiments are conducted across three languages (English, German, Italian) as follows:

1. Mono-Lingual - The CNN is used to classify spectral features indicative of emotive expression from each language independently. The results are used as a baseline of the model's classification performance in a single language.
2. Multi-Lingual - This experiment consists of three permutations. For each, the CNN is first trained on one of the chosen languages and then evaluated for SER accuracy against the remaining languages. The intent is to identify spectral features' capacity to classify emotive expression independent of language.
3. Cross-Lingual - The CNN is trained on an aggregation of all training data. This will provide insights into the performance of a CNN model when its training corpus contains multiple languages. The intent is to identify the generalisability of the chosen spectral features for emotive expression classification.
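The three configurations above can be expressed as train/test combinations over per-corpus feature matrices. The sketch below illustrates this layout only: the random placeholder data, the stand-in MLP classifier (used here in place of the CNN of Section 3.5) and its settings are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder data: in practice these would be the per-recording spectral
# feature vectors (e.g. the 155-point descriptors sketched earlier) and
# emotion labels; corpus sizes follow Table 1 (1440 / 535 / 588 recordings).
features = {name: rng.normal(size=(n, 155))
            for name, n in [("RAVDESS", 1440), ("EMO-DB", 535), ("EMOVO", 588)]}
labels = {name: rng.integers(0, 7, size=len(x)) for name, x in features.items()}

def build_model():
    # Stand-in classifier; the paper uses the CNN described in Section 3.5.
    return MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)

# (1) Mono-lingual: 5-fold cross-validation within each language.
for name, X in features.items():
    scores = cross_val_score(build_model(), X, labels[name], cv=5)
    print("mono", name, round(scores.mean(), 2))

# (2) Multi-lingual: train on one language, evaluate on each of the others.
for train_name in features:
    model = build_model().fit(features[train_name], labels[train_name])
    for test_name in features:
        if test_name != train_name:
            acc = model.score(features[test_name], labels[test_name])
            print("multi", train_name, "->", test_name, round(acc, 2))

# (3) Cross-lingual: pool all corpora and cross-validate on the combined data.
X = np.vstack(list(features.values()))
y = np.concatenate(list(labels.values()))
print("cross", round(cross_val_score(build_model(), X, y, cv=5).mean(), 2))
```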
3.4 Cross-Gender Emotion

To assess the effect of gender populations on SER, an experiment was conducted using the spectral features identified in Section 3.2, comprising two steps. In step one, the data is classified using the six basic emotions. This initial classification provides a baseline accuracy where gender is not specified. In step two, the same dataset is classified with gender-emotion labels (for example, Male-Anger/Female-Anger; Male-Joy/Female-Joy). The intent is to evaluate whether the CNN can identify gender-dependent acoustic elements from the spectral features.

3.5 Classification Of Emotive Speech

The architecture of the CNN used for classification is depicted in Figure 1. A CNN was re-implemented due to the high emotive expression classification performance using spectral features exhibited in [6]. Optimal hyper-parameters were identified from a comparative analysis against related approaches. The model is trained on the extracted spectral features over 150 epochs using a batch size of 16, and 5-fold validation is performed. During each epoch a portion of the data is used to evaluate the model. Model performance is evaluated across the metrics precision, recall and F1-score for each of the six basic emotions.

Fig. 1. An overview of the proposed approach, detailing the CNN layer structure.

4 Results

Feature extraction - The comparative analysis of the spectral features in isolation provides insights into the performance on a per-feature basis. MFCC, Melspectrogram and Spectral Contrast were the highest performing individual spectral features across the datasets. When combined, these features formed a vector of 155 data points and the highest performing permutation of spectral features, as denoted in Table 2.

Linguistic variation - The results for emotive analysis in mono-, multi- and cross-lingual data are denoted in Table 3; these highlight the considerations for SER across multiple languages. Mono-lingual and cross-lingual approaches achieve high accuracies. A degradation in performance is experienced with multi-lingual approaches.

Cross-Gender Emotion - Table 4 identifies discrepancies in the emotive classification performance across both gender and language. This indicates a degree of vocal variance stemming from the population differences, highlighting considerations for gender-specific SER approaches.

5 Discussion

Feature extraction - There are several notable results from the feature extraction stage that are worth discussing. Firstly, the results of the comparative analysis showed that the MFCC feature enabled the highest emotion recognition accuracy within each dataset, demonstrating the importance of modelling the phonetic properties, found within the speech signal shape, in enabling accurate emotion classification across languages. Secondly, the performance of the Melspectrogram feature varied across the datasets: Melspectrogram performed highly on EMO-DB and EMOVO and poorly on RAVDESS. This inconsistency was caused by the significantly lower amplitude within the audio data of RAVDESS compared to EMO-DB and EMOVO, as denoted in Table 2. Participants in the RAVDESS dataset were given instructions to exhibit emotions with varying intensity (normal and strong); additionally, post-processing procedures are likely the cause of the lower level of amplitude. This has significant consequences for classification when amplitude is utilised as a measure of emotive expression.
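This amplitude difference can be inspected directly from the raw waveforms; the sketch below computes a mean absolute sample amplitude per corpus, analogous to the "Average Absolute Amplitude" row of Table 2. The paper does not specify its measurement procedure, so the directory layout and the per-file averaging here are assumptions.

```python
from pathlib import Path
import librosa
import numpy as np

def mean_abs_amplitude(corpus_dir: str) -> float:
    """Average absolute sample amplitude over every .wav file in a corpus
    (files loaded at their native sampling rate, samples in [-1, 1])."""
    per_file = []
    for wav in Path(corpus_dir).rglob("*.wav"):
        signal, _ = librosa.load(wav, sr=None)
        per_file.append(np.mean(np.abs(signal)))
    return float(np.mean(per_file))

# Directory names are placeholders for local copies of the corpora.
for corpus in ("RAVDESS", "EMO-DB", "EMOVO"):
    print(corpus, round(mean_abs_amplitude(corpus), 2))
```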
Table 2. The classifier's F1 scores for individual spectral features and the top four combinations of spectral features. Results are listed in descending order based on the mean F1 score for single/multi-feature classification across the datasets.
Spectral Features: RAVDESS | EMO-DB | EMOVO | Mean
MFCC: 0.69 | 0.71 | 0.71 | 0.70
Melspectrogram: 0.44 | 0.66 | 0.63 | 0.58
Spectral Contrast: 0.39 | 0.41 | 0.45 | 0.42
CENS: 0.31 | 0.44 | 0.34 | 0.36
STFT: 0.22 | 0.42 | 0.37 | 0.34
ZCR: 0.23 | 0.32 | 0.15 | 0.24
Tonnetz: 0.22 | 0.27 | 0.20 | 0.23
Spectral Feature Permutations
MFCC, Melspectrogram, Spectral Contrast: 0.70 | 0.76 | 0.78 | 0.75
Spectral Contrast, Melspectrogram, MFCC: 0.68 | 0.76 | 0.75 | 0.73
MFCC, STFT, Spectral Contrast: 0.67 | 0.70 | 0.69 | 0.69
STFT, Melspectrogram, Spectral Contrast, Tonnetz: 0.51 | 0.67 | 0.60 | 0.59
Additional Characteristics
Average Absolute Amplitude (dB): 0.07 | 0.62 | 0.17 | NA

Thirdly, Spectral Contrast in isolation provides greater average accuracy than the remaining individual features; therefore, the contrast between peak and valley energy is partially indicative of emotion in emotive speech. The high performance of these features in isolation indicates their potential suitability for emotive expression classification. Finally, the results showed that combining these three features increased accuracies for each dataset. This demonstrates the importance of including such spectral features in SER classifications.

Linguistic variation - SER accuracy decreased significantly between the different types of analysis. The high performance on mono-lingual analysis indicates the importance of the selected spectral features for each language. The poor performance in multi-lingual analysis, however, indicates a lack of universality between spectral representations of emotions across the datasets. For example, there was significant variation in the amplitude and pitch range across the languages, as identified in [20]. Additionally, these differences likely stemmed from differences in data collection (equipment, volume, actor-microphone distance) across the datasets, thereby decreasing the performance of the CNN in multi- and cross-lingual analysis. This highlights the need for standardised recording procedures within SER research, and/or additional procedures to normalise linguistic variances.

Table 3. Classification accuracy per language derived from 5-fold cross-validation, with train/test data specified (X represents data from the corresponding column name).
Method | Train/Test | English | German | Italian | Combined
Mono-Ling. | X/X | 0.70 | 0.76 | 0.78 | NA
Multi-Ling.-Eng. | RAVDESS/X | 0.70 | 0.20 | 0.16 | NA
Multi-Ling.-Ger. | EMO-DB/X | 0.15 | 0.76 | 0.16 | NA
Multi-Ling.-Ital. | EMOVO/X | 0.17 | 0.26 | 0.78 | NA
Cross-Ling. | All/All | NA | NA | NA | 0.66

Cross-Gender Emotion - There were substantial discrepancies in spectral representations between the gender populations. As a result, SER performance between the genders differed significantly across the six basic emotions. These differences varied across languages, indicating that there is a generalisable effect of gender on SER performance when using spectral features; however, this is impacted by linguistic variation. For example, high-energy emotions (such as Anger and Joy) were more clearly detected in German-speaking males than in other linguistic or gender groups. The German language is characterised by higher amplitude, energy and harsher speech when articulated by males [20]. This likely contributed to the higher accuracy of detecting those emotions from German-speaking males. Disgust and Fear, in contrast, were more accurately classified from the speech of females across each language. This may have resulted from differences in vocals, particularly pitch, between the gender groups [22]. Additionally, female voices tend to articulate in a softer manner [24], conducive to representing softer, lower-amplitude emotions.
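The per-emotion precision, recall and F1 scores tabulated in Table 4 correspond to a standard per-class report over the test folds. A minimal sketch using scikit-learn is given below; the paper does not state its evaluation tooling, and the label arrays here are random placeholders.

```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
emotions = ["anger", "joy", "fear", "sadness", "surprise", "disgust", "neutral"]

# Placeholder predictions; in practice y_true and y_pred come from the CNN's
# test folds, either with plain emotion labels (combined classification) or
# with the gender-emotion labels of Section 3.4 (gender-specific classification).
y_true = rng.choice(emotions, size=200)
y_pred = rng.choice(emotions, size=200)

# Per-class precision, recall and F1, as tabulated per dataset in Table 4.
print(classification_report(y_true, y_pred, zero_division=0))
```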
These results likely contributed to the weaker performance of combined-gender classification in comparison to gender-specific classifications of emotion. Combined approaches can account for variations between genders and languages; however, in certain cases, a specific gender may act as a limiting factor, reducing the overall accuracy.

Table 4. The classifier's precision (Prec.), recall (Rec.) and F1 scores per BET emotion. The results demonstrate the classification performance of male and female emotive expression independently, and of combined-gender emotive speech classification, across each dataset.
Dataset | Emotion | Male (Prec./Rec./F1) | Female (Prec./Rec./F1) | Combined (Prec./Rec./F1)
RAVDESS | Anger | 0.76/0.76/0.76 | 0.76/0.84/0.80 | 0.75/0.68/0.71
EMO-DB | Anger | 0.90/1.00/0.95 | 0.78/0.78/0.78 | 0.92/0.86/0.89
EMOVO | Anger | 0.90/0.90/0.90 | 1.00/0.92/0.96 | 0.86/0.92/0.89
RAVDESS | Joy | 0.57/0.44/0.50 | 0.73/0.73/0.73 | 0.86/0.62/0.72
EMO-DB | Joy | 0.75/0.75/0.75 | 0.46/0.67/0.55 | 0.65/0.87/0.74
EMOVO | Joy | 0.83/0.77/0.80 | 0.60/0.60/0.60 | 0.88/0.64/0.74
RAVDESS | Fear | 0.59/0.59/0.59 | 0.85/0.85/0.85 | 0.83/0.71/0.76
EMO-DB | Fear | 1.00/0.75/0.86 | 0.80/1.00/0.89 | 0.67/0.40/0.50
EMOVO | Fear | 0.67/0.80/0.73 | 0.71/1.00/0.83 | 0.64/0.69/0.67
RAVDESS | Sadness | 0.50/0.73/0.59 | 0.81/0.57/0.67 | 0.49/0.68/0.57
EMO-DB | Sadness | 0.57/0.80/0.67 | 1.00/1.00/1.00 | 0.53/1.00/0.69
EMOVO | Sadness | 1.00/1.00/1.00 | 0.88/1.00/0.93 | 0.94/0.85/0.89
RAVDESS | Surprise | 0.67/0.48/0.56 | 0.68/0.79/0.73 | 0.62/0.85/0.72
EMOVO | Surprise | 0.64/0.70/0.67 | 0.56/0.45/0.50 | 0.59/0.84/0.70
RAVDESS | Disgust | 0.60/0.80/0.69 | 0.73/0.84/0.78 | 0.77/0.52/0.62
EMO-DB | Disgust | 0.67/0.50/0.57 | 0.88/0.70/0.78 | 1.00/0.80/0.89
EMOVO | Disgust | 0.75/0.50/0.60 | 0.86/0.86/0.86 | 0.64/0.70/0.67
RAVDESS | Neutral | 0.89/0.84/0.86 | 0.89/0.86/0.88 | 0.76/0.84/0.80
EMO-DB | Neutral | 0.67/0.67/0.67 | 1.00/0.50/0.67 | 0.83/1.00/0.91
EMOVO | Neutral | 0.86/1.00/0.92 | 0.83/0.83/0.83 | 1.00/0.86/0.92

Limitations - The major limitations of this work concern sampling issues. Firstly, the sample size across the three datasets is small (N = 40). Since small sample sizes can exaggerate the variances between populations, the small sample size in this study may have exaggerated the real differences in spectral features between gender and linguistic populations. Secondly, the datasets compare linguistic populations across an unequal number of speakers. Different sample sizes will have higher or lower ranges of variability. Comparing different sample sizes makes it difficult to determine whether the results stem from real between-group differences or are the result of a higher level of "noisy" data found in small sample sizes. These limitations damage the generalisability of the work. In a follow-up study, the sample size of the overall dataset will be increased and the number of speakers per linguistic group will be controlled.

6 Conclusion

This work was an exploratory analysis of the effect of linguistic and gender variation on the emotive classification of spectral features.
The results showed that features using the Mel scale and representations of amplitude and energy are important for accurate SER across different gender and linguistic populations. Additionally, the results showed that higher-energy emotions such as Anger, Joy and Surprise were easier to identify when originating from male voices in a high-amplitude, harsh language such as German. Similarly, lower-energy emotions such as Disgust and Fear were easier to identify from female voices in each language. These observations highlight the importance of signal amplitude and energy when analysing emotion across gender and language.

The performance of emotive expression classification across mono-, multi- and cross-lingual data provides insights into the linguistic variation of emotive audio. Mono-lingual approaches are suitable as baselines in comparison to multi- and cross-lingual approaches. A linguistic variance between emotive expressions represented by spectral features can be identified. To overcome these variances, it is recommended that cross-lingual approaches, which combine languages for training, are implemented.

Future work should explore whether classification accuracy is affected by (i) model optimisation in terms of structure and/or (ii) comprehensive hyper-parameter tuning (epochs, batch size, optimisers) using a grid search technique. Additionally, the relationships between linguistics, gender and emotion should be explored from vocal, psychological and technical classification perspectives, with a larger sample size, to garner further insights into improving SER generalisability.

References

1. Akçay, M.B., Oğuz, K.: Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication 116, 56–76 (2020). https://doi.org/10.1016/j.specom.2019.12.001
2. Ancilin, J., Milton, A.: Improved speech emotion recognition with Mel frequency magnitude coefficient. Applied Acoustics 179, 108046 (2021). https://doi.org/10.1016/j.apacoust.2021.108046
3. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: 9th ECSCT (2005)
4. Cordaro, D.T., Keltner, D., Tshering, S., Wangchuk, D., Flynn, L.M.: The voice conveys emotion in ten globalized cultures and one remote village in Bhutan. Emotion 16(1), 117 (2016)
5. Costantini, G., Iadarola, I., Paoloni, A., Todisco, M.: EMOVO corpus: an Italian emotional speech database. In: LREC, p. 4 (2014)
6. Dair, Z., Donovan, R., O'Reilly, R.: Classification of emotive expression using verbal and non verbal components of speech. In: 2021 ISSC. pp. 1–8 (2021). https://doi.org/10.1109/ISSC52156.2021.9467869
7. Ekman, P.: Emotions revealed. BMJ 328(Suppl S5) (2004)
8. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition 44(3), 572–587 (2011). https://doi.org/10.1016/j.patcog.2010.09.020
9. Eyben, F., Huber, B., Marchi, E., Schuller, D., Schuller, B.: Real-time robust recognition of speakers' emotions and characteristics on mobile platforms. In: 2015 ACII. pp. 778–780. IEEE (2015)
10. Feraru, S.M., Schuller, D., et al.: Cross-language acoustic emotion recognition: An overview and some tendencies. In: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). pp. 125–131. IEEE (2015)
11. Garcia-Garcia, J.M., Penichet, V.M., Lozano, M.D.: Emotion detection: a technology review. In: Proceedings of the XVIII International Conference on Human Computer Interaction. pp. 1–8 (2017)
12. Gobl, C., Chasaide, A.: The role of voice quality in communicating emotion, mood and attitude. Speech Communication 40, 189–212 (2003). https://doi.org/10.1016/S0167-6393(02)00082-1
13. Goddard, C.: Semantic Analysis: A Practical Introduction. Oxford University Press (2011)
14. Hsu, J.H., Su, M.H., Wu, C.H., Chen, Y.H.: Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 1675–1686 (2021)
15. Issa, D., Fatih Demirci, M., Yazici, A.: Speech emotion recognition with deep convolutional neural networks. BSPC 59, 101894 (2020). https://doi.org/10.1016/j.bspc.2020.101894
16. Latinus, M., Taylor, M.J.: Discriminating male and female voices: differentiating pitch and gender. Brain Topography 25(2), 194–204 (2012)
17. Lausen, A., Schacht, A.: Gender differences in the recognition of vocal emotions. Frontiers in Psychology 9, 882 (2018)
18. Livingstone, S.R., Russo, F.A.: The Ryerson audio-visual database of emotional speech and song (RAVDESS). PLoS ONE 13(5), e0196391 (2018)
19. McFee, B., Raffel, C., Liang, D., Ellis, D.P.W., McVicar, M., Battenberg, E., Nieto, O.: librosa: Audio and Music Signal Analysis in Python. In: Proceedings of the 14th Python in Science Conference. pp. 18–24 (2015). https://doi.org/10.25080/Majora-7b98e3ed-003
20. Mennen, I., Schaeffler, F., Docherty, G.: Pitching it differently: a comparison of the pitch ranges of German and English speakers. In: 16th ICPhS. pp. 1769–1772 (2007). https://doi.org/20.500.12289/42
21. Neumann, M., et al.: Cross-lingual and multilingual speech emotion recognition on English and French. In: 2018 ICASSP. pp. 5769–5773. IEEE (2018)
22. Pépiot, E.: Voice, speech and gender: Male-female acoustic differences and cross-language variation in English and French speakers. Corela (2015). https://doi.org/10.4000/corela.3783
23. Rajoo, R., Aun, C.C.: Influences of languages in speech emotion recognition: A comparative study using Malay, English and Mandarin languages. In: 2016 ISCAIE. pp. 35–39. IEEE (2016)
24. Sandler, B.R.: Women faculty at work in the classroom, or, why it still hurts to be a woman in labor. Communication Education 40(1), 6–15 (1991). https://doi.org/10.1080/03634529109378821
25. Sauter, D.A., Eisner, F., Calder, A.J., Scott, S.K.: Perceptual cues in nonverbal vocal expressions of emotion. Quarterly Journal of Experimental Psychology 63(11), 2251–2272 (2010)
26. Sauter, D.A., Eisner, F., Ekman, P., Scott, S.K.: Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sciences 107(6), 2408–2412 (2010)
27. Sauter, D.A., Eisner, F., Ekman, P., Scott, S.K.: Emotional vocalizations are recognized across cultures regardless of the valence of distractors. Psychological Science 26(3), 354–356 (2015)
28. Schirmer, A., Gunter, T.C.: Temporal signatures of processing voiceness and emotion in sound. Social Cognitive and Affective Neuroscience 12(6), 902–909 (2017)
29. Soleymani, M., Garcia, D., Jou, B., Schuller, B., Chang, S.F., Pantic, M.: A survey of multimodal sentiment analysis. Image and Vision Computing 65, 3–14 (2017)
30. Vogt, T., André, E.: Improving automatic emotion recognition from speech via gender differentiation. In: LREC. pp. 1123–1126 (2006)
31. Xu, Z., Meyer, P., Fingscheidt, T.: On the effects of speaker gender in emotion recognition training data. In: Speech Communication; 13th ITG-Symposium. pp. 1–5. VDE (2018)