Evaluation Framework for Context-aware Speaker
Recognition in Noisy Smart Living Environments
Gianni Fenu, Roberta Galici and Mirko Marras
Department of Mathematics and Computer Science, University of Cagliari, V. Ospedale 72, 09124 Cagliari, Italy


Abstract

The integration of voice control into connected devices is expected to improve the efficiency and comfort of our daily lives. However, the underlying biometric systems often impose constraints on the individual or the environment during interaction (e.g., quiet surroundings). Such constraints have to be surmounted in order to seamlessly recognize individuals. In this paper, we propose an evaluation framework for speaker recognition in noisy smart living environments. To this end, we designed a taxonomy of sounds (e.g., home-related, mechanical) that characterize representative indoor and outdoor environments where speaker recognition is adopted. Then, we devised an approach for off-line simulation of challenging noisy conditions in vocal audios originally collected in controlled environments, by leveraging our taxonomy. Our approach adds a (combination of) sound(s) belonging to the target environment into the current vocal example. Experiments on a large-scale public dataset and two state-of-the-art speaker recognition models show that adding certain background sounds to clean vocal audio leads to a substantial deterioration of recognition performance. In several noisy settings, our findings reveal that a speaker recognition model might end up making unreliable decisions. Our framework is intended to help system designers evaluate performance deterioration and develop speaker recognition models that are more robust in smart living environments.

Keywords
Deep Learning, Security, Speaker Recognition, Speaker Verification, Noisy Environments, Sound Taxonomy.


Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland. Stefan Conrad, Ilaria Tiddi (Eds.)
email: fenu@unica.it (G. Fenu); r.galici1@studenti.unica.it (R. Galici); mirko.marras@unica.it (M. Marras)
orcid: 0000-0003-4668-2476 (G. Fenu); 0000-0003-1989-6057 (M. Marras)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

Speech is a more natural way of interacting with devices than tapping screens. This form of interaction is receiving more and more attention, with voice-enabled services being used in every aspect of our lives. Speaker recognition verifies the identity of an individual before granting access to a service. Unlike speech recognition, which detects spoken words, speaker recognition inspects patterns that distinguish one person's voice from another [1]. Recognizing the identity of a speaker becomes crucial in different scenarios. For instance, voice-enabled devices (e.g., assistants, smartphones) allow home owners to turn on lights, unlock doors, and listen to music seamlessly [2]. These recognition abilities can prevent unauthorized individuals from using devices without the owner's permission and can provide the evidence needed to personalize users' experiences with these devices, even outside the domestic borders [3, 4, 5]. Moreover, speaker recognition can make the lives of older adults and people with special needs easier and safer [6]. Hence, it is imperative to study and devise data-driven speaker recognition models that can improve human quality of life.

State-of-the-art speaker recognition matchers exhibit impressive accuracy, especially when the voice quality is reasonably good [7]. For this reason, they implicitly or explicitly impose constraints on the environment, such as being stationary and quiet. Conventionally, speaker matchers are trained to classify vocal examples under idealistic conditions but are expected to operate well in challenging real-world situations. However, their performance sharply degrades when audios with substantial background sounds (e.g., traffic) are encountered. Enhancing voice data is demanding, since the related algorithms do not often explicitly attempt to preserve biometric cues in the data [8, 9, 10]. Existing robust speaker models are trained on data that do not cover various levels of interfering sounds and different sound types [11, 12]. Hence, several questions concerning how much and under which background sounds speaker recognition performance degrades, and how each type of sound impacts the mechanics of these matchers, remain unanswered.

Our study is hence organized around these directions and aims to perform an extensive performance analysis of deep speaker recognition matchers in a range of noisy living environments. To this end, we designed and collected a taxonomy of sounds (e.g., footsteps, laughing) that characterize representative living environments where speaker recognition is finding adoption. Then, we devised an approach that allows us to simulate challenging noisy conditions in
raw vocal audios by adding sounds from our taxonomy, according to the environment under consideration. Finally, we experimented with a public dataset, originally collected in controlled environments, and two state-of-the-art speaker recognition models, to inspect the impact of background sounds on their performance. Our contribution is threefold:

    • We design a taxonomy of ambient sounds tailored to speaker recognition research, and we provide a dataset of recordings with labeled sound sources for each category in our taxonomy.

    • We propose an evaluation framework for speaker recognition benchmarking, enabling easier and faster simulation of indoor and outdoor noisy environments in (clean) vocal audios¹.

    • Given a large vocal dataset, we perform an extensive analysis of the impact of the sounds in our taxonomy on the performance of two state-of-the-art speaker recognition matchers.

Our experiments showed that, even when the background sound volume is low, speaker recognition systems undergo a substantial deterioration in accuracy. Only in the case of nature-related sounds (e.g., chirping, wind) is the sound impact negligible. Certain environmental settings lead to error rates five to ten times higher than those achieved in ideal conditions.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Then, our taxonomy and the simulation framework are described in Section 3. Section 4 presents our experiments. Finally, Section 5 provides insights for future work.

¹ Code, data, pre-trained models, and documentation are publicly available at https://mirkomarras.github.io/dl-voice-noise/.


2. Related Work

Our research lies at the intersection of three perspectives, namely studies that analyze the impact of background sounds on recognition, audio enhancement algorithms aimed at improving data quality, and speaker recognition approaches that seek to classify noisy vocal data with no pre-processing.

2.1. Explorative Analysis in Noisy Environments

Explorative analyses investigate how noisy environments influence speaker recognition performance. For instance, Qian et al. [13] studied the low-level noisy optimization task by means of evolutionary algorithms. The authors found that Bitwise noise can fundamentally affect recognition patterns during evaluation and, thus, might make it harder to deploy matchers in the real world. Differently, Ko et al. [14] focused on a performance comparison between acoustic models trained with and without simulated far-field speech on a real far-field voice dataset. Their experiments showed that acoustic models trained on simulated far-field speech led to significantly lower error rates in both distant- and close-talking scenarios. In [15], the authors presented a feature learning approach, referred to as e-vector, that can capture both channel and environment variability. Recently, Vincent et al. [16] analyzed the performance of speaker recognition matchers on the CHiME3 dataset, which consists of real recordings in noisy environments. Finally, Donahue et al. [17] analyzed the benefits resulting from training a speaker recognition matcher with both clean speech data and fake speech data created by means of a generative adversarial network.

Though this research has greatly expanded our understanding, past works focused on low-level noises (e.g., Bitwise) or did not specifically control how and under which ambient sounds the performance degrades. We argue that different background sounds may lead to fundamentally different impacts and, thus, a clear understanding of the extent of this impact is lacking.

2.2. Input Audio Quality Enhancement

The existing literature includes audio enhancement algorithms that aim to provide audible improvements in a sound, without degrading the quality of the original recording. This type of strategy fits well with the forensic context, where audios may have some kind of background sound disturbance or sound artifact that may interfere with the voice of interest. Examples of audio enhancement methods are removing static noise, eliminating phone-related interference, and clearing up random sounds (e.g., dogs barking, bells ringing). For instance, Hou et al. [8] proposed a convolution-based audio-visual auto-encoder for speech enhancement through multi-task learning. In [9], the authors investigated how to improve speech/non-speech detection robustness in very noisy environments, including stationary noise and short high-energy noise. Similarly, Afouras et al. [10] proposed an audio-visual neural network able to isolate a speaker's voice of interest from other sound interference, using visual information from the target speaker's lips. However, the designed methods do not often attempt to preserve biometric cues in the data and depend on the nature of the sound, which varies according to the context.
Figure 1: Our taxonomy of sounds characterizing representative environments where speaker recognition is adopted.



Hence, our framework becomes a key asset to study recognition performance on background sounds against which countermeasures have been under-explored.

2.3. Robust Speaker Recognition

Speaker recognition matchers have traditionally relied on Gaussian mixture models [18], joint factor analysis [19], and IVectors [20]. Recently, speaker matchers have achieved impressive accuracy thanks to speaker embedding representations extracted from Deep Neural Networks trained for (one-shot) speaker classification. Notable examples include CVectors [21], XVectors [11], and VGGVox- and ResNet-Vectors [12]. Moreover, Kim et al. [22] proposed a deep noise-adaptation approach that dynamically adapts itself to the operational environment. Existing approaches in this area do not make any assumption on the training and testing data, which come from various noisy situations. Therefore, there is no fine-grained control on how these systems perform in specific noisy applicative scenarios, and the noisy situations are limited by the variety of recordings included in the considered dataset.


3. The Proposed Framework

Our framework is composed of a dataset of sounds categorized according to our pre-defined taxonomy and a toolbox that simulates background living scenarios.

3.1. Sound Taxonomy for Speaker Recognition in Living Scenarios

Collecting utterances produced in various living scenarios while keeping track of the background sounds in each utterance is challenging and time-consuming. Hence, being able to combine utterances recorded in quiet environments with the background sounds of the considered scenario represents a viable alternative. The first step to put this idea into practice consists of collecting sounds from a wide range of sources and organizing them in a hierarchical taxonomy.

Research on noisy sounds is challenging due to the lack of labeled audio data. Past works collected sounds from specific environments and resulted in commercial or private datasets. Recent contributions have provided publicly available datasets of environmental recordings [23]. On top of these collections, many studies have been carried out on sound classification [24]. Being designed for sound classification tasks, existing taxonomies cannot directly be applied or combined to simulate scenarios of speaker recognition. For instance, they often include few classes, contain sounds of marginal interest (e.g., gun shots), and are organized according to the sound type. Conversely, for our purposes, a taxonomy should be designed with situational and contextual elements in mind (e.g., grouping sounds based on the environment where they frequently appear).

To address these issues, we propose a compilation of environmental sounds from over 50 classes. The selected sound clips were constructed from recordings available in the above-mentioned urban sound taxonomies and on Freesound (https://freesound.org/), with a semi-automated pipeline. Specifically, we first identified a representative set of scenarios/environments where speaker recognition is actively used nowadays, and then we filtered out the categories of sounds included in existing taxonomies that are of marginal interest for the selected scenarios (e.g., fire engine). Then, we introduced new sound categories that help model speaker recognition scenarios whose sounds are not present in
existing sound taxonomies (e.g., dishwasher, footsteps). The included classes were selected with the goal of maintaining a balance between the major types of sounds characterizing the selected scenarios and of considering the limitations in the number and diversity of available sound recordings. Freesound was queried for common terms related to the considered scenarios. Search results were verified by annotating fragments that contain events associated with a given scenario. Sounds are grouped in two major categories pertaining to indoor and outdoor contexts³:

    • Indoor category, including sounds divided into three categories: Home (e.g., TV, washing machine), Voice (e.g., chatting, laughing), and Movement (e.g., footsteps, applause).

    • Outdoor category, with sounds divided into two categories: Nature contains different types of sounds, such as atmospheric elements (e.g., rain, wind), animal sounds (e.g., dogs, cats, birds), and sounds associated with plants and vegetation (e.g., leaves). Mechanical includes sounds produced by ventilation, motorized transports (e.g., cars, trains), non-motorized transports (e.g., bicycles), and other signals (e.g., church bells).

The collected audios were converted to a unified format (16 kHz, mono, WAV) to facilitate their processing with existing audio-programming packages. These sounds were arranged into the taxonomy in Figure 1 based on the above-mentioned considerations.

³ Our preliminary analysis considers two disjoint sets of indoor and outdoor sounds, leaving settings that cross-link sound entities within a graph-based taxonomy as future work.
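To make the resulting structure concrete, a minimal sketch of how the two-level organization in Figure 1 could be encoded in Python follows; the leaf class lists here are only illustrative examples drawn from the text above, not the full set of over 50 classes in the released dataset.

```python
# Illustrative encoding of the sound taxonomy (leaf classes are examples only;
# the released dataset may name or group its classes differently).
SOUND_TAXONOMY = {
    "Indoor": {
        "Home": ["tv", "washing_machine", "dishwasher"],
        "Voice": ["chatting", "laughing"],
        "Movement": ["footsteps", "applause"],
    },
    "Outdoor": {
        "Nature": {
            "AtmosphericElements": ["rain", "wind"],
            "Animals": ["dog", "cat", "bird"],
            "PlantsVegetation": ["leaves"],
        },
        "Mechanical": {
            "Ventilation": ["fan"],
            "MotorizedTransport": ["car", "train"],
            "NonMT": ["bicycle"],
            "SocialSignals": ["church_bell"],
        },
    },
}
```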
3.2. Toolbox for Background Living Scenario Simulation

Our taxonomy is proposed to facilitate the simulation of real-world applicative contexts in vocal audio. Thus, on top of this taxonomy, a way to combine vocal audio and background sounds is needed. To this end, we propose a Python toolbox that can simulate an applicative context in a vocal audio. Specifically, we define an applicative context as a set of one or more sound entries taken from the taxonomy. Each entry includes a string identifier associated with the sound category to include (e.g., Home or Voice), a floating-point number that specifies the volume level of that sound in the current context, and a floating-point number in [0,1] representing the probability of adding that sound to a vocal example. Given a context defined as above and a list of vocal audios where that context should be simulated, a routine changes each vocal audio by adding to it the combination of sounds included in the context definition, with their given volume and probability. For each sound category, the sound to add can be specified or randomly chosen. Our toolbox and our definition of context provide the necessary level of flexibility to simulate real-world scenarios created from all the combinations of the taxonomy's sounds.
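As an illustration of this idea (a minimal sketch, not the released toolbox API), the snippet below mixes one randomly chosen clip per sound entry into a clean utterance. It assumes all files have already been converted to 16 kHz mono WAV, and the names `simulate_context` and `clips_by_category` (a mapping from category name to candidate clip paths) are hypothetical.

```python
import random
import numpy as np
import soundfile as sf  # assumes 16 kHz mono WAV, as produced by our conversion step

# Hypothetical context: one entry per sound category (identifier, volume ratio, add probability).
context = [
    {"category": "Home", "volume": 0.5, "probability": 1.0},
    {"category": "Voice", "volume": 0.3, "probability": 0.8},
]

def simulate_context(clean_path, out_path, context, clips_by_category):
    """Add the context's background sounds to a clean utterance and save the result."""
    speech, rate = sf.read(clean_path)
    mixed = speech.copy()
    for entry in context:
        if random.random() > entry["probability"]:
            continue  # this sound category is skipped for this utterance
        clip_path = random.choice(clips_by_category[entry["category"]])
        noise, _ = sf.read(clip_path)
        # Tile or trim the background clip to match the utterance length.
        reps = int(np.ceil(len(mixed) / len(noise)))
        noise = np.tile(noise, reps)[: len(mixed)]
        mixed = mixed + entry["volume"] * noise  # the volume ratio scales the background sound
    mixed = np.clip(mixed, -1.0, 1.0)  # avoid clipping artifacts when writing back
    sf.write(out_path, mixed, rate)
```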
4. Experiments

In this section, we assess how much and under which background sounds speaker recognition performance degrades, how each background sound type impacts model mechanics, and how much the volume level of the background sounds leads to models that provide less accurate predictions. In fact, how each noisy context influences the behavior of state-of-the-art architectures, such as VGGVox and XVector, remains under-explored, since their effectiveness has often been evaluated under ideal conditions, with vocal audios whose background sounds are unlabeled, or with the same type of vocal audios (e.g., from interviews).

4.1. Seed Human Voice Dataset

Given its large scale and its wide adoption in the literature, we simulated applicative contexts on the vocal data belonging to the VoxCeleb-1 dataset [12]. This collection consists of short utterances taken from video interviews published on YouTube, including speakers from a wide range of different ethnicities, accents, professions and ages, fairly balanced with respect to gender (i.e., 55% male). The dataset is split into development and test sets having disjoint speakers. The development set has 1,211 speakers and 143,768 utterances, while the test set consists of 40 speakers and 4,874 utterances. Our study leveraged the trial pairs provided by the authors together with the VoxCeleb-1 data⁴.

⁴ Due to the large amount of comparisons needed to simulate all contexts, our study focused on 1,000 out of the 37,702 VoxCeleb-1 trial pairs and leaves the extension to the larger VoxCeleb-2 as future work. Here, we are more interested in understanding matchers' robustness against background sounds, so the accuracy gains with larger datasets would not substantially affect the findings of our analysis.

4.2. Benchmarked Models

Our analysis benchmarks two state-of-the-art speaker recognition architectures: VGGVox [12] and XVector [11]. They have received great attention in recent years, and this motivated us to deepen our study of their robustness in noisy
environments. VGGVox is based on the VGG-M Convolutional Neural Network, with modifications to adapt to the audio spectrogram input. The last fully-connected layer is replaced by two layers: a fully-connected layer with support in the frequency domain and an average pooling layer with support in the time domain. XVector is a Time Delay Neural Network, which allows neurons to receive signals spanning multiple frames. Given a filterbank, the first five layers operate on speech frames, with a small temporal context centered at the current frame. Then, a pooling layer aggregates the frame-level outputs and computes their mean and standard deviation. Finally, two fully-connected layers aggregate these statistics across the time dimension.
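To make the statistics-pooling step concrete, a minimal NumPy sketch (not the actual XVector implementation; the dimensions are illustrative) that collapses a sequence of frame-level activations into a single utterance-level vector:

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Collapse frame-level activations (num_frames x num_features) into one
    utterance-level vector by concatenating the per-feature mean and standard
    deviation, mirroring the role of XVector's pooling layer."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])

# Example: 300 frames of 512-dimensional activations -> one 1024-dimensional vector.
pooled = statistics_pooling(np.random.randn(300, 512))
```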
4.3. Model Training and Testing Details

The code, implemented in Python, ran on an NVIDIA GPU. The audios were converted to single-channel, 16-bit streams at a 16 kHz sampling rate. We used 512-point Fast Fourier Transforms. VGGVox received spectrograms of size 512x300, while XVector received filterbanks of size 300x24. Both representations were generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms, and normalized by subtracting the mean and dividing by the standard deviation of all frequency components. Each model was trained for classification on the VoxCeleb-1 development set using Softmax, with batches of size 64. To keep consistency with the original implementations of VGGVox and XVector, we used the Adam optimizer, with an initial learning rate of 0.001, decreased by a factor of 10 every 10 epochs, until convergence. For testing, we considered speaker embeddings of size 512. The choice of the architectural parameters was driven by the original model implementations, without any specific adaptation, given that we are interested in benchmarking the original models in noisy environments rather than tuning the parameters to align with our goals.
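A rough sketch of the acoustic front-end described above, assuming librosa as the signal-processing backend (an assumption; the released code may use different utilities, and the exact tensor shapes fed to the networks, 512x300 and 300x24, depend on implementation details not reproduced here). At 16 kHz, the 25 ms window and 10 ms step correspond to 400 and 160 samples, and about 3 seconds of speech yield roughly 300 frames.

```python
import numpy as np
import librosa

def spectrogram_features(wav_path, n_fft=512, win=400, hop=160):
    """Magnitude spectrogram for VGGVox-style input (frequency bins x frames),
    mean/variance-normalized per frequency component."""
    y, _ = librosa.load(wav_path, sr=16000, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, win_length=win,
                               hop_length=hop, window="hamming"))
    return (spec - spec.mean(axis=1, keepdims=True)) / (spec.std(axis=1, keepdims=True) + 1e-8)

def filterbank_features(wav_path, n_mels=24, n_fft=512, win=400, hop=160):
    """24-band log filterbank for XVector-style input (frames x 24 bands),
    normalized in the same way."""
    y, _ = librosa.load(wav_path, sr=16000, mono=True)
    fbank = librosa.feature.melspectrogram(y=y, sr=16000, n_fft=n_fft,
                                           win_length=win, hop_length=hop,
                                           window="hamming", n_mels=n_mels)
    logfb = np.log(fbank + 1e-8).T  # transpose to frames x bands
    return (logfb - logfb.mean(axis=0, keepdims=True)) / (logfb.std(axis=0, keepdims=True) + 1e-8)
```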
4.4. Speaker Recognition Protocols

Given a pretrained model, a set of trial verification pairs, and a target context, the protocol worked as follows. For each trial pair in the set, we assumed that the first audio represented the enrolled utterance, ideally collected in a controlled environment, while the second audio was the probe provided in the target context. Hence, the first audio remained unchanged. The second audio was changed by adding sounds that characterize the target context, as explained in Section 3.2. For both the enrolled and the changed audios, the acoustic representations were extracted and fed into the pretrained model to get the speaker embeddings. The cosine similarity between the speaker embeddings was calculated. This process was repeated for each trial pair in the set. Finally, given the resulting similarity scores and the verification labels (i.e., 0 for different-user pairs, 1 for same-user pairs), the Equal Error Rate (EER) under that context was computed. The entire protocol was repeated with different background sound volume levels, treated as ratios of the original volume, taken from [0, 0.05, 0.10, 0.20, 0.30, 0.50, 1, 1.5] (e.g., 1 means that the original volume is kept, 0.5 means that the volume of the background sound is reduced by 50%).

This protocol was carried out on 25 contexts composed of either single categories of the third and fourth levels of our taxonomy or their combinations.
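A compact sketch of the scoring step under an assumed list of trial pairs (embedding extraction omitted); scikit-learn's ROC curve is used here purely for illustration, with the EER located where the false acceptance and false rejection rates coincide.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(labels, scores):
    """EER given verification labels (1 = same-user pair, 0 = different-user pair)
    and the corresponding similarity scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical usage, assuming enrolled/probe embeddings from the pretrained model:
# scores = [cosine_similarity(e, p) for e, p in zip(embeddings_enrolled, embeddings_probe)]
# eer = equal_error_rate(trial_labels, scores)
```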
4.5. Experimental Results

Given the considered taxonomy contexts, the VoxCeleb-1 trial pairs, and the two pre-trained speaker recognition models, we followed the protocol in Section 4.4 to calculate the EERs at various background sound volumes.

Indoor. Tables 1 and 2 report the EERs under various combinations of indoor sounds. It can be observed that Voice is the individual sound category that leads to the highest degradation in performance, with 27-30% EER at the 1.5 volume ratio. Home and Movement showed a similar impact when the volume level was below 1.0, while the former brought more negative effects with volume ratios higher than 1.0. When two or more sound categories were combined, EERs easily rose above 15%, especially in scenarios where both Home and Voice sounds were present. XVector's performance substantially decreased as soon as sounds were added, while VGGVox showed more robust behavior against background sounds. It might be that the changes introduced by the background sound in the spectrograms fed into VGGVox have a lower influence on the recognition patterns learnt by the convolutional layers during training. On the other hand, with XVector, the temporal context employed at each layer of the network might be highly influenced by the changes introduced into the filterbanks through the background sound addition.
              Sound Combination                                      Background sound volume ratio
                                                             0.05    0.10    0.20     0.30  0.50    1.00    1.50
             Home                                            4.60    5.80    7.80    10.00 12.50   19.30   26.40
             Movement                                        5.00    6.40    7.90    12.60 14.20   16.70   20.50
             Voice                                           8.10   12.20   17.30    18.20 19.40   24.20   27.90
             Home-Movement                                   5.00    6.50   13.00    15.90 23.20   30.00   32.20
             Home-Voice                                     10.10   14.00   20.00    23.10 26.00   34.60   39.10
             Movement-Voice                                 11.00   15.70   14.60    17.00 18.80   23.90   29.40
             Home-Movement-Voice                            14.00   17.00   22.90    26.90 31.80   39.30   42.20

Table 1
VGGVox - Indoor Scenario. EERs achieved by VGGVox under an Indoor Scenario. Bold values show the highest EER at
each volume level. VGGVox led to an EER of 2.20% when no sounds were added to the vocal file.
              Sound Combination                                      Background sound volume ratio
                                                             0.05    0.10    0.20     0.30  0.50    1.00    1.50
             Home                                           10.49   13.90   19.59    20.99 26.30   32.19   36.40
             Movement                                       12.60   17.80   23.70    26.00 30.70   28.50   30.30
             Voice                                          15.40   18.10   20.59    24.20 25.00   32.49   30.60
             Home-Movement                                  13.50   30.30   36.19    36.70 39.60   37.60   49.09
             Home-Voice                                     17.10   24.09   26.90    29.70 35.60   41.10   40.60
             Movement-Voice                                 21.79   27.90   33.59    34.19 37.90   36.80   39.80
             Home-Movement-Voice                            21.59   30.30   36.19    36.70 39.60   44.20   46.09

Table 2
XVector - Indoor Scenario. EERs achieved by XVector under an Indoor Scenario. Bold values show the highest EER at each
volume level. XVector led to an EER of 6.35% when no sounds were added to the vocal files.
              Sound Combination                                      Background sound volume ratio
                                                            0.05     0.10    0.20     0.30  0.50    1.00    1.50
             Animals                                        4.40     5.70    8.20    10.60 15.50   22.70   32.50
             AtmosphericElements                            5.80     8.70   14.30    18.10 25.70   30.80   40.00
             PlantsVegetation                               3.10     3.00    3.00     3.50  3.90    6.60    8.30
             Animals-AtmosphericElements                    7.50    11.20   18.60   24.60  31.00   41.30   48.60
             Animals-PlantsVegetation                       3.70     6.00    8.40    11.90 16.60   27.40   35.00
             AtmosphericElements-PlantsVegetation           5.60     9.00   15.00    17.80 24.20   31.90   41.40
             Animals-AtmosphericElements-PlantsVegetation   7.00    11.60   18.30    23.80 31.80   38.60   43.70

Table 3
VGGVox - Nature Outdoor Scenario. EERs achieved by VGGVox under a Nature Outdoor Scenario. Bold values show the
highest EER at each volume level. VGGVox led to 2.20% of EER when no sounds were added to the vocal files.

              Sound Combination                                      Background sound volume ratio
                                                             0.05    0.10    0.20     0.30  0.50    1.00    1.50
             Animals                                         7.09   10.39   15.60    16.40 20.90   28.20   30.80
             AtmosphericElements                            13.20   19.29   28.90    31.59 34.50   42.00   43.90
             PlantsVegetation                                5.50    5.50    7.79     8.20  9.89   12.90   17.30
             Animals-AtmosphericElements                    17.00   22.19   30.79    35.90 38.60   43.20   44.69
             Animals-PlantsVegetation                        9.29    9.79   17.10    19.70 24.50   31.30   35.00
             AtmosphericElements-PlantsVegetation           14.10   21.69   26.00    32.69 35.60   43.80   45.69
             Animals-AtmosphericElements-PlantsVegetation   15.20   22.79   31.70    35.09 38.90   46.19   47.59

Table 4
XVector - Nature Outdoor Scenario. EERs achieved by XVector under a Nature Outdoor Scenario. Bold values show the
highest EER at each volume level. XVector led to an EER of 6.35% when no sounds were added to the audio files.



Nature Outdoor. Tables 3 and 4 report the EERs obtained when nature-related sounds were added. These sounds showed degradation patterns different from each other and from the indoor-related sounds. It can be observed that the models were robust against PlantsVegetation at any volume. Conversely, sounds coming from the AtmosphericElements category led to the worst EERs, with 40-43% EER at the highest volume level. The models suffered from the combination of Animals and AtmosphericElements sounds (44-48% EER at a volume ratio of 1.5). Compared with the indoor scenarios, VGGVox and XVector showed similar degradation patterns here. This behavior might be explained by the intrinsic properties and characteristics of the nature sounds included in our taxonomy, which are shorter and less deafening.
              Sound Combination                                             Background sound volume ratio
                                                                   0.05    0.10    0.20     0.30  0.50    1.00    1.50
             MotorizedTransport                                    3.10    4.50    8.70    10.80 16.80   26.10   35.10
             NonMT                                                28.00   27.00   28.40    28.90 25.70   30.10   30.60
             SocialSignals                                         7.00    7.50    9.40    11.00 11.60   16.10   20.20
             Ventilation                                          20.30   20.10   20.90    22.20 25.70   29.80   32.30
             MotorizedTransport-NonMT                             26.60   27.40   29.70    32.70 31.70   39.30   44.90
             MotorizedTransport-SocialSignals                      8.10    9.20   14.00    16.70 22.60   32.10   38.00
             MotorizedTransport-Ventilation                       20.60   22.70   22.70    25.60 30.70   37.40   44.10
             NonMT-SocialSignals                                  30.20   30.10   29.40    28.60 30.00   36.00   37.80
             NonMT-Ventilation                                    35.60   38.00   36.10    39.00 40.50   41.60   43.60
             SocialSignals-Ventilation                            21.40   19.90   25.20    27.00 29.90   35.50   40.20
             MotorizedTransport-NonMT-SocialSignals-Ventilation   37.60   36.30   38.50    38.30 42.40   10.00   48.50

Table 5
VGGVox - Mechanical Outdoor Scenario. EERs achieved by VGGVox under a Mechanical Outdoor Scenario. Bold values
show the highest EER at each volume level. VGGVox led to an EER of 2.20% when no sounds were added to the audio files.

              Sound Combination                                             Background sound volume ratio
                                                                   0.05    0.10    0.20     0.30  0.50    1.00    1.50
             MotorizedTransport                                    9.30   13.30   20.99    27.20 32.49   40.00   42.10
             NonMT                                                47.50   48.00   53.50    46.90 53.30   50.60   44.39
             SocialSignals                                        10.20    8.69   13.80    14.50 19.40   24.70   29.90
             Ventilation                                          33.59   32.30   34.30    34.80 39.70   44.90   49.80
             MotorizedTransport-NonMT                             49.80   48.69   45.30    51.70 47.59   48.40   50.40
             MotorizedTransport-SocialSignals                     11.79   19.49   22.49    29.50 35.80   43.99   45.69
             MotorizedTransport-Ventilation                       31.79   32.49   36.90    42.00 43.30   49.10   47.50
             NonMT-SocialSignals                                  49.90   50.50   51.60    49.40 49.20   49.40   48.00
             NonMT-Ventilation                                    52.40   49.50   50.10    49.90 48.30   51.30   51.60
             SocialSignals-Ventilation                            39.70   38.70   38.40    42.20 40.50   44.69   50.30
             MotorizedTransport-NonMT-SocialSignals-Ventilation   50.40   49.90   51.50    48.40 49.40   49.70   49.70

Table 6
XVector - Mechanical Outdoor Scenario. EERs achieved by XVector under a Mechanical Outdoor Scenario. Bold values
show the highest EER at each volume level. XVector led to an EER of 6.35% when no sounds were added to the audio files.



Mechanical Outdoor. Tables 5 and 6 show that mechanical outdoor sounds led to substantial negative impacts on model performance, except in the case of SocialSignals sounds, compared with the indoor and nature outdoor settings. It is important to notice that, even at low volume levels, Ventilation sounds caused substantial degradation in EERs, and this effect was amplified when two or more sound categories were combined. Among the most degraded settings, combining NonMT and Ventilation led to EERs of 35-50% even at a volume ratio of 0.1. Outdoor sounds coming from the Nature and Mechanical categories seemed to lead to more overlapping decision boundaries than indoor sounds. For instance, while being composed of different combinations of sound types, both the MotorizedTransport-NonMT and NonMT-SocialSignals settings showed similar EERs at volume levels higher than 0.4. It follows that mixing outdoor sounds can hamper speaker recognition more, and each type of outdoor sound significantly impacts model effectiveness. Similarly to the indoor scenario, VGGVox was more robust than XVector, possibly due to its depth in terms of layers.

Based on our results, under the considered settings, speaker recognition matchers do not appear adequately reliable. The impact of background sounds on performance depends on the context and the sound.


5. Conclusions and Future Work

In this paper, we proposed a taxonomy of labeled background sound recordings for speaker recognition research in noisy environments. Then, we devised a simulation framework for indoor and outdoor contexts in vocal audios. Finally, we assessed the impact of the taxonomy sounds on the performance of two speaker recognition models. Based on the results, indoor sounds have a lower impact than outdoor sounds, and outdoor scenarios that involve mechanical sounds are the most challenging, even at low background sound volumes.

Our work opens up a wide range of research directions. We plan to enrich the taxonomy with more categories and audios organized into an ontological representation. We will extend our analysis to other models (e.g., ResNet) and languages beyond English. We will also inspect how background sounds and the respective scenarios affect the internal model dynamics (e.g., speaker embeddings). Naturally, we will leverage our framework to devise audio enhancement methods able to deal with the sounds of our taxonomy and to design novel approaches for more robust speaker recognition.
Acknowledgments

This work has been partially supported by the Sardinian Regional Government, POR FESR 2014-2020 - Axis 1, Action 1.1.3, under the project "SPRINT" (D.D. n. 2017 REA, 26/11/2018, CUP F21G18000240009).


References

 [1] J. H. Hansen, T. Hasan, Speaker recognition by machines and humans: A tutorial review, IEEE Signal Processing Magazine 32 (2015) 74–99.
 [2] A. S. Tulshan, S. N. Dhage, Survey on virtual assistant: Google Assistant, Siri, Cortana, Alexa, in: Intern. Sym. on Signal Processing and Intelligent Recognition Systems, 2018, pp. 190–201.
 [3] H. Feng, K. Fawaz, K. G. Shin, Continuous authentication for voice assistants, in: Proc. of the Annual International Conference on Mobile Computing and Networking, 2017, pp. 343–355.
 [4] M. Schmidt, P. Braunger, A survey on different means of personalized dialog output for an adaptive personal assistant, in: Adjunct Publication of the Conference on User Modeling, Adaptation and Personalization, 2018, pp. 75–81.
 [5] M. Marras, P. Korus, N. D. Memon, G. Fenu, Adversarial optimization for dictionary attacks on speaker verification, in: Proc. of the Annual Conference of the International Speech Communication Association, ISCA, 2019, pp. 2913–2917.
 [6] A. Pradhan, K. Mehta, L. Findlater, Accessibility came by accident: Use of voice-controlled intelligent personal assistants by people with disabilities, in: Proc. of the CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–13.
 [7] S. O. Sadjadi, T. Kheyrkhah, A. Tong, C. S. Greenberg, D. A. Reynolds, E. Singer, L. P. Mason, The 2016 NIST speaker recognition evaluation, in: Interspeech, 2017, pp. 1353–1357.
 [8] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2018) 117–128.
 [9] A. Martin, L. Mauuary, Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments, Speech Communication 48 (2006) 191–206.
[10] T. Afouras, J. S. Chung, A. Zisserman, The conversation: Deep audio-visual speech enhancement, arXiv preprint arXiv:1804.04121 (2018).
[11] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, in: Interspeech, 2017, pp. 999–1003.
[12] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, VoxCeleb: Large-scale speaker verification in the wild, Computer Speech & Language 60 (2020).
[13] C. Qian, Y. Yu, Z.-H. Zhou, Analyzing evolutionary optimization in noisy environments, Evolutionary Computation 26 (2018) 1–41.
[14] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, A study on data augmentation of reverberant speech for robust speech recognition, in: 2017 IEEE Inter. Conference on Acoustics, Speech and Signal Processing, IEEE, 2017, pp. 5220–5224.
[15] X. Feng, B. Richardson, S. Amman, J. R. Glass, An environmental feature representation for robust speech recognition and for environment identification, in: INTERSPEECH, 2017, pp. 3078–3082.
[16] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, R. Marxer, An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech & Language 46 (2017) 535–557.
[17] C. Donahue, B. Li, R. Prabhavalkar, Exploring speech enhancement with generative adversarial networks for robust speech recognition, in: IEEE Inter. Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5024–5028.
[18] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000) 19–41.
[19] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. on Audio, Speech, and Language Processing 19 (2011) 788–798.
[20] A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, M. W. Mason, I-vector based speaker recognition on short utterances, in: Proc. Interspeech 2011, 2011, pp. 2341–2344.
[21] Y.-h. Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, C. Parada, Locally-connected and convolutional neural networks for small footprint speaker recognition, in: Proc. Interspeech 2015, 2015, pp. 1136–1140.
[22] S. Kim, B. Raj, I. Lane, Environmental noise embeddings for robust speech recognition, arXiv preprint arXiv:1601.02553 (2016).
[23] J. Salamon, C. Jacoby, J. P. Bello, A dataset and taxonomy for urban sound, in: Proc. of ACM Intern. Conf. on Multimedia, 2014, pp. 1041–1044.
[24] A. Brown, J. Kang, T. Gjestland, Towards standardization in soundscape preference assessment, Applied Acoustics 72 (2011) 387–392.