=Paper=
{{Paper
|id=Vol-2699/paper27
|storemode=property
|title=Evaluation Framework for Context-aware Speaker Recognition in Noisy Smart Living Environments
|pdfUrl=https://ceur-ws.org/Vol-2699/paper27.pdf
|volume=Vol-2699
|authors=Gianni Fenu,Roberta Galici,Mirko Marras
|dblpUrl=https://dblp.org/rec/conf/cikm/FenuGM20
}}
==Evaluation Framework for Context-aware Speaker Recognition in Noisy Smart Living Environments==
Gianni Fenu, Roberta Galici and Mirko Marras
Department of Mathematics and Computer Science, University of Cagliari, V. Ospedale 72, 09124 Cagliari, Italy
Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland. Stefan Conrad, Ilaria Tiddi (Eds.)
Email: fenu@unica.it (G. Fenu); r.galici1@studenti.unica.it (R. Galici); mirko.marras@unica.it (M. Marras)
ORCID: 0000-0003-4668-2476 (G. Fenu); 0000-0003-1989-6057 (M. Marras)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The integration of voice control into connected devices is expected to improve the efficiency and comfort of our daily lives. However, the underlying biometric systems often impose constraints on the individual or the environment during interaction (e.g., quiet surroundings). Such constraints have to be surmounted in order to seamlessly recognize individuals. In this paper, we propose an evaluation framework for speaker recognition in noisy smart living environments. To this end, we designed a taxonomy of sounds (e.g., home-related, mechanical) that characterize representative indoor and outdoor environments where speaker recognition is adopted. Then, we devised an approach for off-line simulation of challenging noisy conditions in vocal audios originally collected under controlled environments, by leveraging our taxonomy. Our approach adds a (combination of) sound(s) belonging to the target environment into the current vocal example. Experiments on a large-scale public dataset and two state-of-the-art speaker recognition models show that adding certain background sounds to clean vocal audio leads to a substantial deterioration of recognition performance. In several noisy settings, our findings reveal that a speaker recognition model might end up making unreliable decisions. Our framework is intended to help system designers evaluate performance deterioration and develop speaker recognition models more robust to smart living environments.

Keywords
Deep Learning, Security, Speaker Recognition, Speaker Verification, Noisy Environments, Sound Taxonomy.

1. Introduction

Speech is a more natural way of interacting with devices than tapping screens. This form of interaction is receiving more and more attention, with voice-enabled services being used in every aspect of our lives. Speaker recognition analyzes the identity of an individual before granting access to a service. Unlike speech recognition, which detects spoken words, speaker recognition inspects patterns that distinguish one person's voice from another [1]. Recognizing the identity of a speaker becomes crucial in different scenarios. For instance, voice-enabled devices (e.g., assistants, smartphones) allow home owners to turn on lights, unlock doors, and listen to music seamlessly [2]. These recognition abilities can prevent unauthorized individuals from using devices without the owner's permission and can provide the evidence needed to personalize users' experiences with these devices, even outside the domestic borders [3, 4, 5]. Moreover, speaker recognition can make the lives of older adults and people with special needs easier and safer [6]. Hence, it is imperative to study and devise data-driven speaker recognition models that can improve human quality of life.

State-of-the-art speaker recognition matchers exhibit impressive accuracy, especially when the voice quality is reasonably good [7]. For this reason, they implicitly or explicitly impose constraints on the environment, such as being stationary and quiet. Conventionally, speaker matchers are trained to classify vocal examples under idealistic conditions but are expected to operate well in real-world challenging situations. However, their performance sharply degrades when audios with substantial background sounds (e.g., traffic) are encountered. Enhancing voice data is demanding, since the related algorithms do not often explicitly attempt to preserve biometric cues in the data [8, 9, 10]. Existing robust speaker models are trained on data which do not cover various levels of interfering sounds and different sound types [11, 12]. Hence, several questions concerning how much and under which background sounds speaker recognition performance degrades, and how each type of sound impacts the mechanics of these matchers, remain unanswered.
Our study in this paper is hence organized around these directions and aims to perform an extensive performance analysis of deep speaker recognition matchers in a range of noisy living environments. To this end, we designed and collected a taxonomy of sounds (e.g., footsteps, laughing) that characterize representative living ambients where speaker recognition is finding adoption. Then, we devised an approach that allows us to simulate challenging noisy conditions in raw vocal audios by adding sounds of our taxonomy, according to the environment under consideration. Finally, we experimented with a public dataset, originally collected in controlled environments, and two state-of-the-art speaker recognition models, to inspect the impact of background noisy sounds on their performance. Our contribution is threefold:

• We design a taxonomy of ambient sounds tailored to speaker recognition research, and we provide a dataset of recordings with labeled sound sources for each category in our taxonomy.

• We propose an evaluation framework for speaker recognition benchmarking, enabling easier and faster simulation of indoor and outdoor noisy environments in (clean) vocal audios. Code, data, pre-trained models, and documentation are publicly available at https://mirkomarras.github.io/dl-voice-noise/.

• Given a large vocal dataset, we perform an extensive analysis of the impact of the sounds in our taxonomy on the performance of two state-of-the-art speaker recognition matchers.

Our experiments showed that, even when the background sound volume is low, speaker recognition systems undergo a substantial deterioration of accuracy. Only in the case of nature-related sounds (e.g., chirping, wind) is the sound impact negligible. Certain environmental settings lead to error rates five to ten times higher than the error rates achieved in ideal conditions.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Then, our taxonomy and the simulation framework are described in Section 3. Section 4 presents our experiments. Finally, Section 5 provides insights for future work.

2. Related Work

Our research lies at the intersection of three perspectives, namely studies which analyze the impact of background sounds on recognition, audio enhancement algorithms aimed at improving data quality, and speaker recognition approaches which seek to classify noisy vocal data with no pre-processing.

2.1. Explorative Analysis in Noisy Environments

Explorative analyses investigate how noisy environments influence speaker recognition performance. For instance, Qian et al. [13] studied the low-level noisy optimization task by means of evolutionary algorithms. The authors found that a bitwise noise can fundamentally affect recognition patterns during evaluation and, thus, might make it harder to deploy matchers in the real world. Differently, Ko et al. [14] focused on a performance comparison between acoustic models trained with and without simulated far-field speech on a real far-field voice dataset. Their experiments showed that acoustic models trained on simulated far-field speech led to significantly lower error rates in both distant- and close-talking scenarios. In [15], the authors presented a feature learning approach, referred to as e-vector, that can capture both channel and environment variability. Recently, Vincent et al. [16] analyzed the performance of speaker recognition matchers on the CHiME3 dataset, which consists of real recordings in noisy environments. Finally, Donahue et al. [17] analyzed the benefits resulting from training a speaker recognition matcher with both clean speech data and fake speech data created by means of a generative adversarial network.

Though this research has greatly expanded our understanding, past works focused on low-level noises (e.g., bitwise) or did not specifically control how and under which ambient sounds the performance degrades. We argue that different background sounds may lead to fundamentally different impacts and, thus, a clear understanding of the extent of this impact is lacking.
2.2. Input Audio Quality Enhancement

Existing literature includes audio enhancement algorithms that aim to provide audible improvements in a sound without degrading the quality of the original recording. This type of strategy fits well with the forensic context, where audios may have some kind of background sound disturbance or sound artifact that may interfere with the voice of interest. Examples of audio enhancement methods are removing static noise, eliminating phone-related interference, and clearing up random sounds (e.g., dogs barking, bells ringing). For instance, Hou et al. [8] proposed a convolution-based audio-visual auto-encoder for speech enhancement through multi-task learning. In [9], the authors investigated how to improve speech/non-speech detection robustness in very noisy environments, including stationary noise and short high-energy noise. Similarly, Afouras et al. [10] proposed an audio-visual neural network able to isolate a speaker's voice of interest from other sound interference, using visual information from the target speaker's lips. However, the designed methods do not often attempt to preserve biometric cues in the data and depend on the nature of the sound, which varies according to the context. Hence, our framework becomes a key asset to study recognition performance on background sounds against which countermeasures have been under-explored.

Figure 1: Our taxonomy of sounds characterizing representative environments where speaker recognition is adopted.
2.3. Robust Speaker Recognition

Speaker recognition matchers have traditionally relied on Gaussian mixture models [18], joint factor analysis [19], and i-vectors [20]. Recently, speaker matchers achieved impressive accuracy thanks to speaker embedding representations extracted from Deep Neural Networks trained for (one-shot) speaker classification. Notable examples include CVectors [21], XVectors [11], and VGGVox- and ResNet-Vectors [12]. Moreover, Kim et al. [22] proposed a deep noise-adaptation approach that dynamically adapts itself to the operational environment. Existing approaches in this area do not make any assumption on the training and testing data, which come from various noisy situations. Therefore, there is no fine-grained control on how these systems perform in specific noisy applicative scenarios, and the noisy situations are limited by the variety of recordings included in the considered dataset.

3. The Proposed Framework

Our framework is composed of a dataset of sounds categorized according to our pre-defined taxonomy and a toolbox that simulates background living scenarios.

3.1. Sound Taxonomy for Speaker Recognition in Living Scenarios

Collecting utterances produced in various living scenarios while keeping track of the background sounds in each utterance is challenging and time-consuming. Hence, being able to combine utterances recorded in quiet environments with the background sounds of the considered scenario represents a viable alternative. The first step to put this idea into practice consists of collecting sounds from a wide range of sources and organizing them in a hierarchical taxonomy.

Research on noisy sounds is challenging due to the lack of labeled audio data. Past works collected sounds from specific environments and resulted in commercial or private datasets. Recent contributions have provided publicly available datasets of environmental recordings [23]. On top of these collections, many studies have been carried out on sound classification [24]. Being designed for sound classification tasks, existing taxonomies cannot directly be applied nor combined to simulate scenarios of speaker recognition. For instance, they often include only a few classes, contain sounds of marginal interest (e.g., gun shots), and are organized according to the sound type. Conversely, for our purposes, a taxonomy should be designed with situational and contextual elements in mind (e.g., grouping sounds based on the ambient where they frequently appear).

To address these issues, we propose a compilation of environmental sounds from over 50 classes. The selected sound clips were constructed from recordings available in the above-mentioned urban sound taxonomies and on Freesound (https://freesound.org/), with a semi-automated pipeline. Specifically, we first identified a representative set of scenarios/environments where speaker recognition is actively used nowadays, and then we filtered out the categories of sounds included in existing taxonomies that are of marginal interest for the selected scenarios (e.g., fire engine). Then, we introduced new sound categories that help to model speaker recognition scenarios whose sounds are not present in existing sound taxonomies (e.g., dishwasher, footsteps). The included classes were selected with the goal of maintaining balance between the major types of sounds characterizing the selected scenarios and of considering the limitations in the number and diversity of available sound recordings. Freesound was queried for common terms related to the considered scenarios, and search results were verified by annotating fragments that contain events associated with a given scenario.

Sounds are grouped in two major categories pertaining to indoor and outdoor contexts:

• Indoor, including sounds divided into three categories: Home (e.g., TV, washing machine), Voice (e.g., chatting, laughing) and Movement (e.g., footsteps, applause).

• Outdoor, with sounds divided into two categories: Nature contains different types of sounds, such as atmospheric elements (e.g., rain, wind), animal sounds (e.g., dogs, cats, birds), and sounds associated with plants and vegetation (e.g., leaves). Mechanical includes sounds produced by ventilation, motorized transports (e.g., cars, trains), non-motorized transports (e.g., bicycles), and other signals (e.g., church bells).

Our preliminary analysis considers two disjoint sets of indoor and outdoor sounds, leaving settings that cross-link sound entities within a graph-based taxonomy as future work. The collected audios were converted to a unified format (16 kHz, mono, wav) to facilitate their processing with existing audio-programming packages. These sounds were arranged into the taxonomy in Figure 1 based on the above-mentioned considerations.
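For illustration, the grouping above can be captured programmatically. The snippet below is a minimal sketch, not the released toolbox: the dictionary mirrors the categories of Figure 1, and the conversion helper shows one possible way to obtain the unified 16 kHz mono wav format using librosa and soundfile (library choice, names, and paths are our assumptions).

```python
import librosa
import soundfile as sf

# Sketch of the taxonomy in Figure 1 as a nested mapping:
# top-level context -> category -> example sound classes.
TAXONOMY = {
    "Indoor": {
        "Home": ["tv", "washing_machine"],
        "Voice": ["chatting", "laughing"],
        "Movement": ["footsteps", "applause"],
    },
    "Outdoor": {
        "Nature": {
            "AtmosphericElements": ["rain", "wind"],
            "Animals": ["dog", "cat", "bird"],
            "PlantsVegetation": ["leaves"],
        },
        "Mechanical": {
            "Ventilation": ["fan"],
            "MotorizedTransport": ["car", "train"],
            "NonMT": ["bicycle"],
            "SocialSignals": ["church_bells"],
        },
    },
}

def to_unified_format(in_path: str, out_path: str, sr: int = 16000) -> None:
    """Resample a recording to 16 kHz mono and store it as wav."""
    audio, _ = librosa.load(in_path, sr=sr, mono=True)
    sf.write(out_path, audio, sr)
```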
3.2. Toolbox for Background Living Scenario Simulation

Our taxonomy is proposed to facilitate the simulation of real-world applicative contexts in vocal audio. Thus, on top of this taxonomy, a way to combine vocal audio and background sounds is needed. To this end, we propose a Python toolbox that can simulate an applicative context into a vocal audio. Specifically, we define an applicative context as a set of one or more sound entries taken from the taxonomy. Each entry includes a string identifier associated with the sound category to include (e.g., Home or Voice), a floating-point number that specifies the volume level of that sound in the current context, and a floating-point number in [0, 1] representing the probability of adding that sound into a vocal example. Given a context defined as above and a list of vocal audios where that context should be simulated, a routine changes each vocal audio by adding to it the combination of sounds included in the context definition, with their given volume and probability. For each sound category, the sound to add can be specified or randomly chosen. Our toolbox and our definition of context provide the necessary level of flexibility to simulate real-world scenarios created from all the combinations of the taxonomy's sounds.
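To make the context definition concrete, the sketch below shows one plausible way such a routine could look. It is only an illustration under our own assumptions: the function name, the sounds/<category>/*.wav directory layout, and the additive mixing with peak clipping protection are hypothetical and are not taken from the released toolbox.

```python
import random
from pathlib import Path

import librosa
import numpy as np

# An applicative context: one entry per sound category,
# each with a volume ratio and an inclusion probability.
CONTEXT = [
    {"category": "Home", "volume": 0.5, "probability": 1.0},
    {"category": "Voice", "volume": 0.3, "probability": 0.7},
]

def simulate_context(vocal_path: str, context: list,
                     sound_dir: str = "sounds", sr: int = 16000) -> np.ndarray:
    """Add the background sounds of a context to a clean vocal example."""
    voice, _ = librosa.load(vocal_path, sr=sr, mono=True)
    mixed = voice.copy()
    for entry in context:
        if random.random() > entry["probability"]:
            continue  # this sound is skipped for the current example
        # Pick a random recording from the category folder.
        candidates = sorted(Path(sound_dir, entry["category"]).glob("*.wav"))
        if not candidates:
            continue
        background, _ = librosa.load(str(random.choice(candidates)), sr=sr, mono=True)
        # Tile or trim the background so it covers the whole utterance.
        reps = int(np.ceil(len(mixed) / len(background)))
        background = np.tile(background, reps)[: len(mixed)]
        # Scale the background by the requested volume ratio and add it.
        mixed = mixed + entry["volume"] * background
    # Avoid clipping when the result is written back to 16-bit audio.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```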
4. Experiments

In this section, we assess how much and under which background sounds speaker recognition performance degrades, how each background sound type impacts model mechanics, and how much the volume level of the background sounds leads to models which provide less accurate predictions. In fact, how each noisy context influences the behavior of state-of-the-art architectures, such as VGGVox and XVector, still remains under-explored, since their effectiveness has often been evaluated under ideal conditions, with vocal audios whose background sounds are unlabelled, or with a single type of vocal audio (e.g., from interviews).

4.1. Seed Human Voice Dataset

Given its large scale and its wide adoption in the literature, we simulated applicative contexts into the vocal data belonging to the VoxCeleb-1 dataset [12]. This collection consists of short utterances taken from video interviews published on Youtube, including speakers from a wide range of different ethnicities, accents, professions and ages, fairly balanced with respect to their gender (i.e., 55% of males). The dataset is split into development and test sets having disjoint speakers. The development set has 1,211 speakers and 143,768 utterances, while the test set consists of 40 speakers and 4,874 utterances. Our study leveraged the trial pairs provided by the authors together with the VoxCeleb-1 data. Due to the large amount of comparisons needed to simulate all contexts, our study focused on 1,000 out of 37,702 VoxCeleb-1 trial pairs and leaves the extension to the larger VoxCeleb-2 as future work. Here, we are more interested in understanding matcher robustness against background sounds, so the accuracy gains with larger datasets would not substantially affect the findings of our analysis.

4.2. Benchmarked Models

Our analysis benchmarks two state-of-the-art speaker recognition architectures: VGGVox [12] and XVector [11]. They have received great attention in recent years, and this motivated us to deepen their robustness in noisy environments. VGGVox is based on the VGG-M Convolutional Neural Network, with modifications to adapt to the audio spectrogram input. The last fully-connected layer is replaced by two layers, a fully-connected layer with support in the frequency domain and an average pooling layer with support in the time domain. XVector is a Time Delay Neural Network, which allows neurons to receive signals spanning across multiple frames. Given a filterbank, the first five layers operate on speech frames, with a small temporal context centered at the current frame. Then, a pooling layer aggregates frame-level outputs and computes their mean and standard deviation. Finally, two fully-connected layers aggregate statistics across the time dimension.
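As a small illustration of the statistics-pooling step described above for XVector (a generic sketch, not the authors' implementation; shapes are example values), the function below collapses a sequence of frame-level outputs into one utterance-level vector by concatenating their per-dimension mean and standard deviation.

```python
import numpy as np

def statistics_pooling(frame_outputs: np.ndarray) -> np.ndarray:
    """Collapse frame-level outputs of shape (num_frames, dim) into a
    fixed-size vector of shape (2 * dim,) by concatenating the
    per-dimension mean and standard deviation over time."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])

# Example: 300 frames of 512-dimensional outputs -> one 1024-dimensional vector.
pooled = statistics_pooling(np.random.randn(300, 512))
```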
4.3. Model Training and Testing Details

The code, implemented in Python, ran on an NVIDIA GPU. The audios were converted to single-channel, 16-bit streams at a 16 kHz sampling rate. We used 512-point Fast Fourier Transforms. VGGVox received spectrograms of size 512×300, while XVector received filterbanks of size 300×24. Both representations were generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms, and normalized by subtracting the mean and dividing by the standard deviation of all frequency components. Each model was trained for classification on the VoxCeleb-1 development set using Softmax, with batches of size 64. To keep consistency with respect to the original implementations of VGGVox and XVector, we used the Adam optimizer, with an initial learning rate of 0.001, decreased by a factor of 10 every 10 epochs, until convergence. For testing, we considered speaker embeddings of size 512. The choice of the architectural parameters was driven by the original model implementations, without any specific adaptation, given that we are interested in benchmarking the original models in noisy environments rather than tuning the parameters to align with our goals.
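The following sketch shows how such inputs could be computed with librosa under the windowing parameters quoted above (25 ms Hamming window, 10 ms step, 512-point FFT, 24 mel bands). It is an assumption-laden illustration: the exact cropping and padding that yield the 512×300 and 300×24 shapes depend on the original model implementations, and the per-frequency-bin normalization axis is our reading of the description.

```python
import librosa
import numpy as np

SR = 16000
N_FFT = 512
WIN = int(0.025 * SR)   # 25 ms Hamming window
HOP = int(0.010 * SR)   # 10 ms step

def normalize(feat: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization per frequency component over time."""
    return (feat - feat.mean(axis=1, keepdims=True)) / (feat.std(axis=1, keepdims=True) + 1e-8)

def spectrogram(audio: np.ndarray) -> np.ndarray:
    """Magnitude spectrogram (frequency x frames), e.g., for VGGVox-style inputs."""
    stft = librosa.stft(audio, n_fft=N_FFT, hop_length=HOP, win_length=WIN, window="hamming")
    return normalize(np.abs(stft))

def filterbank(audio: np.ndarray) -> np.ndarray:
    """24-band log-mel filterbank (frames x bands), e.g., for XVector-style inputs."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
        window="hamming", n_mels=24)
    return normalize(np.log(mel + 1e-8)).T
```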
4.4. Speaker Recognition Protocols

Given a pretrained model, a set of trial verification pairs, and a target context, the protocol worked as follows. For each trial pair in the set, we assumed that the first audio represented the enrolled utterance, ideally collected in controlled environments, while the second audio was the probe provided in the target context. Hence, the first audio remained unchanged, while the second audio was changed by adding sounds that characterize the target context, as explained in Section 3.2. For both the enrolled and the changed audios, the acoustic representations were extracted and fed into the pretrained model to get the speaker embeddings, and the cosine similarity between the speaker embeddings was calculated. This process was repeated for each trial pair in the set. Finally, given the resulting similarity scores and the verification labels (i.e., 0 for different-user pairs, 1 for same-user pairs), the Equal Error Rate (EER) under that context was computed. The entire protocol was repeated with different background sound volume levels, treated as ratios, assumed to be in [0, 0.05, 0.10, 0.20, 0.30, 0.50, 1, 1.5] (e.g., 1 means that the original volume is kept, 0.5 means that the volume of the background sound is reduced by 50%). This protocol was carried out on 25 contexts composed of either single categories of the third and fourth levels of our taxonomy or their combinations.
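A minimal sketch of the scoring step of this protocol is given below. The embedding extraction is assumed to be provided by the pretrained models, and the EER computation via a ROC curve is a standard choice rather than necessarily the authors' exact routine.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where false acceptance equals false rejection.
    labels: 1 for same-user pairs, 0 for different-user pairs."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Example with toy scores for three trial pairs.
print(equal_error_rate(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```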
4.5. Experimental Results

Given the considered taxonomy contexts, the VoxCeleb-1 trial pairs, and the two pre-trained speaker recognition models, we followed the protocol in Section 4.4 to calculate the EERs at various background sound volumes.

Indoor. Tables 1 and 2 report the EERs under various combinations of indoor sounds. It can be observed that Voice is the individual sound category that leads to the highest degradation in performance, with 27-30% EER at the 1.5 volume ratio. Home and Movement showed a similar impact when the volume level was below 1.0, while the former brought more negative effects with volume ratios higher than 1.0. When two or more sound categories were combined, EERs easily reached values above 15%, especially in scenarios where both Home and Voice sounds were present. XVector's performance substantially decreased as soon as sounds were added, while VGGVox showed a more robust behavior against background sounds. It might be possible that the changes introduced by the background sound in the spectrograms fed into VGGVox had a lower influence on the recognition pattern learnt by the convolutional layers during training. On the other hand, with XVector, the temporal context employed at each layer of the network might be highly influenced by the changes introduced into the filterbanks through the background sound addition.

Table 1: VGGVox - Indoor Scenario. EERs (%) achieved by VGGVox under an indoor scenario at each background sound volume ratio. VGGVox led to an EER of 2.20% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
Home | 4.60 | 5.80 | 7.80 | 10.00 | 12.50 | 19.30 | 26.40
Movement | 5.00 | 6.40 | 7.90 | 12.60 | 14.20 | 16.70 | 20.50
Voice | 8.10 | 12.20 | 17.30 | 18.20 | 19.40 | 24.20 | 27.90
Home-Movement | 5.00 | 6.50 | 13.00 | 15.90 | 23.20 | 30.00 | 32.20
Home-Voice | 10.10 | 14.00 | 20.00 | 23.10 | 26.00 | 34.60 | 39.10
Movement-Voice | 11.00 | 15.70 | 14.60 | 17.00 | 18.80 | 23.90 | 29.40
Home-Movement-Voice | 14.00 | 17.00 | 22.90 | 26.90 | 31.80 | 39.30 | 42.20

Table 2: XVector - Indoor Scenario. EERs (%) achieved by XVector under an indoor scenario at each background sound volume ratio. XVector led to an EER of 6.35% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
Home | 10.49 | 13.90 | 19.59 | 20.99 | 26.30 | 32.19 | 36.40
Movement | 12.60 | 17.80 | 23.70 | 26.00 | 30.70 | 28.50 | 30.30
Voice | 15.40 | 18.10 | 20.59 | 24.20 | 25.00 | 32.49 | 30.60
Home-Movement | 13.50 | 30.30 | 36.19 | 36.70 | 39.60 | 37.60 | 49.09
Home-Voice | 17.10 | 24.09 | 26.90 | 29.70 | 35.60 | 41.10 | 40.60
Movement-Voice | 21.79 | 27.90 | 33.59 | 34.19 | 37.90 | 36.80 | 39.80
Home-Movement-Voice | 21.59 | 30.30 | 36.19 | 36.70 | 39.60 | 44.20 | 46.09

Nature Outdoor. Tables 3 and 4 report the EERs obtained when nature-related sounds were added. Compared with indoor-related sounds, these sounds showed different degradation patterns from each other. It can be observed that models were robust against PlantsVegetation at any volume. Conversely, sounds coming from the AtmosphericElements category led to the worst EERs, with 40-43% EER at the highest volume level. Models suffered from the combination of Animals and AtmosphericElements sounds (44-48% EER reached at a volume ratio of 1.5). Compared with indoor scenarios, both VGGVox and XVector showed similar degradation patterns here. This behavior might be justified by the intrinsic properties and characteristics of the nature sounds included in our taxonomy, which are shorter and less deafening.

Table 3: VGGVox - Nature Outdoor Scenario. EERs (%) achieved by VGGVox under a nature outdoor scenario at each background sound volume ratio. VGGVox led to an EER of 2.20% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
Animals | 4.40 | 5.70 | 8.20 | 10.60 | 15.50 | 22.70 | 32.50
AtmosphericElements | 5.80 | 8.70 | 14.30 | 18.10 | 25.70 | 30.80 | 40.00
PlantsVegetation | 3.10 | 3.00 | 3.00 | 3.50 | 3.90 | 6.60 | 8.30
Animals-AtmosphericElements | 7.50 | 11.20 | 18.60 | 24.60 | 31.00 | 41.30 | 48.60
Animals-PlantsVegetation | 3.70 | 6.00 | 8.40 | 11.90 | 16.60 | 27.40 | 35.00
AtmosphericElements-PlantsVegetation | 5.60 | 9.00 | 15.00 | 17.80 | 24.20 | 31.90 | 41.40
Animals-AtmosphericElements-PlantsVegetation | 7.00 | 11.60 | 18.30 | 23.80 | 31.80 | 38.60 | 43.70

Table 4: XVector - Nature Outdoor Scenario. EERs (%) achieved by XVector under a nature outdoor scenario at each background sound volume ratio. XVector led to an EER of 6.35% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
Animals | 7.09 | 10.39 | 15.60 | 16.40 | 20.90 | 28.20 | 30.80
AtmosphericElements | 13.20 | 19.29 | 28.90 | 31.59 | 34.50 | 42.00 | 43.90
PlantsVegetation | 5.50 | 5.50 | 7.79 | 8.20 | 9.89 | 12.90 | 17.30
Animals-AtmosphericElements | 17.00 | 22.19 | 30.79 | 35.90 | 38.60 | 43.20 | 44.69
Animals-PlantsVegetation | 9.29 | 9.79 | 17.10 | 19.70 | 24.50 | 31.30 | 35.00
AtmosphericElements-PlantsVegetation | 14.10 | 21.69 | 26.00 | 32.69 | 35.60 | 43.80 | 45.69
Animals-AtmosphericElements-PlantsVegetation | 15.20 | 22.79 | 31.70 | 35.09 | 38.90 | 46.19 | 47.59

Mechanical Outdoor. Tables 5 and 6 show that, compared with indoor and nature outdoor settings, mechanical outdoor sounds led to substantial negative impacts on model performance, except in the case of SocialSignals sounds. It is important to notice that, even at low volume levels, Ventilation sounds caused substantial degradation in EERs, and this effect was amplified when two or more sound categories were combined. Among the most degraded settings, combining NonMT and Ventilation led to EERs of 35-50% even at a volume ratio of 0.1. Outdoor sounds coming from the Nature and Mechanical categories seemed to lead to more overlapping decision boundaries than indoor sounds. For instance, while being composed of different combinations of sound types, both the MotorizedTransport-NonMT and NonMT-SocialSignals settings showed similar EERs at volume levels higher than 0.4. It follows that mixing outdoor sounds can hamper speaker recognition more, and each type of outdoor sound significantly impacts model effectiveness. Similarly to the indoor scenario, VGGVox was more robust than XVector, possibly due to its depth in terms of layers.

Table 5: VGGVox - Mechanical Outdoor Scenario. EERs (%) achieved by VGGVox under a mechanical outdoor scenario at each background sound volume ratio. VGGVox led to an EER of 2.20% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
MotorizedTransport | 3.10 | 4.50 | 8.70 | 10.80 | 16.80 | 26.10 | 35.10
NonMT | 28.00 | 27.00 | 28.40 | 28.90 | 25.70 | 30.10 | 30.60
SocialSignals | 7.00 | 7.50 | 9.40 | 11.00 | 11.60 | 16.10 | 20.20
Ventilation | 20.30 | 20.10 | 20.90 | 22.20 | 25.70 | 29.80 | 32.30
MotorizedTransport-NonMT | 26.60 | 27.40 | 29.70 | 32.70 | 31.70 | 39.30 | 44.90
MotorizedTransport-SocialSignals | 8.10 | 9.20 | 14.00 | 16.70 | 22.60 | 32.10 | 38.00
MotorizedTransport-Ventilation | 20.60 | 22.70 | 22.70 | 25.60 | 30.70 | 37.40 | 44.10
NonMT-SocialSignals | 30.20 | 30.10 | 29.40 | 28.60 | 30.00 | 36.00 | 37.80
NonMT-Ventilation | 35.60 | 38.00 | 36.10 | 39.00 | 40.50 | 41.60 | 43.60
SocialSignals-Ventilation | 21.40 | 19.90 | 25.20 | 27.00 | 29.90 | 35.50 | 40.20
MotorizedTransport-NonMT-SocialSignals-Ventilation | 37.60 | 36.30 | 38.50 | 38.30 | 42.40 | 10.00 | 48.50

Table 6: XVector - Mechanical Outdoor Scenario. EERs (%) achieved by XVector under a mechanical outdoor scenario at each background sound volume ratio. XVector led to an EER of 6.35% when no sounds were added to the vocal files.
Sound Combination | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 1.00 | 1.50
MotorizedTransport | 9.30 | 13.30 | 20.99 | 27.20 | 32.49 | 40.00 | 42.10
NonMT | 47.50 | 48.00 | 53.50 | 46.90 | 53.30 | 50.60 | 44.39
SocialSignals | 10.20 | 8.69 | 13.80 | 14.50 | 19.40 | 24.70 | 29.90
Ventilation | 33.59 | 32.30 | 34.30 | 34.80 | 39.70 | 44.90 | 49.80
MotorizedTransport-NonMT | 49.80 | 48.69 | 45.30 | 51.70 | 47.59 | 48.40 | 50.40
MotorizedTransport-SocialSignals | 11.79 | 19.49 | 22.49 | 29.50 | 35.80 | 43.99 | 45.69
MotorizedTransport-Ventilation | 31.79 | 32.49 | 36.90 | 42.00 | 43.30 | 49.10 | 47.50
NonMT-SocialSignals | 49.90 | 50.50 | 51.60 | 49.40 | 49.20 | 49.40 | 48.00
NonMT-Ventilation | 52.40 | 49.50 | 50.10 | 49.90 | 48.30 | 51.30 | 51.60
SocialSignals-Ventilation | 39.70 | 38.70 | 38.40 | 42.20 | 40.50 | 44.69 | 50.30
MotorizedTransport-NonMT-SocialSignals-Ventilation | 50.40 | 49.90 | 51.50 | 48.40 | 49.40 | 49.70 | 49.70

Based on our results, under the considered settings, speaker recognition matchers do not appear adequately reliable. The impact of background sounds on performance depends on the context and the sound.

5. Conclusions and Future Work

In this paper, we proposed a taxonomy of labeled background sound recordings for speaker recognition research in noisy environments. Then, we devised a simulation framework of indoor and outdoor contexts in vocal audios. Finally, we assessed the impact of the taxonomy sounds on the performance of two speaker recognition models. Based on the results, indoor sounds have a lower impact than outdoor sounds, and outdoor scenarios that involve mechanical sounds are the most challenging, even at low background sound volumes.
Our work opens up a wide range of research directions. We plan to enrich the taxonomy with more categories and audios organized into an ontological representation. We will extend our analysis to other models (e.g., ResNet) and to languages beyond English. We will also inspect how background sounds and the respective scenarios affect the internal model dynamics (e.g., speaker embeddings). Naturally, we will leverage our framework to devise audio enhancement methods able to deal with the sounds of our taxonomy and to design novel approaches for more robust speaker recognition.

Acknowledgments

This work has been partially supported by the Sardinian Regional Government, POR FESR 2014-2020 - Axis 1, Action 1.1.3, under the project "SPRINT" (D.D. n. 2017 REA, 26/11/2018, CUP F21G18000240009).

References

[1] J. H. Hansen, T. Hasan, Speaker recognition by machines and humans: A tutorial review, IEEE Signal Processing Magazine 32 (2015) 74–99.
[2] A. S. Tulshan, S. N. Dhage, Survey on virtual assistant: Google Assistant, Siri, Cortana, Alexa, in: International Symposium on Signal Processing and Intelligent Recognition Systems, 2018, pp. 190–201.
[3] H. Feng, K. Fawaz, K. G. Shin, Continuous authentication for voice assistants, in: Proc. of the Annual International Conference on Mobile Computing and Networking, 2017, pp. 343–355.
[4] M. Schmidt, P. Braunger, A survey on different means of personalized dialog output for an adaptive personal assistant, in: Adjunct Publication of the Conference on User Modeling, Adaptation and Personalization, 2018, pp. 75–81.
[5] M. Marras, P. Korus, N. D. Memon, G. Fenu, Adversarial optimization for dictionary attacks on speaker verification, in: Proc. of the Annual Conference of the International Speech Communication Association, ISCA, 2019, pp. 2913–2917.
[6] A. Pradhan, K. Mehta, L. Findlater, Accessibility came by accident: Use of voice-controlled intelligent personal assistants by people with disabilities, in: Proc. of the CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–13.
[7] S. O. Sadjadi, T. Kheyrkhah, A. Tong, C. S. Greenberg, D. A. Reynolds, E. Singer, L. P. Mason, The 2016 NIST speaker recognition evaluation, in: Interspeech, 2017, pp. 1353–1357.
[8] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2018) 117–128.
[9] A. Martin, L. Mauuary, Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments, Speech Communication 48 (2006) 191–206.
[10] T. Afouras, J. S. Chung, A. Zisserman, The conversation: Deep audio-visual speech enhancement, arXiv preprint arXiv:1804.04121 (2018).
[11] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, in: Interspeech, 2017, pp. 999–1003.
[12] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, Voxceleb: Large-scale speaker verification in the wild, Computer Speech & Language 60 (2020).
[13] C. Qian, Y. Yu, Z.-H. Zhou, Analyzing evolutionary optimization in noisy environments, Evolutionary Computation 26 (2018) 1–41.
[14] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, A study on data augmentation of reverberant speech for robust speech recognition, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2017, pp. 5220–5224.
[15] X. Feng, B. Richardson, S. Amman, J. R. Glass, An environmental feature representation for robust speech recognition and for environment identification, in: Interspeech, 2017, pp. 3078–3082.
[16] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, R. Marxer, An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech & Language 46 (2017) 535–557.
[17] C. Donahue, B. Li, R. Prabhavalkar, Exploring speech enhancement with generative adversarial networks for robust speech recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5024–5028.
[18] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000) 19–41.
[19] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 788–798.
[20] A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, M. W. Mason, I-vector based speaker recognition on short utterances, in: Proc. Interspeech 2011, 2011, pp. 2341–2344.
[21] Y.-h. Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, C. Parada, Locally-connected and convolutional neural networks for small footprint speaker recognition, in: Proc. Interspeech 2015, 2015, pp. 1136–1140.
[22] S. Kim, B. Raj, I. Lane, Environmental noise embeddings for robust speech recognition, arXiv preprint arXiv:1601.02553 (2016).
[23] J. Salamon, C. Jacoby, J. P. Bello, A dataset and taxonomy for urban sound research, in: Proc. of the ACM International Conference on Multimedia, 2014, pp. 1041–1044.
[24] A. Brown, J. Kang, T. Gjestland, Towards standardization in soundscape preference assessment, Applied Acoustics 72 (2011) 387–392.