<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation Framework for Context-aware Speaker Recognition in Noisy Smart Living Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianni Fenu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberta Galici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Marras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>V. Ospedale 72, 09124 Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>0</volume>
      <issue>2017</issue>
      <fpage>999</fpage>
      <lpage>1003</lpage>
      <abstract>
        <p>The integration of voice control into connected devices is expected to improve the efficiency and comfort of our daily lives. However, the underlying biometric systems often impose constraints on the individual or the environment during interaction (e.g., quiet surroundings). Such constraints have to be surmounted in order to seamlessly recognize individuals. In this paper, we propose an evaluation framework for speaker recognition in noisy smart living environments. To this end, we designed a taxonomy of sounds (e.g., home-related, mechanical) that characterize representative indoor and outdoor environments where speaker recognition is adopted. Then, we devised an approach for off-line simulation of challenging noisy conditions in vocal audios originally collected under controlled environments, by leveraging our taxonomy. Our approach adds a (combination of) sound(s) belonging to the target environment into the current vocal example. Experiments on a large-scale public dataset and two state-of-the-art speaker recognition models show that adding certain background sounds to clean vocal audio leads to a substantial deterioration of recognition performance. In several noisy settings, our findings reveal that a speaker recognition model might end up making unreliable decisions. Our framework is intended to help system designers evaluate performance deterioration and develop speaker recognition models that are more robust in smart living environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Security</kwd>
        <kwd>Speaker Recognition</kwd>
        <kwd>Speaker Verification</kwd>
        <kwd>Noisy Environments</kwd>
        <kwd>Sound Taxonomy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Speech is a more natural way of interacting with devices than tapping screens. This form of interaction is receiving more and more attention, with voice-enabled services being used in every aspect of our lives. Speaker recognition verifies the identity of an individual before access to a service is granted. Unlike speech recognition, which detects spoken words, speaker recognition inspects patterns that distinguish one person's voice from another [1]. Recognizing the identity of a speaker becomes crucial in different scenarios. For instance, voice-enabled devices (e.g., assistants, smartphones) allow home owners to turn on lights, unlock doors, and listen to music seamlessly [2]. These recognition abilities can prevent unauthorized individuals from using devices without the owner's permission and can provide the evidence needed to personalize users' experiences with these devices, even outside the domestic borders [3, 4, 5]. Moreover, speaker recognition can make the lives of older adults and people with special needs easier and safer [6]. Hence, it is imperative to study and devise data-driven speaker recognition models that can improve human quality of life.</p>
      <p>State-of-the-art speaker recognition matchers exhibit impressive accuracy, especially when the voice quality is reasonably good [7]. For this reason, they implicitly or explicitly impose constraints on the environment, such as being stationary and quiet. Conventionally, speaker matchers are trained to classify vocal examples under idealistic conditions but are expected to operate well in challenging real-world situations. However, their performance sharply degrades when audios with substantial background sounds (e.g., traffic) are encountered. Enhancing voice data is demanding, since the related algorithms do not often explicitly attempt to preserve biometric cues in the data [8, 9, 10]. Existing robust speaker models are trained on data which do not cover various levels of interfering sounds and different sound types [11, 12]. Hence, several questions remain unanswered concerning how much and under which background sounds speaker recognition performance degrades, and how each type of sound impacts the mechanics of these matchers.</p>
      <p>Our study is hence organized around these directions and aims to perform an extensive performance analysis of deep speaker recognition matchers in a range of noisy living environments. To this end, we designed and collected a taxonomy of sounds (e.g., footsteps, laughing) that characterize representative living ambients where speaker recognition is finding adoption. Then, we devised an approach that allows us to simulate challenging noisy conditions in raw vocal audios by adding sounds from our taxonomy, according to the environment under consideration. Finally, we experimented with a public dataset, originally collected in controlled environments, and two state-of-the-art speaker recognition models, to inspect the impact of background noisy sounds on their performance. Our contribution is threefold:</p>
      <p>• We design a taxonomy of ambient sounds tailored to speaker recognition research, and we provide a dataset of recordings with labeled sound sources for each category in our taxonomy.</p>
      <p>• We propose an evaluation framework for speaker recognition benchmarking, enabling easier and faster simulation of indoor and outdoor noisy environments in (clean) vocal audios. Code, data, pre-trained models, and documentation are publicly available at https://mirkomarras.github.io/dl-voice-noise/.</p>
      <p>• Given a large vocal dataset, we perform an extensive analysis of the impact of the sounds in our taxonomy on the performance of two state-of-the-art speaker recognition matchers.</p>
      <p>Our experiments showed that, even when the background sound volume is low, speaker recognition systems undergo a substantial deterioration of accuracy. Only in the case of nature-related sounds (e.g., chirping, wind) is the sound impact negligible. Certain environmental settings lead to error rates five to ten times higher than those achieved in ideal conditions.</p>
      <p>The rest of this paper is organized as follows. Section 2 gives an overview of related work. Our taxonomy and the simulation framework are described in Section 3. Section 4 presents our experiments. Finally, Section 5 provides insights for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Our research lies at the intersection of three perspectives, namely studies which analyze the impact of background sounds on recognition, audio enhancement algorithms aimed at improving data quality, and speaker recognition approaches which seek to classify noisy vocal data with no pre-processing.</p>
      <sec id="sec-2-1">
        <title>2.1. Explorative Analysis in Noisy Environments</title>
        <p>Explorative analyses investigate how noisy environments influence speaker recognition performance. For instance, Qian et al. [13] studied the low-level noisy optimization task by means of evolutionary algorithms. The authors found that a bitwise noise can fundamentally affect recognition patterns during evaluation and, thus, might make it harder to deploy matchers in the real world. Differently, Ko et al. [14] focused on a performance comparison between acoustic models trained with and without simulated far-field speech on a real far-field voice dataset. Their experiments showed that acoustic models trained on simulated far-field speech led to significantly lower error rates in both distant- and close-talking scenarios. In [15], the authors presented a feature learning approach, referred to as e-vector, that can capture both channel and environment variability. Recently, Vincent et al. [16] analyzed the performance of speaker recognition matchers on the CHiME3 dataset, which consists of real recordings in noisy environments. Finally, Donahue et al. [17] analyzed the benefits resulting from training a speaker recognition matcher with both clean speech data and fake speech data created by means of a generative adversarial network. Though this research has greatly expanded our understanding, past works focused on low-level noises (e.g., bitwise) or did not specifically control how and under which ambient sounds the performance degrades. We argue that different background sounds may lead to fundamentally different impacts and, thus, a clear understanding of the extent of this impact is lacking.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Input Audio Quality Enhancement</title>
        <p>The existing literature includes audio enhancement algorithms that aim to provide audible improvements in a sound without degrading the quality of the original recording. This type of strategy fits well with the forensic context, where audios may contain some kind of background sound disturbance or sound artifact that interferes with the voice of interest. Examples of audio enhancement methods are removing static noise, eliminating phone-related interference, and clearing up random sounds (e.g., dogs barking, bells ringing). For instance, Hou et al. [8] proposed a convolution-based audio-visual auto-encoder for speech enhancement through multi-task learning. In [9], the authors investigated how to improve speech/non-speech detection robustness in very noisy environments, including stationary noise and short high-energy noise. Similarly, Afouras et al. [10] proposed an audio-visual neural network able to isolate a speaker's voice of interest from other sound interference, using visual information from the target speaker's lips. However, the designed methods do not often attempt to preserve biometric cues in the data and depend on the nature of the sound, which varies according to the context. Hence, our framework becomes a key asset to study recognition performance under background sounds against which countermeasures have been under-explored.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Robust Speaker Recognition</title>
        <p>Speaker recognition matchers have traditionally relied on Gaussian mixture models [18], joint factor analysis [19], and IVectors [20]. Recently, speaker matchers achieved impressive accuracy thanks to speaker embedding representations extracted from Deep Neural Networks trained for (one-shot) speaker classification. Notable examples include CVectors [21], XVectors [11], and VGGVox- and ResNet-Vectors [12]. Moreover, Kim et al. [22] proposed a deep noise-adaptation approach that dynamically adapts itself to the operational environment. Existing approaches in this area do not make any assumption on the training and testing data, which come from various noisy situations. Therefore, there is no fine-grained control on how these systems perform in specific noisy applicative scenarios, and the noisy situations are limited by the variety of recordings included in the considered dataset.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed Framework</title>
      <p>Our framework is composed of a dataset of sounds categorized according to our pre-defined taxonomy and a toolbox that simulates background living scenarios.</p>
      <sec id="sec-3-1">
        <title>3.1. Sound Taxonomy for Speaker Recognition in Living Scenarios</title>
        <p>Collecting utterances produced in various living scenarios while keeping track of the background sounds in each utterance is challenging and time-consuming. Hence, being able to combine utterances recorded in quiet environments with the background sounds of the considered scenario represents a viable alternative. The first step to put this idea into practice consists of collecting sounds from a wide range of sources and organizing them in a hierarchical taxonomy.</p>
        <p>Research on noisy sounds is challenging due to the lack of labeled audio data. Past works collected sounds from specific environments and resulted in commercial or private datasets. Recent contributions have provided publicly available datasets of environmental recordings [23]. On top of these collections, many studies have been carried out on sound classification [24]. Being designed for sound classification tasks, existing taxonomies cannot directly be applied nor combined to simulate speaker recognition scenarios. For instance, they often include few classes and sounds of marginal interest (e.g., gun shots), and they are organized according to the sound type. Conversely, for our purposes, a taxonomy should be designed with situational and contextual elements in mind (e.g., grouping sounds based on the ambient where they frequently appear).</p>
        <p>To address these issues, we propose a compilation of environmental sounds from over 50 classes. The selected sound clips were constructed, with a semi-automated pipeline, from recordings available in the above-mentioned urban sound taxonomies and on Freesound. Specifically, we first identified a representative set of scenarios/environments where speaker recognition is actively used nowadays, and then we filtered out the categories of sounds included in existing taxonomies that are of marginal interest for the selected scenarios (e.g., fire engine). Then, we introduced new sound categories that help to model speaker recognition scenarios whose sounds are not present in existing sound taxonomies (e.g., dishwasher, footsteps). The included classes were selected with the goal of maintaining balance between the major types of sounds characterizing the selected scenarios and of considering the limitations in the number and diversity of available sound recordings. Freesound was queried for common terms related to the considered scenarios, and the search results were verified by annotating fragments that contain events associated with a given scenario.</p>
        <p>Sounds are grouped in two major categories pertaining to indoor and outdoor contexts (our preliminary analysis considers two disjoint sets of indoor and outdoor sounds, leaving settings that cross-link sound entities within a graph-based taxonomy as future work):</p>
        <p>• Indoor category, with sounds divided into three different categories: Home (e.g., TV, washing machine), Voice (e.g., chatting, laughing), and Movement (e.g., footsteps, applause).</p>
        <p>• Outdoor category, with sounds divided into two categories: Nature contains different types of sounds, such as atmospheric elements (e.g., rain, wind), animal sounds (e.g., dogs, cats, birds), and sounds associated with plants and vegetation (e.g., leaves); Mechanical includes sounds produced by ventilation, motorized transports (e.g., cars, trains), non-motorized transports (e.g., bicycles), and other signals (e.g., church bells).</p>
        <p>The collected audios were converted to a unified format (16 kHz, mono, wav) to facilitate their processing with existing audio-programming packages. These sounds were arranged into the taxonomy in Figure 1 based on the above-mentioned considerations.</p>
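        <p>To make the organization concrete, the sketch below renders the two taxonomy levels as a plain Python dictionary and converts a recording to the unified format. This is an illustrative sketch rather than the released toolbox code; the class lists are abbreviated examples from the categories above.</p>
        <preformat># Illustrative two-level taxonomy; class lists are abbreviated examples.
TAXONOMY = {
    "Indoor": {
        "Home": ["tv", "washing_machine", "dishwasher"],
        "Voice": ["chatting", "laughing"],
        "Movement": ["footsteps", "applause"],
    },
    "Outdoor": {
        "Nature": {
            "Animals": ["dogs", "cats", "birds"],
            "AtmosphericElements": ["rain", "wind"],
            "PlantsVegetation": ["leaves"],
        },
        "Mechanical": {
            "Ventilation": ["fan"],
            "MotorizedTransport": ["cars", "trains"],
            "NonMT": ["bicycles"],
            "SocialSignals": ["church_bells"],
        },
    },
}

import librosa
import soundfile as sf

def to_unified_format(in_path, out_path, sr=16000):
    """Convert a sound recording to the unified 16 kHz, mono, 16-bit wav format."""
    audio, _ = librosa.load(in_path, sr=sr, mono=True)  # resample and downmix
    sf.write(out_path, audio, sr, subtype="PCM_16")
</preformat>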
        <sec id="sec-1-2-1">
          <title>3.2. Toolbox for Background Living</title>
        </sec>
        <sec id="sec-1-2-2">
          <title>Scenario Simulation</title>
          <p>Our taxonomy is proposed to facilitate the simulation
of real-world applicative contexts in vocal audio. Thus,
on top of this taxonomy, a way to combine vocal
audio and background sounds is needed. To this end, we
propose a Python toolbox that can simulate an
applicative context into a vocal audio. Specifically, we define
an applicative context has a set of one or more sound
entries taken from the taxonomy. Each entry includes
a string identifier associated to the sound category to
include (e.g., Home or Voice), a floating-point
number that specifies the volume level of that sound in the
current context, and a floating-point number in [0,1]
representing the probability of adding that sound into
a vocal example. Given a context defined as above</p>
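        <p>A minimal sketch of this routine follows, assuming the clean voice and the background sounds are numpy arrays already at the same 16 kHz sampling rate; the entry format mirrors the definition above, while the function and variable names are hypothetical.</p>
        <preformat>import random
import numpy as np

# An applicative context: (category id, volume level, probability) entries.
LIVING_ROOM = [("Home", 0.5, 1.0), ("Voice", 0.3, 0.7)]

def simulate_context(voice, context, sounds_by_category, rng=None):
    """Overlay a context's background sounds onto a clean vocal example.

    sounds_by_category maps a category id (e.g., "Home") to a list of
    candidate background sound arrays at the same sampling rate.
    """
    rng = rng or random.Random(42)
    noisy = voice.copy()
    for category, volume, prob in context:
        if rng.random() &lt; prob:  # add this sound with the given probability
            sound = rng.choice(sounds_by_category[category])
            sound = np.resize(sound, len(voice))  # loop or trim to voice length
            noisy = noisy + volume * sound        # scale by the volume level
    return noisy
</preformat>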
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we assess how much and under which background sounds speaker recognition performance degrades, how each background sound type impacts model mechanics, and how much the volume level of the background sounds leads to models that provide less accurate predictions. How each noisy context influences the behavior of state-of-the-art architectures, such as VGGVox and XVector, still remains under-explored, since their effectiveness has often been evaluated under ideal conditions, on vocal audios whose background sounds are unlabeled, or on a single type of vocal audio (e.g., from interviews).</p>
      <sec id="sec-4-1">
        <title>4.1. Seed Human Voice Dataset</title>
        <p>Given its large scale and its wide adoption in the literature, we simulated applicative contexts into the vocal data belonging to the VoxCeleb-1 dataset [12]. This collection consists of short utterances taken from video interviews published on YouTube, including speakers from a wide range of different ethnicities, accents, professions, and ages, fairly balanced with respect to gender (i.e., 55% males). The dataset is split into development and test sets having disjoint speakers. The development set has 1,211 speakers and 143,768 utterances, while the test set consists of 40 speakers and 4,874 utterances. Our study leveraged the trial pairs provided by the authors together with the VoxCeleb-1 data. Due to the large amount of comparisons needed to simulate all contexts, our study focused on 1,000 out of the 37,702 VoxCeleb-1 trial pairs and leaves the extension to the larger VoxCeleb2 as future work; since we are more interested in understanding matcher robustness against background sounds, the accuracy gains obtainable with larger datasets would not substantially affect the findings of our analysis.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Benchmarked Models</title>
        <p>Our analysis benchmarks two state-of-the-art speaker recognition architectures: VGGVox [12] and XVector [11]. They have received great attention in recent years, and this motivated us to examine their robustness in noisy environments. VGGVox is based on the VGG-M Convolutional Neural Network, with modifications to adapt it to the audio spectrogram input. The last fully-connected layer is replaced by two layers: a fully-connected layer with support in the frequency domain and an average pooling layer with support in the time domain. XVector is a Time Delay Neural Network, which allows neurons to receive signals spanning multiple frames. Given a filterbank, the first five layers operate on speech frames, with a small temporal context centered at the current frame. Then, a pooling layer aggregates the frame-level outputs and computes their mean and standard deviation. Finally, two fully-connected layers aggregate statistics across the time dimension.</p>
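        <p>The statistics pooling step that turns XVector's frame-level outputs into a single utterance-level representation can be rendered in a few lines of numpy; this is an illustrative sketch of the mechanism described above, not the reference implementation.</p>
        <preformat>import numpy as np

def stats_pooling(frame_feats):
    """Aggregate frame-level TDNN outputs (num_frames x num_features) into
    one utterance-level vector by concatenating mean and standard deviation."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])  # consumed by the fully-connected layers
</preformat>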
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Training and Testing Details</title>
        <p>The code, implemented in Python, ran on an NVIDIA GPU. The audios were converted to single-channel, 16-bit streams at a 16 kHz sampling rate. We used 512-point Fast Fourier Transforms. VGGVox received spectrograms of size 512×300, while XVector received filterbanks of size 300×24. Both representations were generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms, and normalized by subtracting the mean and dividing by the standard deviation of all frequency components. Each model was trained for classification on the VoxCeleb-1 development set using Softmax, with batches of size 64. To keep consistency with the original implementations of VGGVox and XVector, we used the Adam optimizer, with an initial learning rate of 0.001, decreased by a factor of 10 every 10 epochs, until convergence. For testing, we considered speaker embeddings of size 512. The choice of the architectural parameters was driven by the original model implementations, without any specific adaptation, given that we are interested in benchmarking the original models in noisy environments rather than tuning the parameters to align with our goals.</p>
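        <p>For illustration, the following sketch derives a normalized spectrogram roughly matching the setup above (512-point FFT, 25 ms Hamming window, 10 ms step); the librosa parameter choices are ours and only approximate the original pipeline.</p>
        <preformat>import librosa
import numpy as np

def spectrogram_features(audio, sr=16000):
    """Sliding-window magnitude spectrogram with per-frequency normalization."""
    stft = librosa.stft(audio, n_fft=512, window="hamming",
                        win_length=int(0.025 * sr),  # 25 ms window
                        hop_length=int(0.010 * sr))  # 10 ms step
    mag = np.abs(stft)
    # Zero mean and unit variance for every frequency component.
    return (mag - mag.mean(axis=1, keepdims=True)) / (mag.std(axis=1, keepdims=True) + 1e-8)
</preformat>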
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Speaker Recognition Protocol</title>
        <p>Given a pretrained model, a set of trial verification pairs, and a target context, the protocol worked as follows. For each trial pair in the set, we assumed that the first audio represented the enrolled utterance, ideally collected in a controlled environment, while the second audio was the probe provided in the target context. Hence, the first audio remained unchanged, while the second audio was changed by adding the sounds that characterize the target context, as explained in Section 3.2. For both the enrolled and the changed audios, the acoustic representations were extracted and fed into the pretrained model to get the speaker embeddings, and the cosine similarity between the speaker embeddings was calculated. Finally, given the resulting similarity scores and the verification labels (i.e., 0 for different-user pairs, 1 for same-user pairs), the Equal Error Rate (EER) under that context was computed. The entire protocol was repeated with different background sound volume levels, treated as ratios in [0, 0.05, 0.10, 0.20, 0.30, 0.50, 1, 1.5] (e.g., 1 means that the original volume is kept, while 0.5 means that the volume of the background sound is reduced by 50%). This protocol was carried out on 25 contexts composed of either single categories of the third and fourth levels of our taxonomy or their combinations.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Experimental Results</title>
        <p>Given the considered taxonomy contexts, the VoxCeleb-1 trial pairs, and the two pre-trained speaker recognition models, we followed the protocol in Section 4.4 to calculate the EERs at various background sound volumes.</p>
        <p>Indoor. Tables 1 and 2 report the EERs under the indoor sound settings (Home, Movement, Voice, and their combinations). It can be observed that Voice is the individual sound category that leads to the highest degradation in performance, with an EER of 27-30% at the 1.5 volume ratio. Home and Movement showed a similar impact when the volume level was below 1.0, while the former brought more negative effects at volume ratios higher than 1.0. When two or more sound categories were combined, EERs easily rose above 15%, especially in scenarios where both Home and Voice sounds were present. XVector's performance substantially decreased as soon as sounds were added, while VGGVox showed a more robust behavior against background sounds. It might be possible that the changes introduced by the background sound in the spectrograms fed into VGGVox had a lower influence on the recognition patterns learnt by the convolutional layers during training. On the other hand, with XVector, the temporal context employed at each layer of the network might be highly influenced by the changes introduced into the filterbanks through the background sound addition.</p>
        <p>Nature Outdoor. Tables 3 and 4 report the EERs obtained when nature-related sounds (Animals, AtmosphericElements, PlantsVegetation, and their combinations) were added. These sounds showed degradation patterns different from each other and from the indoor-related sounds. It can be observed that the models were robust against PlantsVegetation at any volume. Conversely, sounds coming from the AtmosphericElements category led to the worst EERs, with 40-43% of EER at the highest volume level, and the models suffered from the combination of Animals and AtmosphericElements sounds (44-48% of EER reached at a volume ratio of 1.5). Compared with the indoor scenarios, VGGVox and XVector showed similar degradation patterns here. This behavior might be justified by the intrinsic properties and characteristics of the nature sounds included in our taxonomy, which are shorter and less deafening.</p>
        <p>Mechanical Outdoor. Tables 5 and 6 show that mechanical outdoor sounds (MotorizedTransport, NonMT, SocialSignals, Ventilation, and their combinations) led to substantial negative impacts on model performance, except in the case of SocialSignals sounds, compared with the indoor and nature outdoor settings. It is important to notice that, even at low volume levels, Ventilation sounds caused substantial degradation in EERs, and this effect was amplified when two or more sound categories were combined. Among the most degraded settings, combining NonMT and Ventilation led to EERs of 35-50% even at a volume ratio of 0.1. Outdoor sounds coming from the Nature and Mechanical categories seemed to lead to more overlapping decision boundaries than indoor sounds. For instance, while being composed of different combinations of sound types, both the MotorizedTransport-NonMT and NonMT-SocialSignals settings showed similar EERs at volume levels higher than 0.4. It follows that mixing outdoor sounds can hamper speaker recognition even more, and each type of outdoor sound significantly impacts model effectiveness. Similarly to the indoor scenario, VGGVox was more robust than XVector, possibly due to its depth in terms of layers.</p>
        <p>Based on our results, under the considered settings, speaker recognition matchers do not appear adequately reliable. The impact of background sounds on performance depends on the context and the sound.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this paper, we proposed a taxonomy of labeled background sound recordings for speaker recognition research in noisy environments. Then, we devised a simulation framework of indoor and outdoor contexts in vocal audios. Finally, we assessed the impact of the taxonomy sounds on the performance of two speaker recognition models. Based on the results, indoor sounds have a lower impact than outdoor sounds, and outdoor scenarios that involve mechanical sounds are the most challenging, even at low background sound volumes.</p>
      <p>Our work opens up a wide range of research directions. We plan to enrich the taxonomy with more categories and audios organized into an ontological representation. We will extend our analysis to other models (e.g., ResNet) and to languages beyond English. We will also inspect how background sounds and the respective scenarios affect the internal model dynamics (e.g., speaker embeddings). Naturally, we will leverage our framework to devise audio enhancement methods able to deal with the sounds of our taxonomy and to design novel approaches for more robust speaker recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>