<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation Framework for Context-aware Speaker Recognition in Noisy Smart Living Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianni Fenu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberta Galici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Marras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>V. Ospedale 72, 09124 Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>0</volume>
      <issue>2017</issue>
      <fpage>999</fpage>
      <lpage>1003</lpage>
      <abstract>
        <p>The integration of voice control into connected devices is expected to improve the efficiency and comfort of our daily lives. However, the underlying biometric systems often impose constraints on the individual or the environment during interaction (e.g., quiet surroundings). Such constraints have to be surmounted in order to seamlessly recognize individuals. In this paper, we propose an evaluation framework for speaker recognition in noisy smart living environments. To this end, we designed a taxonomy of sounds (e.g., home-related, mechanical) that characterize representative indoor and outdoor environments where speaker recognition is adopted. Then, we devised an approach for off-line simulation of challenging noisy conditions in vocal audios originally collected under controlled environments, by leveraging our taxonomy. Our approach adds a (combination of) sound(s) belonging to the target environment into the current vocal example. Experiments on a large-scale public dataset and two state-of-the-art speaker recognition models show that adding certain background sounds to clean vocal audio leads to a substantial deterioration of recognition performance. In several noisy settings, our findings reveal that a speaker recognition model might end up making unreliable decisions. Our framework is intended to help system designers evaluate performance deterioration and develop speaker recognition models that are more robust in smart living environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Security</kwd>
        <kwd>Speaker Recognition</kwd>
        <kwd>Speaker Verification</kwd>
        <kwd>Noisy Environments</kwd>
        <kwd>Sound Taxonomy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Speech is a more natural way of interacting with devices than tapping screens. This form of interaction is receiving more and more attention, with voice-enabled services being used in every aspect of our lives. Speaker recognition verifies the identity of an individual before access to a service is granted. Unlike speech recognition, which detects spoken words, speaker recognition inspects patterns that distinguish one person's voice from another [1]. Recognizing the identity of a speaker becomes crucial in different scenarios. For instance, voice-enabled devices (e.g., assistants, smartphones) allow home owners to turn on lights, unlock doors, and listen to music seamlessly [2]. These recognition abilities can prevent unauthorized individuals from using devices without the owner's permission and can provide the evidence needed to personalize users' experiences with these devices, even outside the domestic borders [3, 4, 5]. Moreover, speaker recognition can make the lives of older adults and people with special needs easier and safer [6]. Hence, it is imperative to study and devise data-driven speaker recognition models that can improve human quality of life.</p>
      <p>State-of-the-art speaker recognition matchers exhibit impressive accuracy, especially when the voice quality is reasonably good [7]. For this reason, they implicitly or explicitly impose constraints on the environment, such as being stationary and quiet. Conventionally, speaker matchers are trained to classify vocal examples under idealistic conditions but are expected to operate well in challenging real-world situations. However, their performance sharply degrades when audios with substantial background sounds (e.g., traffic) are encountered. Enhancing voice data is demanding, since the related algorithms do not often explicitly attempt to preserve biometric cues in the data [8, 9, 10]. Existing robust speaker models are trained on data which do not cover various levels of interfering sounds and different sound types [11, 12]. Hence, several questions remain unanswered concerning how much and under which background sounds speaker recognition performance degrades, and how each type of sound impacts the mechanics of these matchers.</p>
      <p>Our study is hence organized around these directions and aims to perform an extensive performance analysis of deep speaker recognition matchers in a range of noisy living environments. To this end, we designed and collected a taxonomy of sounds (e.g., footsteps, laughing) that characterize representative living ambients where speaker recognition is finding adoption. Then, we devised an approach that allows us to simulate challenging noisy conditions in raw vocal audios by adding sounds from our taxonomy, according to the environment under consideration. Finally, we experimented with a public dataset, originally collected in controlled environments, and two state-of-the-art speaker recognition models, to inspect the impact of background noisy sounds on their performance. Our contribution is threefold:</p>
      <p>• We design a taxonomy of ambient sounds tailored to speaker recognition research, and we provide a dataset of recordings with labeled sound sources for each category in our taxonomy.</p>
      <p>• We propose an evaluation framework for speaker recognition benchmarking, enabling easier and faster simulation of indoor and outdoor noisy environments in (clean) vocal audios. Code, data, pre-trained models, and documentation are publicly available at https://mirkomarras.github.io/dl-voice-noise/.</p>
      <p>• Given a large vocal dataset, we perform an extensive analysis of the impact of the sounds in our taxonomy on the performance of two state-of-the-art speaker recognition matchers.</p>
      <p>Our experiments showed that, even when the background sound volume is low, speaker recognition systems undergo a substantial deterioration of accuracy. Only in the case of nature-related sounds (e.g., chirping, wind) is the sound impact negligible. Certain environmental settings lead to error rates five to ten times higher than those achieved in ideal conditions.</p>
      <p>The rest of this paper is organized as follows. Section 2 gives an overview of related work. Our taxonomy and the simulation framework are described in Section 3. Section 4 presents our experiments. Finally, Section 5 provides insights for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Our research lies at the intersection of three perspectives, namely studies which analyze the impact of background sounds on recognition, audio enhancement algorithms aimed at improving data quality, and speaker recognition approaches which seek to classify noisy vocal data with no pre-processing.</p>
      <sec id="sec-2-1">
        <title>2.1. Explorative Analysis in Noisy Environments</title>
        <p>Explorative analyses investigate how noisy environments influence speaker recognition performance. For instance, Qian et al. [13] studied the low-level noisy optimization task by means of evolutionary algorithms. The authors found that a bitwise noise can fundamentally affect recognition patterns during evaluation and, thus, might make it harder to deploy matchers in the real world. Differently, Ko et al. [14] focused on a performance comparison between acoustic models trained with and without simulated far-field speech on a real far-field voice dataset. Their experiments showed that acoustic models trained on simulated far-field speech led to significantly lower error rates in both distant- and close-talking scenarios. In [15], the authors presented a feature learning approach, referred to as e-vector, that can capture both channel and environment variability. Recently, Vincent et al. [16] analyzed the performance of speaker recognition matchers on the CHiME3 dataset, which consists of real recordings in noisy environments. Finally, Donahue et al. [17] analyzed the benefits resulting from training a speaker recognition matcher with both clean speech data and fake speech data created by means of a generative adversarial network. Though this research has greatly expanded our understanding, past works focused on low-level noises (e.g., bitwise) or did not specifically control how and under which ambient sounds the performance degrades. We argue that different background sounds may lead to fundamentally different impacts and, thus, a clear understanding of the extent of this impact is lacking.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Input Audio Quality Enhancement</title>
        <p>The existing literature includes audio enhancement algorithms that aim to provide audible improvements in a sound without degrading the quality of the original recording. This type of strategy fits well with the forensic context, where audios may contain some kind of background sound disturbance or sound artifact that interferes with the voice of interest. Examples of audio enhancement methods are removing static noise, eliminating phone-related interference, and clearing up random sounds (e.g., dogs barking, bells ringing). For instance, Hou et al. [8] proposed a convolution-based audio-visual auto-encoder for speech enhancement through multi-task learning. In [9], the authors investigated how to improve speech/non-speech detection robustness in very noisy environments, including stationary noise and short high-energy noise. Similarly, Afouras et al. [10] proposed an audio-visual neural network able to isolate a speaker's voice of interest from other sound interference, using visual information from the target speaker's lips. However, the designed methods do not often attempt to preserve biometric cues in the data and depend on the nature of the sound, which varies according to the context. Hence, our framework becomes a key asset to study recognition performance under background sounds against which countermeasures have been under-explored.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Robust Speaker Recognition</title>
        <p>Speaker recognition matchers have traditionally relied on Gaussian mixture models [18], joint factor analysis [19], and IVectors [20]. Recently, speaker matchers achieved impressive accuracy thanks to speaker embedding representations extracted from Deep Neural Networks trained for (one-shot) speaker classification. Notable examples include CVectors [21], XVectors [11], and VGGVox- and ResNet-Vectors [12]. Moreover, Kim et al. [22] proposed a deep noise-adaptation approach that dynamically adapts itself to the operational environment. Existing approaches in this area do not make any assumption on the training and testing data, which come from various noisy situations. Therefore, there is no fine-grained control on how these systems perform in specific noisy applicative scenarios, and the noisy situations are limited by the variety of recordings included in the considered dataset.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed Framework</title>
      <p>Our framework is composed of a dataset of sounds categorized according to our pre-defined taxonomy and a toolbox that simulates background living scenarios.</p>
      <sec id="sec-3-1">
        <title>3.1. Sound Taxonomy for Speaker Recognition in Living Scenarios</title>
        <p>Collecting utterances produced in various living scenarios while keeping track of the background sounds in each utterance is challenging and time-consuming. Hence, being able to combine utterances recorded in quiet environments with the background sounds of the considered scenario represents a viable alternative. The first step to put this idea into practice consists of collecting sounds from a wide range of sources and organizing them in a hierarchical taxonomy.</p>
        <p>Research on noisy sounds is challenging due to the lack of labeled audio data. Past works collected sounds from specific environments and resulted in commercial or private datasets. Recent contributions have provided publicly available datasets of environmental recordings [23]. On top of these collections, many studies have been carried out on sound classification [24]. Being designed for sound classification tasks, existing taxonomies cannot directly be applied nor combined to simulate speaker recognition scenarios. For instance, they often include few classes and sounds of marginal interest (e.g., gun shots), and they are organized according to the sound type. Conversely, for our purposes, a taxonomy should be designed with situational and contextual elements in mind (e.g., grouping sounds based on the ambient where they frequently appear).</p>
        <p>To address these issues, we propose a compilation of environmental sounds from over 50 classes. The selected sound clips were constructed, with a semi-automated pipeline, from recordings available in the above-mentioned urban sound taxonomies and on Freesound. Specifically, we first identified a representative set of scenarios/environments where speaker recognition is actively used nowadays, and then we filtered out the categories of sounds included in existing taxonomies that are of marginal interest for the selected scenarios (e.g., fire engine). Then, we introduced new sound categories that help to model speaker recognition scenarios whose sounds are not present in existing sound taxonomies (e.g., dishwasher, footsteps). The included classes were selected with the goal of maintaining balance between the major types of sounds characterizing the selected scenarios and of considering the limitations in the number and diversity of available sound recordings. Freesound was queried for common terms related to the considered scenarios, and the search results were verified by annotating fragments that contain events associated with a given scenario.</p>
        <p>Sounds are grouped in two major categories pertaining to indoor and outdoor contexts (our preliminary analysis considers two disjoint sets of indoor and outdoor sounds, leaving settings that cross-link sound entities within a graph-based taxonomy as future work):</p>
        <p>• Indoor category, with sounds divided into three different categories: Home (e.g., TV, washing machine), Voice (e.g., chatting, laughing), and Movement (e.g., footsteps, applause).</p>
        <p>• Outdoor category, with sounds divided into two categories: Nature contains different types of sounds, such as atmospheric elements (e.g., rain, wind), animal sounds (e.g., dogs, cats, birds), and sounds associated with plants and vegetation (e.g., leaves); Mechanical includes sounds produced by ventilation, motorized transports (e.g., cars, trains), non-motorized transports (e.g., bicycles), and other signals (e.g., church bells).</p>
        <p>The collected audios were converted to a unified format (16 kHz, mono, wav) to facilitate their processing with existing audio-programming packages. These sounds were arranged into the taxonomy in Figure 1 based on the above-mentioned considerations.</p>
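        <p>To make the organization concrete, the sketch below renders the two taxonomy levels as a plain Python dictionary and converts a recording to the unified format. This is an illustrative sketch rather than the released toolbox code; the class lists are abbreviated examples from the categories above.</p>
        <preformat># Illustrative two-level taxonomy; class lists are abbreviated examples.
TAXONOMY = {
    "Indoor": {
        "Home": ["tv", "washing_machine", "dishwasher"],
        "Voice": ["chatting", "laughing"],
        "Movement": ["footsteps", "applause"],
    },
    "Outdoor": {
        "Nature": {
            "Animals": ["dogs", "cats", "birds"],
            "AtmosphericElements": ["rain", "wind"],
            "PlantsVegetation": ["leaves"],
        },
        "Mechanical": {
            "Ventilation": ["fan"],
            "MotorizedTransport": ["cars", "trains"],
            "NonMT": ["bicycles"],
            "SocialSignals": ["church_bells"],
        },
    },
}

import librosa
import soundfile as sf

def to_unified_format(in_path, out_path, sr=16000):
    """Convert a sound recording to the unified 16 kHz, mono, 16-bit wav format."""
    audio, _ = librosa.load(in_path, sr=sr, mono=True)  # resample and downmix
    sf.write(out_path, audio, sr, subtype="PCM_16")
</preformat>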
        <sec id="sec-1-2-1">
          <title>3.2. Toolbox for Background Living</title>
        </sec>
        <sec id="sec-1-2-2">
          <title>Scenario Simulation</title>
          <p>Our taxonomy is proposed to facilitate the simulation
of real-world applicative contexts in vocal audio. Thus,
on top of this taxonomy, a way to combine vocal
audio and background sounds is needed. To this end, we
propose a Python toolbox that can simulate an
applicative context into a vocal audio. Specifically, we define
an applicative context has a set of one or more sound
entries taken from the taxonomy. Each entry includes
a string identifier associated to the sound category to
include (e.g., Home or Voice), a floating-point
number that specifies the volume level of that sound in the
current context, and a floating-point number in [0,1]
representing the probability of adding that sound into
a vocal example. Given a context defined as above</p>
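        <p>A minimal sketch of this routine follows, assuming the clean voice and the background sounds are numpy arrays already at the same 16 kHz sampling rate; the entry format mirrors the definition above, while the function and variable names are hypothetical.</p>
        <preformat>import random
import numpy as np

# An applicative context: (category id, volume level, probability) entries.
LIVING_ROOM = [("Home", 0.5, 1.0), ("Voice", 0.3, 0.7)]

def simulate_context(voice, context, sounds_by_category, rng=None):
    """Overlay a context's background sounds onto a clean vocal example.

    sounds_by_category maps a category id (e.g., "Home") to a list of
    candidate background sound arrays at the same sampling rate.
    """
    rng = rng or random.Random(42)
    noisy = voice.copy()
    for category, volume, prob in context:
        if rng.random() &lt; prob:  # add this sound with the given probability
            sound = rng.choice(sounds_by_category[category])
            sound = np.resize(sound, len(voice))  # loop or trim to voice length
            noisy = noisy + volume * sound        # scale by the volume level
    return noisy
</preformat>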
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we assess how much and under which background sounds speaker recognition performance degrades, how each background sound type impacts model mechanics, and how much the volume level of the background sounds leads to models that provide less accurate predictions. How each noisy context influences the behavior of state-of-the-art architectures, such as VGGVox and XVector, still remains under-explored, since their effectiveness has often been evaluated under ideal conditions, on vocal audios whose background sounds are unlabeled, or on a single type of vocal audio (e.g., from interviews).</p>
      <sec id="sec-4-1">
        <title>4.1. Seed Human Voice Dataset</title>
        <p>Given its large scale and its wide adoption in the literature, we simulated applicative contexts into the vocal data belonging to the VoxCeleb-1 dataset [12]. This collection consists of short utterances taken from video interviews published on YouTube, including speakers from a wide range of different ethnicities, accents, professions, and ages, fairly balanced with respect to gender (i.e., 55% males). The dataset is split into development and test sets having disjoint speakers. The development set has 1,211 speakers and 143,768 utterances, while the test set consists of 40 speakers and 4,874 utterances. Our study leveraged the trial pairs provided by the authors together with the VoxCeleb-1 data. Due to the large amount of comparisons needed to simulate all contexts, our study focused on 1,000 out of the 37,702 VoxCeleb-1 trial pairs and leaves the extension to the larger VoxCeleb2 as future work; since we are more interested in understanding matcher robustness against background sounds, the accuracy gains obtainable with larger datasets would not substantially affect the findings of our analysis.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Benchmarked Models</title>
        <p>Our analysis benchmarks two state-of-the-art speaker recognition architectures: VGGVox [12] and XVector [11]. They have received great attention in recent years, and this motivated us to examine their robustness in noisy environments. VGGVox is based on the VGG-M Convolutional Neural Network, with modifications to adapt it to the audio spectrogram input. The last fully-connected layer is replaced by two layers: a fully-connected layer with support in the frequency domain and an average pooling layer with support in the time domain. XVector is a Time Delay Neural Network, which allows neurons to receive signals spanning multiple frames. Given a filterbank, the first five layers operate on speech frames, with a small temporal context centered at the current frame. Then, a pooling layer aggregates the frame-level outputs and computes their mean and standard deviation. Finally, two fully-connected layers aggregate statistics across the time dimension.</p>
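        <p>The statistics pooling step that turns XVector's frame-level outputs into a single utterance-level representation can be rendered in a few lines of numpy; this is an illustrative sketch of the mechanism described above, not the reference implementation.</p>
        <preformat>import numpy as np

def stats_pooling(frame_feats):
    """Aggregate frame-level TDNN outputs (num_frames x num_features) into
    one utterance-level vector by concatenating mean and standard deviation."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])  # consumed by the fully-connected layers
</preformat>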
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Training and Testing Details</title>
        <p>The code, implemented in Python, ran on an NVIDIA GPU. The audios were converted to single-channel, 16-bit streams at a 16 kHz sampling rate. We used 512-point Fast Fourier Transforms. VGGVox received spectrograms of size 512×300, while XVector received filterbanks of size 300×24. Both representations were generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms, and normalized by subtracting the mean and dividing by the standard deviation of all frequency components. Each model was trained for classification on the VoxCeleb-1 development set using Softmax, with batches of size 64. To keep consistency with the original implementations of VGGVox and XVector, we used the Adam optimizer, with an initial learning rate of 0.001, decreased by a factor of 10 every 10 epochs, until convergence. For testing, we considered speaker embeddings of size 512. The choice of the architectural parameters was driven by the original model implementations, without any specific adaptation, given that we are interested in benchmarking the original models in noisy environments rather than tuning the parameters to align with our goals.</p>
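        <p>For illustration, the following sketch derives a normalized spectrogram roughly matching the setup above (512-point FFT, 25 ms Hamming window, 10 ms step); the librosa parameter choices are ours and only approximate the original pipeline.</p>
        <preformat>import librosa
import numpy as np

def spectrogram_features(audio, sr=16000):
    """Sliding-window magnitude spectrogram with per-frequency normalization."""
    stft = librosa.stft(audio, n_fft=512, window="hamming",
                        win_length=int(0.025 * sr),  # 25 ms window
                        hop_length=int(0.010 * sr))  # 10 ms step
    mag = np.abs(stft)
    # Zero mean and unit variance for every frequency component.
    return (mag - mag.mean(axis=1, keepdims=True)) / (mag.std(axis=1, keepdims=True) + 1e-8)
</preformat>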
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Speaker Recognition Protocol</title>
        <p>Given a pretrained model, a set of trial verification pairs, and a target context, the protocol worked as follows. For each trial pair in the set, we assumed that the first audio represented the enrolled utterance, ideally collected in a controlled environment, while the second audio was the probe provided in the target context. Hence, the first audio remained unchanged, while the second audio was changed by adding the sounds that characterize the target context, as explained in Section 3.2. For both the enrolled and the changed audios, the acoustic representations were extracted and fed into the pretrained model to get the speaker embeddings, and the cosine similarity between the speaker embeddings was calculated. Finally, given the resulting similarity scores and the verification labels (i.e., 0 for different-user pairs, 1 for same-user pairs), the Equal Error Rate (EER) under that context was computed. The entire protocol was repeated with different background sound volume levels, treated as ratios in [0, 0.05, 0.10, 0.20, 0.30, 0.50, 1, 1.5] (e.g., 1 means that the original volume is kept, while 0.5 means that the volume of the background sound is reduced by 50%). This protocol was carried out on 25 contexts composed of either single categories of the third and fourth levels of our taxonomy or their combinations.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Experimental Results</title>
        <p>Given the considered taxonomy contexts, the VoxCeleb-1 trial pairs, and the two pre-trained speaker recognition models, we followed the protocol in Section 4.4 to calculate the EERs at various background sound volumes.</p>
        <p>Indoor. Tables 1 and 2 report the EERs under the indoor sound settings (Home, Movement, Voice, and their combinations). It can be observed that Voice is the individual sound category that leads to the highest degradation in performance, with an EER of 27-30% at the 1.5 volume ratio. Home and Movement showed a similar impact when the volume level was below 1.0, while the former brought more negative effects at volume ratios higher than 1.0. When two or more sound categories were combined, EERs easily rose above 15%, especially in scenarios where both Home and Voice sounds were present. XVector's performance substantially decreased as soon as sounds were added, while VGGVox showed a more robust behavior against background sounds. It might be possible that the changes introduced by the background sound in the spectrograms fed into VGGVox had a lower influence on the recognition patterns learnt by the convolutional layers during training. On the other hand, with XVector, the temporal context employed at each layer of the network might be highly influenced by the changes introduced into the filterbanks through the background sound addition.</p>
        <p>Nature Outdoor. Tables 3 and 4 report the EERs obtained when nature-related sounds (Animals, AtmosphericElements, PlantsVegetation, and their combinations) were added. These sounds showed degradation patterns different from each other and from the indoor-related sounds. It can be observed that the models were robust against PlantsVegetation at any volume. Conversely, sounds coming from the AtmosphericElements category led to the worst EERs, with 40-43% of EER at the highest volume level, and the models suffered from the combination of Animals and AtmosphericElements sounds (44-48% of EER reached at a volume ratio of 1.5). Compared with the indoor scenarios, VGGVox and XVector showed similar degradation patterns here. This behavior might be justified by the intrinsic properties and characteristics of the nature sounds included in our taxonomy, which are shorter and less deafening.</p>
        <p>Mechanical Outdoor. Tables 5 and 6 show that mechanical outdoor sounds (MotorizedTransport, NonMT, SocialSignals, Ventilation, and their combinations) led to substantial negative impacts on model performance, except in the case of SocialSignals sounds, compared with the indoor and nature outdoor settings. It is important to notice that, even at low volume levels, Ventilation sounds caused substantial degradation in EERs, and this effect was amplified when two or more sound categories were combined. Among the most degraded settings, combining NonMT and Ventilation led to EERs of 35-50% even at a volume ratio of 0.1. Outdoor sounds coming from the Nature and Mechanical categories seemed to lead to more overlapping decision boundaries than indoor sounds. For instance, while being composed of different combinations of sound types, both the MotorizedTransport-NonMT and NonMT-SocialSignals settings showed similar EERs at volume levels higher than 0.4. It follows that mixing outdoor sounds can hamper speaker recognition even more, and each type of outdoor sound significantly impacts model effectiveness. Similarly to the indoor scenario, VGGVox was more robust than XVector, possibly due to its depth in terms of layers.</p>
        <p>Based on our results, under the considered settings, speaker recognition matchers do not appear adequately reliable. The impact of background sounds on performance depends on the context and the sound.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this paper, we proposed a taxonomy of labeled background sound recordings for speaker recognition research in noisy environments. Then, we devised a simulation framework of indoor and outdoor contexts in vocal audios. Finally, we assessed the impact of the taxonomy sounds on the performance of two speaker recognition models. Based on the results, indoor sounds have a lower impact than outdoor sounds, and outdoor scenarios that involve mechanical sounds are the most challenging, even at low background sound volumes.</p>
      <p>Our work opens up a wide range of research directions. We plan to enrich the taxonomy with more categories and audios organized into an ontological representation. We will extend our analysis to other models (e.g., ResNet) and to languages beyond English. We will also inspect how background sounds and the respective scenarios affect the internal model dynamics (e.g., speaker embeddings). Naturally, we will leverage our framework to devise audio enhancement methods able to deal with the sounds of our taxonomy and to design novel approaches for more robust speaker recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>