<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text-independent voice recognition based on Siamese networks and fusion embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>R. De Prisco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>C. Fusco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Iannucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D. Malandrino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Zaccagnino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salerno</institution>
          ,
          <addr-line>Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The problem of identifying people from their voices has been the subject of increasing research activities. Interest in this problem is fostered by the important practical applications that voice authentication has. Many solutions exploit neural networks based on i-vectors and, more recently, on x-vectors, which are computed from the input audio signal. In this paper we design and implement a novel voice recognition system based on the fusion of both i-vectors and x-vectors. The recognition is text-independent, that is, the user is recognized regardless of the actual words that are pronounced. We performed preliminary experiments to assess the effectiveness of the proposed solution. Results show that the proposed method achieves performance improvement compared with approaches based on only i-vectors or only x-vectors.</p>
      </abstract>
      <kwd-group>
        <kwd>Voice authentication</kwd>
        <kwd>i-vectors</kwd>
        <kwd>x-vectors</kwd>
        <kwd>Siamese networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Identification of individuals through the voice relies on the existence of strictly personal traits [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
in the voice. Voice authentication can be useful in a wide range of applications in real-world
scenarios. For example, it can be used for voice-based authentication of personal smart devices,
and for guaranteeing the transaction security of bank trading and remote payment. In digital
forensics, it has been widely applied for investigations [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">2, 3, 1</xref>
        ], or surveillance and automatic
identity tagging [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Research in this field dates back to at least the 1960s [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Over the years,
a number of acoustic features, such as the mel-frequency cepstral coefficients, and template
models have been applied [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Early approaches to the problem can be found in [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
      </p>
      <p>
        The development of i-vectors as fixed dimensional front-end features for speaker recognition
tasks was introduced in [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ], and has provided the state-of-the-art performance for several
years, until the era of Deep Learning. Approaches based on deep learning [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12, 13</xref>
        ] have
significantly increased the performance, especially in noisy environments [14, 15], and are based
on x-vectors [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Systems based on x-vectors improve over those based on i-vectors [16].
Both i-vectors and x-vectors are numerical representations, in the form of a fixed-size array of
numbers, of a voice audio signal.
      </p>
      <p>
        Voice authentication can be classified into text-dependent and text-independent.
Text-dependent voice authentication involves the use of a fixed sentence. The sentence does not
need to be secret since the recognition is based on the voice and not on the sentence. The
system is called text-dependent because the user is required to pronounce the fixed sentence
in order to be recognized. With text-independent voice authentication, the user is recognized
regardless of the actual words pronounced, and thus there is no need to fix a text.
Contribution of this work. In this paper, we propose a novel text-independent voice
recognition system. The novelty is in the combined use of i-vectors and x-vectors. The proposed system
is based on a fusion embedding vector, obtained as a combination of i-vectors and x-vectors. For
brevity, we will refer to fusion embedding vectors just as fusion embeddings. The motivation
behind the use of fusion embeddings is that of bringing together the potential of both
representations. Indeed, although x-vectors have better recognition performance, especially on short
speeches, and an intrinsic ability to discriminate by definition, i-vectors seem to provide better
results in recognizing the same speaker recording from different devices. Moreover, the idea of
combining such embeddings is also motivated by the results obtained in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where the authors
showed that the combined use of both approaches can lead to better performance. However,
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] no vectorial combination has been investigated; instead, two separate recognitions,
one based only on i-vectors and the other based only on x-vectors, are first computed and a
final one is obtained by averaging the scores. In our approach we first average the vectors and
then apply the recognition. This is a fundamental difference.
      </p>
      <p>The system uses a Siamese network which is trained on fusion embedding vectors from a
database of recorded voice audio signals (recorded speeches). The Siamese network is then able
to take as input two fusion embedding vectors and tell whether they are derived from speeches
of the same person. The use of a Siamese network increases the discrimination strength because
this category of deep neural networks is particularly suitable for the computation
of similarity measures and for determining relationships/discriminations between two comparable
subjects [17].</p>
      <p>Paper organization. The rest of the paper is organized as follows. In Section 2, we discuss
some relevant works. In Section 3, we provide the needed background about i-vectors, x-vectors,
and Siamese networks. In Section 4, we describe the proposed system. In Section 5, we report
the results obtained from experiments carried out to assess the effectiveness of the proposed
system. Finally, in Section 6 we provide conclusions and directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        For several years, most voice recognition systems have been based on the i-vector and the
Probabilistic Linear Discriminant Analysis (PLDA) [18]. With the advent of Deep Learning,
deep speaker embedding has led to significant performance improvements [
        <xref ref-type="bibr" rid="ref12">12, 19, 20, 21</xref>
        ].
Deep speaker embedding uses a speaker identification network to create a speaker-embedding
space. In [22], the i-vector extraction and the PLDA scoring have been jointly derived using
a single deep neural network and the model is trained using a binary cross entropy criterion.
The use of triplet loss in end-to-end speaker recognition has shown improved performance
for short utterances [23]. Wan et al. [24] proposed a generalized end-to-end loss function
inspired by minimizing the distances of same-speaker utterances to their centroid while maximizing the
distances between clusters of different speakers. In this direction, many architectures based
on convolutional neural networks have been used for frame-level processing, such as, x-vectors
using time delay neural networks to extract the frame-level features [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Later, more advanced
networks, such as ResNets [19], DenseNets [20], and Res2Nets [21], and different training loss
functions besides the softmax loss function have been used. For example, the additive margin
softmax loss [25] and the additive angular margin softmax loss [26] have been introduced. Another
category of deep speaker embedding uses metric learning [27, 28, 29], which is characterized
by distance measures used to guide the embedding network so that the speaker embeddings
have simultaneously large inter-class distance and small intra-class distance. For example,
triplet loss [29], prototypical network loss [28], and angular prototypical loss [27] have been
investigated.
      </p>
      <p>In [30], a novel text-independent method able to combine speaker feature extraction and
speaker classification in a single step was proposed. The two main aspects of such a method
were the use of a speaker representation consisting of the Mel-frequency spectrogram extracted
from the input audio, in order to benefit from the dependency of the adjacent spectro-temporal
features, and the use of a Siamese convolutional network to perform feature extraction and
speaker classification. Results obtained during experiments showed significant improvement
over conventional classical and DL-based algorithms for forensic cross-device voice recognition.</p>
      <p>The most interesting aspect that emerges from the literature is that the exclusive use of
x-vectors or i-vectors has advantages but also limitations. As we have already
said in Section 1, in this preliminary work we investigate the combined usage of i-vectors and
x-vectors with the goal of bringing their potential together and thus building a novel
speaker feature embedding. We show that such a combination can be effectively exploited for training
a Siamese network to calculate a similarity score that can be used to recognize the voice of a
speaker who can pronounce any pass-phrase to authenticate.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>In this section, we briefly recall the needed background to understand the proposed approach.
Specifically, first we will provide details about i-vectors and x-vectors, by highlighting strengths
and weaknesses, and then we will discuss Siamese networks.</p>
      <sec id="sec-3-1">
        <title>3.1. Speaker feature extraction with i-vector</title>
        <p>
          Joint Factor Analysis [
          <xref ref-type="bibr" rid="ref8">8, 31, 32</xref>
          ] has represented the state-of-the-art for text-independent speaker
detection tasks for several years, due to its power in modeling the inter-speaker variability
and in compensating for channel/session variability in the context of the Gaussian Mixture
Model. The first voice recognition system based on Joint Factor Analysis as a feature extractor
was proposed in [33]. The idea was to represent a speaker utterance as a speaker-dependent
supervector combining factors from two distinct spaces, i.e., the speaker space containing speaker
variabilities, and the channel space containing channel variabilities. Thus, the performance was
essentially affected by the speaker and channel variations of utterances.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the authors proposed a kind of “speaker embedding”, by defining only a single total
variability space, instead of two separate spaces, which contains the speaker and channel
variabilities simultaneously. Given an utterance, the speaker- and channel-dependent Gaussian
Mixture Model supervector M is defined as M = m + Tw, where m is the
speaker- and channel-independent supervector, T is a rectangular matrix of low rank, and w is a random
vector having a standard normal distribution. The components of the vector w are the total
factors, and w is named the i-vector.
        </p>
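        <p>As an illustration of the model above, the following sketch draws the total factors w from a standard normal distribution and builds the supervector M = m + Tw. All dimensions and numeric values here are toy stand-ins; real supervectors have tens of thousands of entries, T is estimated from data, and in this paper the i-vector has 400 total factors.</p>

```python
import random

# Toy dimensions, for illustration only
SUPERVECTOR_DIM = 6   # |M| and |m|
IVECTOR_DIM = 2       # |w| (400 in the paper)

# m: speaker- and channel-independent supervector (made-up values)
m = [0.5, -0.2, 0.1, 0.0, 0.3, -0.1]

# T: low-rank rectangular matrix, SUPERVECTOR_DIM rows x IVECTOR_DIM columns (made-up values)
T = [[0.4, 0.1],
     [0.2, -0.3],
     [0.0, 0.5],
     [-0.1, 0.2],
     [0.3, 0.0],
     [0.1, 0.1]]

# w: total factors, drawn from a standard normal distribution
random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(IVECTOR_DIM)]

# M = m + T w: speaker- and channel-dependent supervector
M = [m[i] + sum(T[i][j] * w[j] for j in range(IVECTOR_DIM))
     for i in range(SUPERVECTOR_DIM)]

print(len(M))  # same dimension as m
```

Note that M lives in the high-dimensional supervector space, while the compact representation used for recognition is the low-dimensional w.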
        <p>
          We remark that the main advantage of the i-vectors is that they are not strictly dependent on
the change of the transmission channel or on the variability of the speaker’s vocal characteristics
(such as cadence or accent), since in the model proposed by Dehak et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] based on the Joint
Factor Analysis, both factors are taken into account as a whole, during the modeling phase.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Speaker feature extraction with deep embedding: x-vector</title>
        <p>
          Similar to the i-vector, the x-vector is also a kind of speaker embedding, named deep speaker
embedding, which discriminatively embeds speakers into a vector space by using a Delay
Neural Network trained in a supervised fashion [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Such a Delay Neural Network computes
speaker embeddings from variable-length acoustic segments and was implemented using the
nnet3 neural network library in the Kaldi Speech Recognition Toolkit1. The features are
30-dimensional Mel Frequency Cepstral Coefficients with a frame-length of 25 ms, mean-normalized
over a sliding window of up to 3 seconds. The Delay Neural Network (see Figure 1), is organized
on 2 levels: frame and segment level. The nonlinearities are rectified linear units (ReLUs). The
frame level consists of 5 layers, through which the audio frames, taken as input and split with
a sliding window of 3 seconds, are given as input to a Time Delay Neural Network [34], which
enables the network to learn the structural information of the signal and the relationships between
the various frames. The statistics pooling layer receives the output of the final frame-level layer
as input, aggregates over the input segment, and computes its mean and standard deviation.
The segment-level statistics are concatenated together, passed to two additional hidden layers
with dimensions 512 and 300, and finally the softmax output layer maps the obtained x-vector
to the probability of the speaker.
        </p>
        <p>The goal of the Delay Neural Network is to produce embeddings that capture speaker
characteristics over the entire utterance, rather than at the frame level. Although both layers a and b
after the statistics pooling layer can be used to extract the embedding, usually a is used as the
x-vector.</p>
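        <p>The statistics pooling step described above can be sketched in plain Python. The toy frame-level vectors below are hypothetical stand-ins for the TDNN's frame-level activations; the key property is that a variable number of frames is reduced to one fixed-size segment-level vector.</p>

```python
import math

def statistics_pooling(frames):
    """Aggregate variable-length frame-level features into a single
    fixed-size segment-level vector: the per-dimension mean and
    standard deviation, concatenated."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return means + stds  # output size is 2 * dim, independent of n

# Toy 3-dimensional frame-level outputs for a 4-frame segment (made-up values)
frames = [[1.0, 0.0, 2.0],
          [3.0, 0.0, 2.0],
          [1.0, 0.0, 4.0],
          [3.0, 0.0, 4.0]]
pooled = statistics_pooling(frames)
print(pooled)  # [2.0, 0.0, 3.0, 1.0, 0.0, 1.0]
```

Whatever the utterance length, the pooled vector has the same size, which is what allows the segment-level layers to operate on fixed-dimensional input.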
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Siamese networks</title>
        <p>Siamese networks were introduced in [35] to solve the problem of matching hand written
signatures, and were subsequently adapted to other domains such as image and video
processing [36, 37, 38], and Natural Language Processing tasks [39, 40]. A Siamese network is
composed of two identical twin networks that share weights. Such networks pass their output
to a similarity module, which computes a “distance” between the two inputs. The distance is
compared to the given target (i.e., whether or not the pair is similar), the loss is calculated, and
the weights are then adjusted.</p>
        <p>1https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2</p>
        <p>[Figure 1: The Delay Neural Network. 30-dimensional MFCC speech feature frames x1, …, xt (25 ms frames) pass through frame-level layers and a statistics pooling layer; segment-level layers a and b follow, and the output layer computes Pr(spkri | x1, …, xt).]</p>
        <p>Several loss functions can be used to train Siamese networks. In this work we used the triplet
loss function. During the training, instead of taking two inputs, this function takes three inputs:
the anchor, the positive, and the negative. The anchor is the reference input, the positive is an
input that is in the same class as the anchor, while the negative is an input with a diferent class
from the anchor. The idea is to maximize (resp. minimize) the distance between the anchor and
the negative (resp. positive). Formally, the triplet loss can be defined as:</p>
        <p>L = max(d(a, p) − d(a, n) + m, 0)
where d is some distance function, a, p, and n denote the anchor, the positive, and the negative,
respectively, and m is a constant named margin; the constant m is used
to decrease the probability that L be 0. The details of the architecture and building blocks of
the twin networks used in this work are provided in Section 4.</p>
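        <p>As a concrete sketch of this loss, assuming a Euclidean distance d and toy 2-dimensional embeddings (the actual distance function and margin used by the system are not fixed here):</p>

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(d(a, p) - d(a, n) + m, 0): the loss is zero once the
    negative is at least `margin` farther from the anchor than the
    positive."""
    return max(euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin, 0.0)

a = [0.0, 0.0]   # anchor (toy 2-d embedding)
p = [0.1, 0.0]   # positive: same class, close to the anchor
n = [1.0, 0.0]   # negative: different class, far from the anchor
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

When the positive drifts away or the negative drifts closer, the loss becomes positive and pushes the twin networks to restore the margin.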
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. A fusion embedding voice authentication system</title>
      <sec id="sec-4-1">
        <title>4.1. Overview</title>
        <p>In order to perform voice authentication, users of the system have to be registered. During the
registration phase, each user is given a unique username and an enrollment audio file uusername
is saved into a database. The enrollment audio file is obtained by recording the voice audio
signal of the user while the user is reading a sufficiently long (random) text. In order to be
recognized, a user provides a username and is asked to speak; the speech is recorded into a
test audio file utest. The goal of the voice authentication system is that of telling whether utest and
uusername are from the same speaker. Establishing whether the audio test file comes from the
specified user will be denoted as utest ∼ uusername.</p>
        <p>[Figure 2: Overview of the system. Registration: the user inserts a «username», records the enrollment voice signal uusername, and the server saves username and flac(uusername). Login: the user provides a username, from which uusername is found, and records a test voice signal utest; the front-end (steps 1, 2) computes the fusion embeddings fe(flac(uusername)) and fe(flac(utest)); the back-end (step 3) feeds fe(uusername) and fe(utest) to the Siamese NN S to decide whether uusername and utest are from the same speaker, i.e., whether uusername ∼ utest.]</p>
        <p>
In order to solve the problem, we compute both the i-vector and the x-vector of the audio file,
and from them we build a new vector, which we call the fusion embedding vector. This vector is then
used with a Siamese network to tell whether utest ∼ uusername. The Siamese network is trained
on the database of registered users.</p>
        <p>Figure 2 summarizes the overall system, which consists of two components: (i) a front-end
that, given a pair of audio files, the enrollment audio file uusername and the test audio file utest,
computes for each of them the fusion embeddings fe(uusername) and fe(utest); (ii) a back-end, which
uses a Siamese Neural Network S to tell whether utest ∼ uusername, taking as input fe(uusername)
and fe(utest). In the following we provide details about how the above is done.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Front-end: fusion speaker embedding generations</title>
        <p>In this section, we describe the front-end of the proposed system, focusing on the computation of
the fusion speaker vectors.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Speaker embeddings: i-vector and x-vector extraction</title>
          <p>We used the nnet3 neural network library in the Kaldi Speech Recognition Toolkit to extract
from an audio file u the corresponding i-vector iv(u) and x-vector xv(u). This library is one
of the most used for the voice recognition problem. The cepstral features for the i-vectors are
extracted using a 25 ms Hamming window; every 10 ms, 24 Mel Frequency Cepstral Coefficients
were calculated; this 24-dimensional feature vector was subjected to feature warping [41]
using a 3-second sliding window; delta and delta-delta coefficients were then calculated using
a 5-frame window to produce 60-dimensional feature vectors; then, an energy-based speech
activity detection system selects features corresponding to speech frames; finally, using
gender-dependent Universal Background Models containing 2048 Gaussians and two gender-dependent
joint factor analysis configurations, the diagonal matrix is added in order to have speaker and
common factors, obtaining one i-vector of 400 total factors. The nnet3 library allows us to use
vectors of size 100, 200, and 400. We experimented with all three sizes and the best results were
obtained with vectors of size 400.</p>
          <p>For the x-vectors, the features are 23-dimensional filterbanks with a frame-length of 25 ms,
mean-normalized over a sliding window of up to 3 seconds; the speech activity detection used for the
i-vectors filters out nonspeech frames; the Deep Neural Network configuration is outlined in
Figure 1; it is trained to classify the N speakers in the training data; after training, x-vectors of
size 512 are extracted.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Merging i-vector and x-vector: fusion speaker embedding</title>
          <p>The proposed voice recognition system is based on a fusion embedding vector fe(u) obtained by
merging the i-vectors with the x-vectors. More in detail, let iv(u) and xv(u) be the i-vector
and the x-vector, respectively, for a given audio file u. By construction, we have |iv(u)| = 400 and
|xv(u)| = 512. The idea is to define a new vector fe(u) obtained as a sort of “mean vector”
between iv(u) and xv(u). However, the difference in the sizes, 400 for the i-vectors and 512 for
the x-vectors, is a problem. There are two natural solutions: adapt the x-vectors to the shape of
the i-vectors by cutting 112 entries, or adapt the i-vectors to the x-vectors by somehow adding 112 entries.
Although the first solution seems more immediate and easier to apply, it actually has two serious
drawbacks: first, it would distort the final mean, given that removing 112 entries would
roughly mean the loss of about 20% of the information; second, the Deep Neural Network provided by
Kaldi and used for our experiments is configured for the native x-vector size, i.e., 512 entries,
and so cutting the x-vectors to obtain vectors of 400 entries would have involved
lengthy and expensive modifications to the entire network structure.</p>
          <p>Hence we opt for the second alternative, that is, adding 112 entries to the i-vector. The question
that remains is how to add these 112 missing entries. We have tried the following alternatives,
which assign to each new entry: (i) the average value of the i-vector elements, (ii) the most
recurring value within the i-vector, (iii) the value 0 (zero-padding). From empirical observations
during the experiments conducted, we have found that zero-padding is the strategy that generally
achieves the best results. We denote with iv′(u) the vector obtained by zero-padding iv(u)
with 112 additional entries.</p>
          <p>Notice that, having added zero entries, when calculating the mean vector fe(u) between
iv′(u) and xv(u), the first 400 entries correspond to an effective mean between the two
vectors, while the remaining 112 entries are dominated by the value of the x-vector. Since by
doing this we somehow give more weight to the x-vector, we use a weighted mean in order to
rebalance the contribution of the i-vector. Formally:</p>
          <p>
            fe(u) = Wi * iv′(u) + Wx * xv(u)
where Wi, Wx ∈ [0, 1] are the weights for iv′(u) and xv(u), respectively.
          </p>
          <p>From empirical observations during the experiments carried out, we have found that the best
results are obtained by setting Wi = 0.6 and Wx = 0.4. Thus, we build the fusion embedding
vector as</p>
          <p>fe(u)[j] = 0.6 * iv′(u)[j] + 0.4 * xv(u)[j]
for each j = 0, . . . , 511.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Back-end</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Siamese architecture</title>
          <p>The proposed back-end uses a Siamese network S which, given a pair of fusion embeddings
fe(u) and fe(u′), computes a similarity score S(fe(u), fe(u′)). Then, to verify whether u
and u′ are from the same speaker, the following rule is used by the system:</p>
          <p>
            S(fe(u), fe(u′)) ≥ τ =⇒ u ∼ u′
where τ ∈ [0, 1] is the system threshold empirically estimated during the training of the network.
          </p>
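          <p>The decision rule can be sketched as below. The cosine-based score is only a placeholder for the trained Siamese score S, and the threshold value 0.15 is the one reported as best in Section 5.</p>

```python
import math

THRESHOLD = 0.15  # tau: the best value found in the experiments (Section 5)

def similarity(fe_u, fe_v):
    """Placeholder for the Siamese score S(fe(u), fe(v)): a cosine
    similarity rescaled into [0, 1]. The real system uses the network."""
    dot = sum(a * b for a, b in zip(fe_u, fe_v))
    norm_u = math.sqrt(sum(a * a for a in fe_u))
    norm_v = math.sqrt(sum(b * b for b in fe_v))
    return (dot / (norm_u * norm_v) + 1.0) / 2.0

def same_speaker(fe_u, fe_v, tau=THRESHOLD):
    # S(fe(u), fe(v)) >= tau  =>  u ~ v
    return similarity(fe_u, fe_v) >= tau

print(same_speaker([1.0, 0.0], [1.0, 0.1]))  # True: nearly identical embeddings
```

Raising tau makes the system stricter (fewer false accepts, more false rejects); the experiments in Section 5 select tau on a grid of candidate values.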
          <p>In the following, we will provide details about the architecture and the training of S. We
remark that the architecture has been modified during several experiments carried out to find
the best setting. Figure 3 shows the one achieving the best performance. As explained in
Section 3, a Siamese network consists of two identical subnetworks; in the following we will
discuss the structure of such subnetworks.</p>
          <p>[Figure 3: architecture of each subnetwork, starting from the fusion embedding and a 512 × 1024 Linear layer.]</p>
          <p>As can be seen in Figure 3, each subnetwork starts with a Linear layer which takes as input
a fusion embedding, applies a linear transformation to the input data and then, through the
application of a ReLU layer, outputs the input directly if it is positive, otherwise, it outputs
zero. Such a technique is used to overcome the vanishing gradient problem, allowing models
to learn faster and perform better. Then, a Dropout layer, during the training phase, randomly
deactivates some of the elements (2% of the data taken as input), providing a series of advantages,
especially in the case where the dataset is small [42]. Next, a sequence of three Linear
layers (the first one with a ReLU activation function) expands the dimensionality of the initial
fusion embedding until reaching a vector of 2500 features, which represents the peculiarities
of the voice on which the network must learn to calculate the similarity.</p>
          <p>One of the most interesting advantages of using Siamese networks is the ability to adopt
the One-Shot Learning strategy, shown to be effective in identifying new classes based on one
(or only a few) examples [43]. The idea is to extract rules on previously seen classes, i.e., to
learn patterns and similarities instead of fitting the ML model to fixed classes, in order to be
able to classify previously unseen classes using one instance. This strategy is very helpful in
the scenario described in Section 4.1. Indeed, it allows us to define a system “calibrated” on a
significant initial set of speakers, i.e., with a back-end exploiting a Siamese network trained on
an initial set of voices provided by a “representative” sample of speakers; a new speaker can be
added to the system without having to retrain the network, but simply by saving a reference
enrollment audio signal, which will be used every time the user needs to be recognized.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. One-Shot Learning.</title>
          <p>The network S is trained using One-Shot learning. The performance on verifying a new speaker
without retraining S is evaluated by considering only the enrollment fusion embedding saved
during the registration phase (see Figure 2). Here, we provide details about the process of
establishing the proposed system based on One-Shot learning and about the methodology of
assessing performance for new speaker classes without retraining S. Let D = {C1, . . . , CN} be
the dataset of voice audio signals provided by N speakers (N classes), where each Ci is the class
containing the voice audio signals of the i-th user, for i = 1, . . . , N. In order to train the network
to recognize also speakers that were not included in the training phase, for each Ci, we train the
network by excluding the samples from Ci. More in detail, each of the remaining N − 1 classes
is split into two balanced subsets. Then, the first one is used to generate the training set pairs,
while the second one is used as the evaluation pool of instances. Then, we perform the training
process using the Triplet Loss function (see Section 3).</p>
          <p>For the evaluation, the instances of the excluded set Ci are split into two balanced subsets Li
and Ui, where Li represents the “labelled” samples for user i, while Ui represents the “unlabelled”
samples for user i, i.e., recognition attempts by user i. A set of evaluation pairs P is generated
as follows. Let D′ = {C1, . . ., Ci−1, Li, Ci+1, . . ., CN} be the evaluation set. For each ū ∈ Ui,
one element ej is randomly chosen from each Cj ∈ D′, obtaining the N pairs Pū = {(ū, e1),
. . ., (ū, ei−1), (ū, ei), (ū, ei+1), . . ., (ū, eN)}. Then, we set P = ⋃ū∈Ui Pū. Observe that
|P| = N × |Ui|. The similarity is calculated for the pairs in P and the classification is based on
the pair with the highest similarity (i.e., least distance). To determine the trade-off between the
number of labelled instances of the new speaker class and accuracy, the process is repeated |Ui|
times, i.e., for each instance in Ui. Majority voting is then applied to deduce the instance label;
the class that has been recognized most often is used as the instance label. Algorithm 1 reports the
pseudo-code for the training and evaluation of S described above.</p>
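          <p>The nearest-pair classification and majority voting described above can be sketched as follows. One-dimensional toy embeddings and an absolute-difference distance stand in for the fusion embeddings and the Siamese score, which are not reproduced here.</p>

```python
def classify(unlabelled, candidates, distance):
    """One-shot classification: pair the unlabelled instance with one
    representative per class and pick the class of the nearest one
    (highest similarity = least distance)."""
    return min(candidates, key=lambda c: distance(unlabelled, candidates[c]))

def majority_vote(labels):
    """Deduce the instance label as the class recognized most often."""
    return max(set(labels), key=labels.count)

# Toy 1-d embeddings: class "A" clusters near 0.0, class "B" near 1.0
dist = lambda x, y: abs(x - y)
candidates = {"A": 0.05, "B": 0.95}          # one chosen element per class
unlabelled_instances = [0.1, 0.2, 0.8]       # attempts by user "A", one noisy
votes = [classify(u, candidates, dist) for u in unlabelled_instances]
print(majority_vote(votes))  # "A": two of the three attempts are closer to A
```

Even though the noisy attempt (0.8) is misclassified as "B", the majority vote over all |Ui| attempts still recovers the correct label.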
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Preliminary experiments</title>
      <p>Data collection. To assess the performance of the proposed voice recognition system, we
collected voice samples from 100 speakers. Specifically, we have recruited 100 students at
the University of Salerno. The sample was 65% male and 35% female, with a mean age of 22.
Participants were informed that the information provided would remain confidential. To collect data,
each student was led to a room dedicated to the registration phase. Each speaker was asked to
read one short text for a recording duration of 30 seconds. Then, we split the recorded audio in
several overlapping 10-second fragments using a sliding window of 2 seconds, for a total of
11 10-second fragments. The voices were registered in wav format and then compressed using the
flac compression algorithm to save space on the server.</p>
      <p>[Algorithm 1: pseudo-code for the training and one-shot evaluation of S: for each speaker, evaluation pairs are generated from the other speakers, each test instance is classified by Voting, the count of correct classifications yields an accuracy, and ⟨S, Average(one-shot-accuracy)⟩ is returned.]</p>
      <p>At the end of the registration phase, we have collected 11 × 100 = 1100 voice audio files, each
lasting 10 seconds. Then, for each of them we used the Kaldi library to extract the corresponding
i-vector and x-vector. Finally, for each audio file we generated the fusion embedding as described
in Section 4.2.</p>
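      <p>The fragmentation used during data collection can be sketched as follows, with the values described above: 30-second recordings, a 10-second window, and a 2-second step.</p>

```python
def split_overlapping(duration_s, window_s=10, step_s=2):
    """Return (start, end) times, in seconds, of the overlapping
    fragments obtained by sliding a window_s-second window with a
    step_s-second step over a recording of duration_s seconds."""
    starts = range(0, duration_s - window_s + 1, step_s)
    return [(s, s + window_s) for s in starts]

fragments = split_overlapping(30)
print(len(fragments))               # 11 fragments per 30-second recording
print(fragments[0], fragments[-1])  # (0, 10) (20, 30)
```

With 100 speakers this yields the 11 × 100 = 1100 audio files reported above.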
      <p>Training and test results. We have trained the network S , with several configurations
for the parameters. Specifically, we tried batchsize ∈ {16, 32, 64, 128}, learning_rate
∈ {0.1, 0.001, 0.0001}, #_epochs ∈ {2, 4, 6, . . ., 18, 20}, and threshold ∈ {0.10, 0.15,
0.20, . . ., 0.90, 0.95, 1.00}. The best result, that is, 99% accuracy, was obtained by setting
batchsize = 32, learning_rate = 0.001, #_epochs = 4, and threshold = 0.15. Thus,
the network S trained with such a configuration has been used in the following Testing Phase.
For the Testing phase, we have recruited 20 people not involved in the data collection phase.
The sample was 55% male and 45% female, with a mean age of 21. As done during the data
collection phase, participants were informed that the information provided would remain confidential.
Each speaker was asked to provide a username and to read a text for a duration of 10 seconds.
The server registered the voice in wav format, and compressed it using the flac compression
algorithm.</p>
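      <p>The hyperparameter sweep reported above can be sketched as an exhaustive grid search; here train_and_evaluate is a hypothetical stand-in for training the Siamese network with one configuration and returning its accuracy.</p>

```python
from itertools import product

# Parameter grids exactly as reported in the paper.
batch_sizes = [16, 32, 64, 128]
learning_rates = [0.1, 0.001, 0.0001]
epochs = list(range(2, 21, 2))                               # 2, 4, ..., 20
thresholds = [round(0.10 + 0.05 * k, 2) for k in range(19)]  # 0.10, 0.15, ..., 1.00

def grid_search(train_and_evaluate):
    """Try every configuration and keep the one with the highest accuracy."""
    best = (-1.0, None)
    for bs, lr, ep, th in product(batch_sizes, learning_rates, epochs, thresholds):
        acc = train_and_evaluate(bs, lr, ep, th)
        if acc > best[0]:
            best = (acc, (bs, lr, ep, th))
    return best
```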
      <p>Several experiments have been conducted in order to evaluate the recognition performance of
the system in three different scenarios:
• same-device-scenario: the speaker uses the same device both for the registration
and for the recognition.
• different-device-scenario: registration is made with one device, while recognition
is attempted with a different device.
• attack-scenario: a speaker claims to be a different speaker (that is, provides the
other speaker's username as username).</p>
      <p>The evaluation has been carried out as follows. Each speaker was requested to try each of
the three scenarios defined above 10 times. That is, the user attempts the recognition 10 times
using his own username and recording the voice with the same device that was used in the
registration phase; 10 times with his own username but recording the voice through a different
device; and finally 10 times with the username of another user, without restriction on the device.</p>
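      <p>The bookkeeping for this protocol can be sketched as follows. The scoring rule is an assumption on our part: for the genuine scenarios (S1, S2) an accepted attempt counts as correct, while for the attack scenario (S3) a rejected attempt counts as correct.</p>

```python
def scenario_scores(attempts):
    """attempts maps a scenario name to a list of booleans
    (True = the system accepted the claimed identity)."""
    scores = {}
    for scenario, accepted in attempts.items():
        if scenario == "S3":                       # impostor claims: reject = success
            correct = sum(not a for a in accepted)
        else:                                      # genuine claims: accept = success
            correct = sum(accepted)
        scores[scenario] = correct / len(accepted)
    return scores

# Hypothetical tallies for one speaker's 10 attempts per scenario.
example = {"S1": [True] * 9 + [False], "S3": [False] * 8 + [True] * 2}
```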
      <p>In addition to the similarity computed using S, we have computed the similarity using the
PLDA approach and the cosine distance. Table 1 reports the results obtained. For compactness,
we indicate with S1 the same-device-scenario, S2 the different-device-scenario,
and S3 the attack-scenario. In general, as we can see, when the speaker attempts a
recognition through the same device used during the registration phase (same-device-scenario),
the use of x-vectors tends to give better results than i-vectors. Instead, when
the speaker attempts a recognition using a device other than the one used during
the registration (different-device-scenario), or tries to be recognized as another user
(attack-scenario), the use of i-vectors tends to give better results than x-vectors.
The results obtained using fusion embeddings with Siamese networks are better than those
obtained with the other configurations.</p>
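      <p>Of the three back-ends compared above, the cosine baseline is simple enough to sketch: two embeddings are compared by the cosine of their angle, and the claimed identity is accepted when the similarity exceeds a tuned threshold (the 0.15 value reported earlier was found for the Siamese back-end; a cosine threshold would be tuned separately).</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def accept(enrolled, probe, threshold):
    """Accept the identity claim when the similarity clears the threshold."""
    return cosine_similarity(enrolled, probe) >= threshold
```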
      <p>Table 1 (recoverable portion): accuracy of each back-end in scenario S1 with i-vector
embeddings: cosine 0.80, PLDA 0.82, Siamese 0.84.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper we have described a study whose goal is to investigate the advantages
deriving from the combination of i-vectors and x-vectors. The results obtained show that the
approach combining the two types of embedding provides better recognition accuracy. The study
is only a preliminary investigation, and research in this direction can be extended in several
ways. First, other more complex embedding merging techniques, for example based on DL
models, can be investigated. Second, the size of the dataset used to train the Siamese network
and to carry out our experiments is small; increasing the size of the dataset, and diversifying
the type of speakers, would provide a better evaluation of the performance, also in terms of
scalability in real-world scenarios. Finally, it is worth investigating improvements of the model
used in the back-end, for example, defining an image-based combination of embeddings and
exploring the potential of a Siamese network based on convolutional neural networks.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by project SERICS (PE00000014) under the NRRP MUR
program funded by the EU - NGEU.</p>
      <p>[12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust dnn
embeddings for speaker recognition, in: 2018 IEEE international conference on acoustics,
speech and signal processing (ICASSP), IEEE, 2018, pp. 5329–5333.
[13] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, Deep neural
networks for small footprint text-dependent speaker verification, in: 2014 IEEE international
conference on acoustics, speech and signal processing (ICASSP), IEEE, 2014, pp. 4052–4056.
[14] M. McLaren, L. Ferrer, D. Castan, A. Lawson, The speakers in the wild (sitw) speaker
recognition database., in: Interspeech, 2016, pp. 818–822.
[15] A. Nagrani, J. S. Chung, A. Zisserman, Voxceleb: a large-scale speaker identification
dataset, arXiv preprint arXiv:1706.08612 (2017).
[16] M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, E. Yilmaz, How to train your speaker
embeddings extractor (2018).
[17] S. Bell, K. Bala, Learning visual similarity for product design with convolutional neural
networks, ACM transactions on graphics (TOG) 34 (2015) 1–10.
[18] P. Kenny, Bayesian speaker verification with heavy-tailed priors, Proc. Odyssey 2010
(2010).
[19] W. Xie, A. Nagrani, J. S. Chung, A. Zisserman, Utterance-level aggregation for speaker
recognition in the wild, in: ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 5791–5795.
[20] W. Lin, M.-W. Mak, L. Yi, Learning mixture representation for deep speaker embedding
using attention., in: Odyssey, 2020, pp. 210–214.
[21] B. Desplanques, J. Thienpondt, K. Demuynck, Ecapa-tdnn: Emphasized channel
attention, propagation and aggregation in tdnn based speaker verification, arXiv preprint
arXiv:2005.07143 (2020).
[22] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, L. Burget, End-to-end dnn based
speaker recognition inspired by i-vector and plda, in: 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 4874–4878.
[23] C. Zhang, K. Koishida, End-to-end text-independent speaker verification with triplet loss
on short utterances., in: Interspeech, 2017, pp. 1487–1491.
[24] L. Wan, Q. Wang, A. Papir, I. L. Moreno, Generalized end-to-end loss for speaker
verification, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, 2018, pp. 4879–4883.
[25] F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification, IEEE
Signal Processing Letters 25 (2018) 926–930.
[26] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face
recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2019, pp. 4690–4699.
[27] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, I. Han, In
defence of metric learning for speaker recognition, arXiv preprint arXiv:2003.11982 (2020).
[28] J. Wang, K.-C. Wang, M. T. Law, F. Rudzicz, M. Brudno, Centroid-based deep metric
learning for speaker recognition, in: ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 3652–3656.
[29] C. Zhang, K. Koishida, J. H. Hansen, Text-independent speaker verification based on triplet
convolutional neural network embeddings, IEEE/ACM Transactions on Audio, Speech,
and Language Processing 26 (2018) 1633–1644.
[30] S. Soleymani, A. Dabouei, S. M. Iranmanesh, H. Kazemi, J. Dawson, N. M. Nasrabadi,
Prosodic-enhanced siamese convolutional neural networks for cross-device
text-independent speaker verification, in: 2018 IEEE 9th international conference on biometrics
theory, applications and systems (BTAS), IEEE, 2018, pp. 1–7.
[31] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Speaker and session variability in
gmm-based speaker verification, IEEE Transactions on Audio, Speech, and Language
Processing 15 (2007) 1448–1460.
[32] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability
in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16
(2008) 980–988.
[33] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, P. Dumouchel, Support vector
machines versus fast scoring in the low-dimensional total variability space for speaker
verification, in: Tenth Annual conference of the international speech communication
association, 2009.
[34] V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient
modeling of long temporal contexts, in: Sixteenth annual conference of the international
speech communication association, 2015.
[35] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a
"siamese" time delay neural network, Advances in neural information processing systems
6 (1993).
[36] G. Koch, R. Zemel, R. Salakhutdinov, et al., Siamese neural networks for one-shot image
recognition, in: ICML deep learning workshop, volume 2, Lille, 2015, p. 0.
[37] Y. Yao, X. Wu, W. Zuo, D. Zhang, Learning siamese network with top-down modulation
for visual tracking, in: International Conference on Intelligent Science and Big Data
Engineering, Springer, 2018, pp. 378–388.
[38] R. R. Varior, M. Haloi, G. Wang, Gated siamese convolutional neural network architecture
for human re-identification, in: European conference on computer vision, Springer, 2016,
pp. 791–808.
[39] Y. Benajiba, J. Sun, Y. Zhang, L. Jiang, Z. Weng, O. Biran, Siamese networks for semantic
pattern similarity, in: 2019 IEEE 13th International Conference on Semantic Computing
(ICSC), IEEE, 2019, pp. 191–194.
[40] W. Zhu, T. Yao, J. Ni, B. Wei, Z. Lu, Dependency-based siamese long short-term memory
network for learning sentence representations, PloS one 13 (2018) e0193919.
[41] J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in:
Proceedings of 2001 A Speaker Odyssey: The Speaker Recognition Workshop, European Speech
Communication Association, 2001, pp. 213–218.
[42] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving
neural networks by preventing co-adaptation of feature detectors, 2012. URL: https://arxiv.
org/abs/1207.0580. doi:10.48550/ARXIV.1207.0580.
[43] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot
learning, Advances in neural information processing systems 29 (2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An overview of text-independent speaker recognition: From features to supervectors</article-title>
          ,
          <source>Speech communication 52</source>
          (
          <year>2010</year>
          )
          <fpage>12</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bonastre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Matrouf</surname>
          </string-name>
          ,
          <article-title>Forensic speaker recognition</article-title>
          ,
          <source>IEEE Signal Processing Magazine</source>
          <volume>26</volume>
          (
          <year>2009</year>
          )
          <fpage>95</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Champod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meuwly</surname>
          </string-name>
          ,
          <article-title>The inference of identity in forensic speaker recognition</article-title>
          ,
          <source>Speech communication 31</source>
          (
          <year>2000</year>
          )
          <fpage>193</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Togneri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pullella</surname>
          </string-name>
          ,
          <article-title>An overview of speaker identification: Accuracy and robustness issues</article-title>
          ,
          <source>IEEE circuits and systems magazine 11</source>
          (
          <year>2011</year>
          )
          <fpage>23</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pruzansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Mathews</surname>
          </string-name>
          ,
          <article-title>Talker-recognition procedure based on analysis of variance</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>36</volume>
          (
          <year>1964</year>
          )
          <fpage>2041</fpage>
          -
          <lpage>2047</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Quatieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <article-title>Speaker verification using adapted gaussian mixture models</article-title>
          ,
          <source>Digital signal processing 10</source>
          (
          <year>2000</year>
          )
          <fpage>19</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Sturim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <article-title>Support vector machines using gmm supervectors for speaker verification</article-title>
          ,
          <source>IEEE signal processing letters</source>
          <volume>13</volume>
          (
          <year>2006</year>
          )
          <fpage>308</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          , G. Boulianne,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          ,
          <article-title>Joint factor analysis versus eigenchannels in speaker recognition</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>15</volume>
          (
          <year>2007</year>
          )
          <fpage>1435</fpage>
          -
          <lpage>1447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          ,
          <article-title>Front-end factor analysis for speaker verification</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>19</volume>
          (
          <year>2010</year>
          )
          <fpage>788</fpage>
          -
          <lpage>798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          ,
          <article-title>A novel scheme for speaker recognition using a phonetically-aware deep neural network, in: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP)</article-title>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>1695</fpage>
          -
          <lpage>1699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          ,
          <article-title>Deep neural network embeddings for text-independent speaker verification</article-title>
          .,
          <source>in: Interspeech</source>
          , volume
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>999</fpage>
          -
          <lpage>1003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          , X-vectors: Robust dnn embeddings for speaker recognition, in: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2018, pp. 5329–5333.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>