<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text-independent Speaker Verification Using Convolutional Deep Belief Network and Gaussian Mixture Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Rakhmanenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Meshcheryakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems Security, Tomsk State University of Control Systems and Radioelectronics, Tomsk</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>118</fpage>
      <lpage>121</lpage>
      <abstract>
        <p>There has been much interest in new deep learning approaches to representing and extracting high-level features for audio processing. In this paper, a convolutional deep belief network (CDBN) is used to generate new speech features for text-independent speaker verification. The structure and parameters of the CDBN are described, and new high-level speech features are extracted using the proposed method. The relevance of speaker verification systems to mobile authentication is considered. The Gaussian mixture model and universal background model (GMM-UBM) speaker verification system used in the experiments is described. Speaker verification accuracy using the extracted features is evaluated on a set of 50 speakers, and the results are presented. Different layers and combinations of layers of the CDBN are used as features for text-independent speaker verification. The high-level features extracted by the CDBN are illustrated and analyzed, and the reasons for insufficient verification accuracy are discussed. The high-level features extracted by the third layer could be used for gender recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>speaker verification</kwd>
        <kwd>speech features</kwd>
        <kwd>gmm-ubm system</kwd>
        <kwd>speech processing</kwd>
        <kwd>cdbn</kwd>
        <kwd>feature extraction</kwd>
        <kwd>deep learning</kwd>
        <kwd>neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Practical methods used in user authentication systems are
actively shifting from classical password-based methods to
methods based on human biometrics. The voice, unlike the
retina or fingerprints, is considered less reliable for person
identification or verification. In some cases, however, speaker
verification by voice is required. Particular attention should be
devoted to speaker verification via mobile devices, as they have
the microphone and the computing capabilities necessary for
speaker verification. Significantly, speaker verification on mobile
devices can be combined with other verification methods,
which allows taking advantage of multi-factor authentication.</p>
      <p>The application field of currently developed voice
authentication systems includes multi-factor (biometric)
authentication and access restriction systems, banking account
management systems that use voice biometrics to give a
speaker access to his banking account, and national security
and anti-terrorism tasks. The use of speaker recognition systems
that have even a small probability of error in such sensitive
application areas could be very dangerous.</p>
      <p>This work was supported by the Ministry of Education and Science of the
Russian Federation within 1.3 federal program «Research and development in
priority areas of scientific-technological complex of Russia for 2014-2020»
(grant agreement № 14.577.21.0172 of October 27, 2015; identifier
RFMEFI57715X0172).</p>
      <p>The equal error rate (EER) is one of the most common
speaker verification accuracy measures in use today. It is
used for both text-dependent and text-independent automatic
voice authentication systems. At present, the best speaker
recognition systems are characterized by EER values of 3-5%.
This accuracy is insufficient for modern speaker verification
systems, because even a small probability of false acceptance is
critical: if many speakers work with such a system, mistakes
will certainly occur, and such mistakes are unacceptable in
systems granting access rights to confidential data or banking
accounts.</p>
      <p>
        Generally, low-level speech features are used for speaker
verification, for example mel-frequency cepstral coefficients
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1-4</xref>
        ], linear prediction cepstral coefficients [5] and others.
However, attempts are being made to use higher-level features,
for example by extracting bottleneck features [
        <xref ref-type="bibr" rid="ref5 ref6">6-8</xref>
        ] or by constructing i-vectors
based on a low-level representation [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">9-13</xref>
        ].
      </p>
      <p>Given how the brain processes incoming visual and
audio signals, it can be assumed that the use of such high-level
features will improve the accuracy of speaker verification
systems. In this paper, a convolutional deep belief network
(CDBN) is used to extract higher-level features, and a Gaussian
mixture model with a universal background model (GMM-UBM)
is used for speaker verification.</p>
    </sec>
    <sec id="sec-2">
      <title>SPEAKER VERIFICATION MODEL</title>
      <sec id="sec-2-1">
        <title>A. Gaussian Mixture Model</title>
        <p>
          A Gaussian Mixture Model (GMM) is a parametric
probability density function represented as a weighted sum of
Gaussian component densities [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. A GMM with M component
Gaussian densities can be presented by the equation
p(x | λ) = Σi=1..M wi g(x | µi, Σi)
(1)
where x is a D-dimensional continuous-valued data vector (i.e.
measurements or features), wi, i = 1,...,M, are the mixture
weights, and g(x | µi, Σi), i = 1,...,M, are the component Gaussian
densities with mean vector µi and covariance matrix Σi. The
complete GMM is parameterized by the mean vectors,
covariance matrices and mixture weights of all component
densities. For the speaker identification task, each speaker is
represented by his Gaussian mixture λ, which can be written as
λ = {wi, µi, Σi}, i = 1,...,M
(2)
        </p>
        <p>
          There are two reasons for using Gaussian mixture densities
as a representation of speaker identity [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The first reason is
the intuitive notion that the individual component densities of
the GMM may model some underlying set of acoustic classes,
reflecting some general speaker-dependent vocal tract
configurations. The second reason is the empirical observation
that a linear combination of Gaussian basis functions is capable
of representing a large class of sample distributions. A GMM
can form smooth approximations to arbitrarily-shaped
densities.
        </p>
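<p>As an illustration of eq. (1), the following sketch evaluates a GMM log-density in NumPy. The diagonal covariances and the toy two-component parameters are assumptions of this example, not models from the paper.</p>

```python
import numpy as np

def gmm_log_pdf(x, weights, means, covs):
    """Log of eq. (1): p(x | lambda) = sum_i w_i g(x | mu_i, Sigma_i),
    with diagonal covariances for simplicity (an assumption of this sketch)."""
    x = np.asarray(x, dtype=float)
    log_components = []
    for w, mu, var in zip(weights, means, covs):
        mu, var = np.asarray(mu, dtype=float), np.asarray(var, dtype=float)
        # log of a diagonal-covariance Gaussian density g(x | mu_i, Sigma_i)
        log_g = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_components.append(np.log(w) + log_g)
    # log-sum-exp over mixture components for numerical stability
    m = max(log_components)
    return m + np.log(sum(np.exp(c - m) for c in log_components))

# Toy two-component 2-D mixture: lambda = {w_i, mu_i, Sigma_i}
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
covs = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_log_pdf([0.0, 0.0], weights, means, covs))
```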
      </sec>
      <sec id="sec-2-2">
        <title>B. Universal Background Model</title>
        <p>
          The Universal Background Model (UBM) is a GMM trained on
a large set of speech samples taken from a large population of
speakers expected during recognition. Given the data to train
a UBM, there are many approaches that can be used to obtain
the final model. The simplest is to pool all the data and train the
UBM via the EM algorithm. One should be careful that the
pooled data are balanced over the subpopulations within the
data. For example, when using gender-independent data, one
should be sure there is a balance of male and female speech;
otherwise, the final model will be biased toward the dominant
subpopulation. The same argument can be made for other
subpopulations, such as speech from different microphones.
Another approach is to train individual UBMs over the
subpopulations in the data, such as one for male and one for
female speech, and then pool the subpopulation models
together [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>In this paper, the parameters of the UBM are trained using the
EM algorithm, and a form of Bayesian adaptation is used for
training speaker models. The number of mixtures used is 256, as
the EER does not decrease on a small speaker set when larger
mixture numbers are used. Speaker models are derived by
MAP adaptation, where only the means are adapted with relevance
factor r = 10. The GMM-UBM system described in this section is
based on the MSR Identity Toolbox.</p>
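<p>The MAP mean-adaptation step described above can be sketched as follows. This is a simplified NumPy version of Reynolds-style adaptation with diagonal covariances and a toy 1-D UBM; it is not the MSR Identity Toolbox implementation.</p>

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_covs, X, r=10.0):
    """MAP-adapt only the UBM means to enrollment frames X,
    with relevance factor r (a simplified sketch)."""
    X = np.asarray(X, dtype=float)                  # (T, D) feature frames
    M = len(ubm_weights)
    # E-step: log responsibility of each mixture component for each frame
    log_resp = np.empty((len(X), M))
    for i in range(M):
        var = ubm_covs[i]
        log_g = -0.5 * np.sum(np.log(2 * np.pi * var)
                              + (X - ubm_means[i]) ** 2 / var, axis=1)
        log_resp[:, i] = np.log(ubm_weights[i]) + log_g
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)
    # Sufficient statistics: soft counts n_i and first moments E_i[x]
    n = resp.sum(axis=0)
    Ex = resp.T @ X / np.maximum(n[:, None], 1e-10)
    # Data-dependent interpolation coefficient with relevance factor r
    alpha = n / (n + r)
    return alpha[:, None] * Ex + (1.0 - alpha[:, None]) * ubm_means

# Toy 1-D UBM with two components; speaker frames cluster near x = 1
ubm_w = np.array([0.5, 0.5])
ubm_mu = np.array([[0.0], [5.0]])
ubm_var = np.array([[1.0], [1.0]])
frames = np.full((100, 1), 1.0)
adapted = map_adapt_means(ubm_w, ubm_mu, ubm_var, frames)
# The component near the data moves toward 1.0; the other barely moves.
```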
      </sec>
    </sec>
    <sec id="sec-3">
      <title>CONVOLUTIONAL DEEP BELIEF NETWORK</title>
      <sec id="sec-3-1">
        <title>A. Convolutional Deep Belief Network</title>
        <p>
          The main difference between the convolutional deep belief
network [
          <xref ref-type="bibr" rid="ref12 ref13">14, 15</xref>
          ] and the usual deep belief network [
          <xref ref-type="bibr" rid="ref14">16</xref>
          ] is the
use of a convolutional restricted Boltzmann machine (CRBM)
[
          <xref ref-type="bibr" rid="ref12">14</xref>
          ] as a hidden network layer. The CRBM is similar to the
RBM (restricted Boltzmann machine), but the weights between
the hidden and visible layers are shared among all locations in
the hidden layer. The CRBM (Fig. 1) is a feature detector consisting
of three layers: the visible layer V, the detection layer H and
the pooling layer P. In the case of audio processing, the visible
units are real-valued and the hidden units are binary-valued.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>B. CDBN Structure</title>
        <p>
          The input layer consists of an NV × Ch dimensional
array of real-valued units, where NV is the number of windows
into which the audio signal is divided, and Ch is the number of
channels of the spectrum. To construct the hidden layer,
consider K filters of NW × Ch dimensional weights Wk (also
referred to as “bases”). The hidden layer consists of K groups
of NH × Ch dimensional arrays (where NH = NV − NW + 1) with
units in group k sharing the weights Wk. There is also a shared
bias bk for each group and a shared bias c for the visible units.
The energy function of the CRBM (3) can then be defined as
[
          <xref ref-type="bibr" rid="ref13">15</xref>
          ]:
E(v, h) = ½ Σi vi² − Σk Σj Σr h(k)j W(k)r v(j+r−1) − Σk bk Σj h(k)j − c Σi vi
(3)
        </p>
        <p>The detection and pooling layers both have K groups of
units, and each group of the pooling layer has NP x NP binary
units. For each k ϵ {1,…,K}, the pooling layer Pk shrinks the
representation of the detection layer Hk by a factor of C along
each dimension, where C is a small integer such as 2 or 3.</p>
        <p>
          The joint and conditional probability distributions are
defined as follows (4-6):
P(v, h) = (1/Z) exp(−E(v, h))
(4)
P(h(k)j = 1 | v) = σ((W̃(k) *v v)j + bk)
(5)
P(vi | h) = N((Σk W(k) *f h(k))i + c, 1)
(6)
where W̃(k) denotes the filter W(k) with its weights reversed, *v is a
“valid” convolution (5) and *f is a “full” convolution (6) [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ]. For an m-dimensional feature vector and an n-dimensional
filter, a “valid” convolution results in an (m−n+1)-dimensional
vector and a “full” convolution results in an (m+n−1)-dimensional
vector.
        </p>
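<p>The dimension rules for the two convolution types can be checked directly with NumPy's convolve modes. This is an illustrative sketch, not code from the paper; the filter width n = 6 mirrors the NW = 6 used later in the experiments.</p>

```python
import numpy as np

# "valid" vs "full" convolution sizes:
# for an m-dimensional vector and an n-dimensional filter,
# valid gives m - n + 1 values, full gives m + n - 1 values.
v = np.arange(10.0)          # m = 10 visible units
w = np.ones(6) / 6.0         # n = 6, matching the filter width NW = 6

valid = np.convolve(v, w, mode="valid")
full = np.convolve(v, w, mode="full")
print(valid.shape, full.shape)   # (5,) and (15,)
```

This is also why the detection layer has NH = NV − NW + 1 units: it is the output of a "valid" convolution of the visible layer with the filter.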
        <p>
          Since all units in one layer are conditionally independent
given the other layer, inference in the network can be
efficiently performed using block Gibbs sampling [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ].
        </p>
        <p>The use of an additional pooling layer reduces the amount
and detail of the data supplied to the next hidden layer, which
makes it possible to extract higher-level features from the data.
It also reduces the computational load on the following layers
and filters out random noise.</p>
        <p>A convolutional deep belief network is a composition of
simple convolutional restricted Boltzmann machines. This
allows the hidden layer of each CRBM to serve as the visible
layer for the next CRBM, so a fast layer-wise training
technique can be used to train a CDBN. To estimate the gradient,
the contrastive divergence approximation is applied to each
sub-network, beginning with the first pair of layers. The data
from the training set are fed to the visible layer of the first
CRBM, and each subsequent CRBM takes as input the output
of the previous CRBM’s hidden layer.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTAL EVALUATION</title>
      <sec id="sec-4-1">
        <title>A. Speech Corpus</title>
        <p>The experiments were conducted using a speech database
containing speech from 25 male and 25 female speakers. The
database includes speech samples of sentences from science
fiction stories. The total length of speech for each speaker is at
least 6 minutes, consisting of 50 speech segments of various
lengths. Each speaker was recorded using a medium-quality
microphone at an 8000 Hz sampling rate with a 16-bit sample
size.</p>
        <p>The 50-speaker set was divided, equally for male and female
speakers, into a UBM training set of 30 speakers and a
speakers’ training set of 20 speakers. For MAP adaptation of
the speakers’ models, 40 speech segments were taken from each
speaker; the remaining 10 utterances of each speaker were used
for testing the verification system. Overall, 4000 tests were done
for each feature set, with 10 positive (true-speaker) and 190
negative (imposter) tests per speaker.</p>
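<p>The trial counts above follow directly from the protocol: each of the 20 enrolled speakers' 10 test segments is scored against every enrolled model. A quick arithmetic check (plain bookkeeping, not code from the paper):</p>

```python
# Trial bookkeeping: 20 enrolled speakers, 10 held-out test segments each,
# every segment scored against all 20 speaker models.
speakers, test_segments = 20, 10
positive = speakers * test_segments                    # 10 true-speaker trials per speaker
negative = speakers * (speakers - 1) * test_segments   # 190 imposter trials per speaker
total = positive + negative
print(positive, negative, total)  # 200 3800 4000
```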
      </sec>
      <sec id="sec-4-2">
        <title>B. Experimental Evaluation</title>
        <p>After the training phase, which consists of UBM training and
adaptation of the speakers’ models, the test phase starts. For
each test speech segment, verification scores (log-likelihood
ratios) are calculated using the speaker GMM and UBM models
(7). Using different decision thresholds, the hypothesized
speaker model is accepted or rejected.</p>
        <p>Λ(X) = log p(X | λhyp) − log p(X | λubm)
(7)</p>
        <p>Two different verification metrics were used for evaluating
the speaker verification system: the EER and the minimum
detection cost function with SRE 2008 parameters (minDCF).</p>
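<p>The EER can be computed from the two score populations of eq. (7) by sweeping a decision threshold until false rejections and false acceptances balance. A minimal sketch; the Gaussian score distributions below are synthetic stand-ins, not the paper's data.</p>

```python
import numpy as np

def eer(target_scores, imposter_scores):
    """Equal error rate: sweep the decision threshold over all observed
    scores and return the point where false-rejection and false-acceptance
    rates are closest (a simple sketch, not NIST SRE tooling)."""
    thresholds = np.sort(np.concatenate([target_scores, imposter_scores]))
    rates = []
    for t in thresholds:
        frr = np.less(target_scores, t).mean()             # true speakers rejected
        far = np.greater_equal(imposter_scores, t).mean()  # imposters accepted
        rates.append((abs(frr - far), (frr + far) / 2.0))
    return min(rates)[1]

# Synthetic log-likelihood-ratio scores standing in for Lambda(X)
rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)
imposters = rng.normal(-2.0, 1.0, 1000)
print(eer(targets, imposters))
```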
        <p>In order to test speaker verification accuracy using the
CDBN, the network structure and parameters should be specified.
In this paper, a CDBN consisting of three connected layers was
used for the experimental evaluation. The first and second layers
consist of 300 bases, the third of 60. The input layer consists of
80 neurons (Ch = 80). A spectrogram of the speech is extracted
from the audio and PCA whitening is applied; the lower-dimensional
spectrogram is fed to the input layer. The data given to the
visible layer are selected by 20 ms windows with a 10 ms offset.
For each base in the hidden layers, the filter dimension NW = 6
and the convolution factor C = 3 was used. Parameters for the
first and second layers of the network were taken from [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ].
Parameters for the third layer were selected by the authors
independently. Using these parameters, the CDBN for audio
processing was trained.
        </p>
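<p>The front end described above (20 ms windows with a 10 ms offset at 8000 Hz, a magnitude spectrogram, then PCA whitening down to Ch = 80 components) can be sketched as follows. The dimensions follow the paper, but the code itself is an assumed illustration, not the authors' pipeline.</p>

```python
import numpy as np

def frames_20ms(signal, sr=8000):
    """Slice a signal into 20 ms windows with a 10 ms offset."""
    win, hop = int(0.020 * sr), int(0.010 * sr)    # 160 and 80 samples
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

def pca_whiten(S, n_components=80, eps=1e-8):
    """PCA-whiten rows of S: decorrelate and scale to unit variance."""
    S = S - S.mean(axis=0)
    cov = S.T @ S / len(S)
    eigval, eigvec = np.linalg.eigh(cov)
    idx = np.argsort(eigval)[::-1][:n_components]  # keep the top components
    W = eigvec[:, idx] / np.sqrt(eigval[idx] + eps)
    return S @ W

rng = np.random.default_rng(1)
audio = rng.standard_normal(8000)                  # one second of noise at 8 kHz
frames = frames_20ms(audio)                        # (99, 160) windows
spec = np.abs(np.fft.rfft(frames, axis=1))         # (99, 81) magnitude spectrum
whitened = pca_whiten(spec, n_components=80)       # (99, 80) CDBN input
print(whitened.shape)
```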
        <p>As a result of training the CDBN, three trained layers of the
network were obtained, the outputs of each of which could be
used as features for the GMM-UBM speaker verification system.
To assess the verification accuracy using the obtained features,
the CDBN layer outputs were fed to the Gaussian mixture
model, which was used as a classifier. A separate UBM was
also trained for the features of each layer.</p>
        <p>
          To test and compare speaker verification accuracy, a
GMM-UBM speaker verification system with parameters and
features from [
          <xref ref-type="bibr" rid="ref15">17</xref>
          ] was used. These features include 14
mel-frequency cepstral coefficients (MFCC) and a feature set
obtained using a greedy Add-del algorithm, comprising 13 MFCC,
10 delta MFCC, 2 double delta MFCC, the voicing probability,
1 linear prediction coefficient (LPC) and 1 line spectral pair (LSP).
        </p>
        <p>The results of the experimental evaluation are given in Table I.
Based on the results, it can be concluded that none of the
feature sets extracted by the CDBN gives higher speaker
verification accuracy than the standard feature set consisting
of 14 MFCC. The feature set obtained by the greedy Add-del
algorithm shows the best verification accuracy.</p>
        <p>In order to use information from different levels, combinations
of GMM classifiers over separate CDBN layers were also
evaluated. These combinations did not increase the verification
accuracy compared to classifiers using a single feature level.</p>
        <table-wrap id="table-1">
          <label>TABLE I.</label>
          <caption>
            <p>TEST ACCURACY FOR SPEAKER VERIFICATION USING DIFFERENT FEATURES</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>№</th>
                <th>Feature Set</th>
                <th>% EER</th>
                <th>minDCF × 100</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>CDBN L1</td><td>2.00</td><td>0.997</td></tr>
              <tr><td>2</td><td>CDBN L2</td><td>3.50</td><td>1.740</td></tr>
              <tr><td>3</td><td>CDBN L3</td><td>10.00</td><td>5.765</td></tr>
              <tr><td>4</td><td>CDBN L1 + CDBN L2</td><td>2.00</td><td>1.197</td></tr>
              <tr><td>5</td><td>CDBN L1 + CDBN L3</td><td>2.00</td><td>1.121</td></tr>
              <tr><td>6</td><td>CDBN L2 + CDBN L3</td><td>3.29</td><td>1.926</td></tr>
              <tr><td>7</td><td>CDBN L1 + CDBN L2 + CDBN L3</td><td>2.00</td><td>1.327</td></tr>
              <tr><td>8</td><td>MFCC</td><td>1.00</td><td>0.925</td></tr>
              <tr><td>9</td><td>Greedy Add-del</td><td>0.58</td><td>0.623</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">
        <title>C. Discussion</title>
        <p>There could be different reasons for the low accuracy of the
speaker verification system using CDBN features. Deep
learning methods work better with a large training dataset, so
the small amount of training speech is one possible reason for
the low accuracy of the verification system. Another possible
reason is the use of the GMM as a classifier, which may not be
able to exploit the extracted features fully.</p>
        <p>Nevertheless, attention should be given to the visual
representation of the speech signal and the CDBN neuron
activations. Fig. 2 shows spectrograms of the same phrase for a
male and a female speaker. A significant difference between the
male and female speaker saying the same phrase can be seen at
the third layer of the CDBN (Fig. 3). This fact could be
exploited for gender recognition, using the CDBN outputs as
features.</p>
        <p>CONCLUSION</p>
        <p>In this paper, a convolutional deep belief network was used to
generate new speech features for text-independent speaker
verification, and new high-level speech features were extracted
using the proposed method. A GMM-UBM speaker verification
system was used to assess speaker verification accuracy with
the extracted CDBN features. None of the feature sets extracted
by the CDBN gives higher verification accuracy than the
standard feature set consisting of 14 MFCC. Nevertheless,
speaker verification methods using the presented features could
be combined with methods using different speech features to
obtain better verification accuracy.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.F.</given-names>
            <surname>Quatieri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Dunn</surname>
          </string-name>
          , “
          <article-title>Speaker verification using adapted Gaussian mixture models,”</article-title>
          <source>Digit. Signal Process.</source>
          ,
          <year>2000</year>
          , vol.
          <volume>10</volume>
          , no.
          <issue>1-3</issue>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Thosar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Nirmal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.S.</given-names>
            <surname>Pande</surname>
          </string-name>
          , “
          <article-title>A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network,”</article-title>
          <source>Adv. in Pattern Recog. (ICAPR)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , “
          <article-title>Evaluation of MFCC for speaker verification on various windows,”</article-title>
          <source>Recent Adv. and Innov. in Engr. (ICRAIE)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Shaughnessy</surname>
          </string-name>
          , “
          <article-title>Multitaper MFCC and PLP features for speaker verification using i-vectors,”</article-title>
          <source>Speech Comm.</source>
          ,
          <year>2013</year>
          , vol.
          <volume>55</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>251</lpage>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kawakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          , “
          <article-title>Speaker Identification by Combining Various Vocal Tract and Vocal Source Features,”</article-title>
          <source>Int. Conf. on Text, Speech, and Dialogue</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>389</lpage>
          . [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lawson</surname>
          </string-name>
          , “
          <article-title>Exploring the role of phonetic bottleneck features for speaker and language recognition</article-title>
          ,
          <source>” Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>5575</fpage>
          -
          <lpage>5579</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          , “
          <article-title>A novel scheme for speaker recognition using a phonetically-aware deep neural network,”</article-title>
          <source>Acoustics, Speech and Sig. Process. (ICASSP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1695</fpage>
          -
          <lpage>1699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          , “
          <article-title>Deep neural network approaches to speaker and language recognition,”</article-title>
          <source>IEEE Sig. Process. Let.</source>
          ,
          <year>2015</year>
          , vol.
          <volume>22</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1671</fpage>
          -
          <lpage>1675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Stafylakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Alam</surname>
          </string-name>
          , “
          <article-title>Deep neural networks for extracting baum-welch statistics for speaker recognition</article-title>
          ,
          <source>” Proc. Odyssey</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>293</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Stafylakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kockmann</surname>
          </string-name>
          , “
          <article-title>Compensation for phonetic nuisance variability in speaker recognition using DNNs,”</article-title>
          <source>Odyssey: The Speaker and Lang. Recognition Workshop</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudashev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pekhovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonchik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          , “
          <article-title>Usage of DNN in speaker recognition: advantages and problems</article-title>
          ,”
          <source>Int. Symp. on Neural Networks</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , “
          <article-title>Time delay deep neural network-based universal background models for speaker recognition,”</article-title>
          <source>Aut. Speech Recognition and Understanding (ASRU)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ghahabi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernando</surname>
          </string-name>
          , “
          <article-title>Deep belief networks for i-vector based speaker recognition,”</article-title>
          <source>Acoustics, Speech and Sig. Process. (ICASSP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1700</fpage>
          -
          <lpage>1704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , “
          <article-title>Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations</article-title>
          ,
          <source>” Proc. of the 26th Annual Int. Conf. on Machine Learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>609</fpage>
          -
          <lpage>616</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Largman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , “
          <article-title>Unsupervised feature learning for audio classification using convolutional deep belief networks</article-title>
          ,
          <source>” Adv, in Neural Info. Process. Systems</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1096</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.W.</given-names>
            <surname>Teh</surname>
          </string-name>
          , “
          <article-title>A fast learning algorithm for deep belief nets</article-title>
          ,
          <source>” Neural Computation</source>
          ,
          <year>2006</year>
          , vol.
          <volume>18</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>1527</fpage>
          -
          <lpage>1554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Rakhmanenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.V.</given-names>
            <surname>Meshcheryakov</surname>
          </string-name>
          , “
          <article-title>Identification features analysis in speech data using GMM-UBM speaker verification system,”</article-title>
          <source>SPIIRAS Proc.,</source>
          <year>2017</year>
          , vol.
          <volume>52</volume>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>