<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improved Speech Synthesis using Generative Adversarial Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dineshraj Gunasekaran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautham V</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eoin Brophy</string-name>
          <email>eoin.brophy7g@mail.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Infant Research Centre, Cork University Maternity Hospital</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight SFI Research Centre for Data Analytics, Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificially generated audio and speech have been a major area of machine learning research in recent years. Generative Adversarial Networks (GANs) have been at the forefront of research progress in this domain. WaveGAN and SpecGAN are among the current state-of-the-art methods for speech synthesis that utilise generative modelling to produce synthetic speech signals. Of particular interest, SpecGAN uses spectrograms of speech as training data for the GAN. With currently available models such as WaveGAN, the structure and linguistic information of the audio are not captured, leading to synthesized audio with high noise. In this paper, we propose a method we call Mel-Spectrogram GAN (MSGAN) that instead uses the Mel-Spectrogram of the audio signal, a spectrogram reconstructed on the Mel-scale, which approximates the human auditory system's response more closely than narrow frequency bands. We demonstrate that this approach achieves a 23% higher Inception Score than WaveGAN. Further, we establish an improved Conditional MSGAN that quickly learns the data distribution of each class to produce better quality speech and further increases the score by 32%. These results suggest that our Conditional MSGAN architecture is a promising approach for improved speech synthesis using GANs.</p>
      </abstract>
      <kwd-group>
        <kwd>speech synthesis</kwd>
        <kwd>GAN</kwd>
        <kwd>neural network</kwd>
        <kwd>deep learning</kwd>
        <kwd>mel-spectrogram</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The primary motive of this study is to improve audio synthesis by training
Generative Adversarial Networks (GANs) more effectively and efficiently. Raw
time series speech signals like number narrations and word utterances are used
as input in this investigation, and this allows us to capture the phonetics and
intonations of the speech directly. Natural speech data that is more relevant in
real-world scenarios is used in this analysis to produce speech with a natural-sounding quality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the genres of speech synthesis, TTS (text to speech) is
one of the key areas where many effective algorithms have been established and
so now the focus is shifting towards voice conversion and style transfer [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], both
of which can be addressed through GAN approaches.
      </p>
      <p>
        One of the ways to obtain more natural sounding audio is by modelling the
probability distribution in a parametric and non-parametric way in the
training stage and relating the synthetic speech to these distributions. The existing
WaveGAN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is used as a benchmark and improved results have been obtained
using our proposed Mel-Spec GAN (MSGAN) and Conditional Mel-Spec GAN
(CMSGAN) as is further detailed in this report.
      </p>
      <p>
        In the scenario involving audio files, the conversion of the time series to spectrograms provides the input to these models. A Mel-Spectrogram is a type of spectrogram constructed by converting the frequencies onto a Mel-scale. This reduces the loss of information when translating an audio file into an image or pictorial representation. Humans are better at detecting differences between sounds at lower frequencies than between sounds at higher frequencies. The Mel-scale was constructed to represent equal distances in pitch that will sound equally distinct to listeners [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Using the Mel-Spectrogram, the Mel-Spec GAN architecture is implemented in this study to perform speech synthesis, building on the efficient GAN implementations seen in the domain of computer vision. An improved version of this proposed architecture is also implemented, using conditional labels that improve the quality of the synthetic speech.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Van den Oord et al., 2016 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a novel generative model, named WaveNet, capable of successfully generating raw audio in the form of speech and music. WaveNet's experiment on text-to-speech synthesis was able to synthesize human-like speech by mapping linguistics and contours. To deal with long-range temporal dependencies in audio production, a dilated causal convolution model was developed, with the advantage of preserving the input resolution throughout the network. Saito et al., 2017 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] performed Statistical Parametric Speech Synthesis using a GAN. This method successfully alleviates the over-smoothing effect, that is, the inability to capture the nuances in the audio wave, producing better results than other conventional models such as the hidden Markov model and the Gaussian mixture model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Engel et al., 2019 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have demonstrated that GANs can produce audio orders of magnitude faster than their autoregressive counterparts. Several key findings were observed in this paper; for instance, more coherent waveforms are produced when generating log-magnitude spectrograms and the phases directly with the GANs. We have harnessed the power of the GAN architecture in our research to synthesize the audio signals and evaluated them using the methodology proposed by Engel et al., 2019 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Shen et al., 2017 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed an entirely neural approach for speech synthesis using the recurrent sequence-to-sequence Tacotron model. Even though features such as linguistics, phonemes and the log fundamental frequency are primary components of speech, this approach incorporates the Mel-spectrogram, which had a considerable advantage in reducing the size of the WaveNet architecture. Wang et
al., 2018 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed an architecture that uses a method similar to that of Shen et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], with an extra layer called the "Style Token" layer for measuring the similarities between the embeddings created by a reference encoder. Style control includes token identification of each attribute, such as pitch, speaking rate and emotion, and changing its magnitude to animate the speech. Building on this notion in our research, the network was able to generate speech in a human-like voice when trained on Mel-Spectrograms.
      </p>
      <p>
        Jia et al., 2019 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a neural-network-based system for multi-speaker speech synthesis and a TTS methodology. The synthesizer generated high-quality speech not only for learned speakers but also for speakers never seen before. As the diversity in the training set increased, the synthesizer was able to learn the variation between speakers and produce realistic audio. Donahue et al., 2018
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] applied WaveGAN to generate raw audio through unsupervised training. WaveGAN is capable of synthesizing one-second slices of audio waveforms with global coherence, ideal for the generation of sound effects. The analysis finds that, without labels, WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We took inspiration from this literature and built a network more efficient than these models.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Generative Adversarial Networks</title>
      <p>
        GANs are deep learning models that comprise two neural networks competing against one another to generate realistic synthetic samples from a respective data distribution Pdata(x) [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. A typical GAN architecture contains a generator and a discriminator.
      </p>
      <p>
        The generator G(z) neural network accepts a random variable z (Gaussian noise) drawn from the prior distribution and maps it to the pseudo-data distribution through the hidden layers, generating a complex distribution Pg. The ultimate goal of the generator is that the generated distribution Pg(x) and the actual data distribution Pdata(x) should be as similar as possible [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Therefore, the target of the generator is the optimal G*, as shown in equation (1).
      </p>
      <p>G* = arg min_G Div(Pg, Pdata)   (1)</p>
      <p>
        To calculate the difference between the two distributions, the original Generative Adversarial Network employs a binary classifier [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] model called the Discriminator D. During training, the output of the discriminator should be 1 if the input is a real sample x; otherwise, the output is 0. Goodfellow et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] used the binary cross-entropy function to define the discriminator, which is popularly used for binary classification problems. The sample presented to the discriminator can come either from the actual distribution Pdata or from the model-predicted distribution Pg. Therefore, the complete objective function for the discriminator is given in the following equation (2).
V(G, D) = E_{x~Pdata}[log D(x)] + E_{x~Pg}[log(1 - D(x))]   (2)
      </p>
      <p>
        By combining equations (1) and (2), we get the min-max optimization function (3) of the Generative Adversarial Network. In this min-max game, the generator tries to delude the discriminator. The generator attempts to maximize the output of the discriminator when a fake sample is introduced. Conversely, the discriminator tries to minimize the loss by differentiating between true and false samples. Specifically, the discriminator maximizes V(G, D) while the generator aims to minimize V(G, D), thus establishing the min-max relationship [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
min_G max_D V(G, D) = E_{x~Pdata}[log D(x)] + E_{z~Pz}[log(1 - D(G(z)))]   (3)
      </p>
      <p>When the generator is training, the discriminator's parameters are fixed. The predicted data from the generator is labelled as a fake sample and given as an input to the discriminator. The error is determined by the output D(G(z)) of the discriminator, which classifies between the positive sample x from the real data set and the negative sample generated by the generator G(z). Finally, the calculated error is used to modify the generator parameters using backpropagation.</p>
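      <p>As a minimal illustration of this alternating optimization, and not the exact training code used in this work, a single update step for equation (3) can be sketched in Python/TensorFlow as follows, assuming a generator and discriminator built with tf.keras:</p>
      <preformat>
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def train_step(real_specs, generator, discriminator, g_opt, d_opt, latent_dim=100):
    # One alternating update of equation (3): D maximizes V(G, D), G minimizes it.
    z = tf.random.normal((tf.shape(real_specs)[0], latent_dim))
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake_specs = generator(z, training=True)
        d_real = discriminator(real_specs, training=True)
        d_fake = discriminator(fake_specs, training=True)
        # Discriminator: push D(x) towards 1 and D(G(z)) towards 0.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator: fool the discriminator by pushing D(G(z)) towards 1.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
      </preformat>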
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <sec id="sec-4-1">
        <title>Benefits of Short-time Fourier Transform</title>
        <p>
          A signal is a variation in a certain quantity over time. For audio, the quantity
that varies is air pressure [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Typically, the amplitude is measured as a
function of the pressure shift around the microphone or receiver unit that initially
picked up the audio. The amplitude is a function of its transition over a
duration (usually time). A waveform is a visual representation of an audio signal,
typically depicted as a representation of the time series, where the value of the
y-axis is the amplitude of the waveform. However, it is only a two-dimensional representation of this dynamic and vibrant audio signal. The waveform itself does not deliver proper class information, i.e., it is difficult to distinguish between the digits. Nevertheless, there are GAN models that use a waveform (both as an image and as a 1-D sequence) to generate data.
        </p>
        <p>
          We are only observing the resulting amplitudes of the measurements of the signal taken over time. Multiple single-frequency sound waves make up an audio signal. Therefore, we used the Fourier Transform (FT), another mathematical representation used in signal processing to extract useful information. The FT is a numerical method that decomposes a signal into its constituent frequencies and their magnitudes.
The original audio signal is broken down into a series of sine and cosine waves
adding up to the original signal [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. In other words, it converts the time domain
signal into a frequency domain signal. The resulting frequency-domain signal is
known as a spectrum.
        </p>
        <p>
          Audio signals, such as music and speech, are referred to as non-periodic signals because their frequency content varies over time. To represent these signals as a spectrum, several small Fourier transforms are calculated on multiple windowed fragments of the signal. This is called the short-time Fourier Transform
(STFT). STFT provides the time-localized frequency information for situations
in which frequency components of a signal vary over time [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. In contrast, the
standard Fourier transform provides the frequency information, averaged over
the entire signal time interval [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
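        <p>For illustration, the short sketch below computes an STFT with the Librosa library over a one-second clip; the file name is hypothetical, and the window and hop sizes match the pre-processing described later.</p>
        <preformat>
import numpy as np
import librosa

# Hypothetical one-second SC09-style clip loaded at a 22050 Hz sample rate.
y, sr = librosa.load("zero_0001.wav", sr=22050)
# One Fourier transform per 2048-sample window, hopping 512 samples at a time.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude = np.abs(stft)        # frequency magnitudes for each window
print(magnitude.shape)          # (1 + n_fft // 2, n_frames)
        </preformat>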
      </sec>
      <sec id="sec-4-2">
        <title>Reasons for Using Mel-Spectrogram</title>
        <p>
          A spectrogram is a visual illustration of a signal's frequency spectrum that
changes over time. The STFT is applied over each fragment of the audio signal
to obtain the power spectrum of the signal. The power spectrum of a time series
explains how power is distributed in frequency components (energy per unit time)
that compose the signal [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The power spectra are stacked on top of each other to form the spectrogram of an audio signal.
        </p>
        <p>
          However, humans can only perceive a small and concentrated range of
frequencies and amplitudes. The calculated spectrogram does not discriminate between the frequencies that humans can perceive. Therefore, the y-axis (frequency f) is converted
to a log scale and the color dimension to decibels [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] (log scale of amplitude).
This technique uses the Mel-Scale, which approximates the human auditory system's response more closely than narrow frequency bands. Stevens, Volkmann and Newman proposed the Mel-Scale as a unit of pitch such that equal distances
in pitch sounded equally distant to the listener [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Therefore, a spectrogram whose frequencies are converted to the Mel-scale on the y-axis is called a Mel-Spectrogram.
        </p>
        <p>
          In our approach, the audio wave files of the Speech Command digit dataset are dynamically represented as Mel-Spectrogram images and synthesized in GAN models. Finally, the original audio is retrieved from an STFT sequence by taking an inverse transform of each frame, which is overlapped and added iteratively. This algorithm for reconstructing an audio signal from a spectrogram by solving phase recovery is known as Griffin-Lim [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The restoration of the audio signal regains its phase with an increasing number of iterations. In our application, we have used 60 iterations to bring the audio back from the Mel-Spectrogram.
        </p>
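        <p>A minimal sketch of this round trip with Librosa is shown below; librosa.feature.inverse.mel_to_audio performs Griffin-Lim phase recovery internally, and the 60 iterations match the setting described above (the file name is hypothetical).</p>
        <preformat>
import librosa

# Hypothetical SC09-style clip; the Mel parameters match the pre-processing section.
y, sr = librosa.load("zero_0001.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
# Reconstruct audio from the Mel-Spectrogram with 60 Griffin-Lim iterations.
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=2048,
                                             hop_length=512, n_iter=60)
        </preformat>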
      </sec>
      <sec id="sec-4-3">
        <title>Mel-Spec GAN (MSGAN)</title>
        <p>
          In the field of Computer Vision, Generative Adversarial Networks have seen tremendous success in the past years. We can now produce incredibly realistic images which are indistinguishable from actual ones, showing how far GAN technology has progressed. Since audio signals can be represented as Mel-Spectrograms, they can be incorporated into GANs to synthesize new Mel-Spectrograms that in turn can be converted back to audio signals.
The core of the proposed Mel-Spec GAN model takes inspiration from Deep
Convolutional GAN (DCGAN) [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Unlike a grayscale image, the audio signals are represented as a 3-dimensional tensor of frequency bins with a single-channel input shape of (128, 128, 1) and a constant sample rate of 22050 Hz. Subsequently, each frequency bin is normalized with the mean and standard deviation and rescaled to [-1, 1].
        </p>
        <p>
          We incorporated convolutional layers while designing the generator and discriminator models. The discriminator model takes as input one 128x128x1 Mel-spectrogram and outputs a binary prediction of whether the audio is real or fake. The hidden layers are customized with downsampling blocks consisting of stride-2 two-dimensional convolutional layers equipped with 5x5 filters. Inspired by the architecture of WaveGAN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the generator model is refashioned with 2D transposed convolutional layers, achieving upsampling without using max-pooling or nearest-neighbour interpolation.
        </p>
        <p>
          The Rectified Linear Unit (ReLU) activation function f(x) = max(0, x) is implemented in the generator network [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], and the weights are initialized with
a slightly positive initial bias using a truncated normal distribution with zero
mean and 0.02 standard deviation to avoid dead neurons [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. The LeakyReLU activation function f(x) = max(0, x) + a*min(0, x) with a slope a of 0.2 is used in every layer of the discriminator except the dense output layer [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. We regularized the intermediate activations and accelerated the learning process with batch normalization in all layers except the input and output layers. Ultimately, the Adam variant of the stochastic gradient descent optimizer with a learning rate of 0.0002 and a momentum of 0.5 is adopted in both models.
        </p>
        <p>We used the sigmoid activation function f(x) = 1 / (1 + e^-x) with binary cross-entropy loss for the output layer of the discriminator and a tanh activation function f(x) = (e^x - e^-x) / (e^x + e^-x) for the output layer of the generator to generate a 128x128x1 Mel-spectrogram in the range of -1 to 1.</p>
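        <p>The following Keras sketch summarizes the layer layout described above (128x128x1 input, 5x5 filters, stride-2 convolutions and transposed convolutions, batch normalization, ReLU/LeakyReLU, tanh and sigmoid outputs, and Adam with a learning rate of 0.0002 and momentum 0.5); the channel depths are illustrative assumptions rather than the exact trained configuration.</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers

# Truncated normal weight initialization (zero mean, 0.02 std), as described above.
init = tf.keras.initializers.TruncatedNormal(mean=0.0, stddev=0.02)

def build_generator(latent_dim=100):
    return tf.keras.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(latent_dim,), kernel_initializer=init),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same", kernel_initializer=init),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same", kernel_initializer=init),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(32, 5, strides=2, padding="same", kernel_initializer=init),
        layers.BatchNormalization(), layers.ReLU(),
        # tanh output produces a 128x128x1 Mel-spectrogram in [-1, 1].
        layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    return tf.keras.Sequential([
        layers.Conv2D(32, 5, strides=2, padding="same", input_shape=(128, 128, 1)),
        layers.LeakyReLU(0.2),
        layers.Conv2D(64, 5, strides=2, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        layers.Conv2D(128, 5, strides=2, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        layers.Flatten(),
        # Sigmoid output gives the real/fake probability used with binary cross-entropy.
        layers.Dense(1, activation="sigmoid"),
    ])

g_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
d_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
        </preformat>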
        <p>The Mel-Spec GAN is trained on the pre-processed input audio wave files, and the generator is used to synthesize new plausible spectrograms, which are reconstructed to audio waveforms using the Griffin-Lim algorithm.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Conditional-Mel-Spec GAN (CMSGAN)</title>
        <p>Although the Mel-SpecGAN model can fabricate new random plausible audio from a given domain, there is no way to control the types of audio that are generated other than trying to find a strong correlation between the generator's latent-space input and the audio produced.</p>
        <p>
          Therefore, we took the digit labels (0-9) into consideration to forge the Conditional-Mel-SpecGAN (CMSGAN). These labels are transmuted into
one-hot encoded vectors y of dimension (data size x 10). By feeding the class label
vector y into the generator and the discriminator, the original Mel-SpecGAN is
expanded to a conditional model [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. In this way, the improved Mel-SpecGAN
can quickly learn the data distribution of each class independently and generate
the samples in accordance with the given condition label y [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. The loss function
of the modified GAN is depicted in formula (4).
        </p>
        <p>min_G max_D V(D, G) = E_{x~pdata(x)}[log D(x|y)] + E_{z~pz(z)}[log(1 - D(G(z|y)))]   (4)</p>
        <p>
          To encode the class labels into the discriminator model [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], we embedded the input class layer into a fully connected dense layer (128x128) with a linear activation that scales the embedding to the size of the spectrogram (128x128). Furthermore, the embedded input is reshaped into a single activation map (128x128x1) before concatenating it with the model input as an additional feature map.
        </p>
        <p>In the generator model, we concatenated the latent vector and the input class vector before feeding them into the fully connected layer with 16384 activations, to match the activations of the unconditional generator model. The ten new label features are appended as additional entries to the existing (Nx100) input, resulting in (Nx110) features that are up-sampled as in the previous model.</p>
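        <p>A hedged Keras sketch of this label conditioning is given below; the 128x128 label projection and the 16384-unit dense layer follow the description above, while the remaining layer depths are illustrative assumptions.</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def build_conditional_discriminator(n_classes=10):
    spec_in = layers.Input(shape=(128, 128, 1))
    label_in = layers.Input(shape=(n_classes,))            # one-hot digit label y
    # Project the label to a 128x128 map with a linear dense layer and append it
    # to the Mel-spectrogram as an extra feature map.
    label_map = layers.Dense(128 * 128, activation="linear")(label_in)
    label_map = layers.Reshape((128, 128, 1))(label_map)
    x = layers.Concatenate(axis=-1)([spec_in, label_map])  # (128, 128, 2)
    x = layers.Conv2D(64, 5, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model([spec_in, label_in], out)

def build_conditional_generator(latent_dim=100, n_classes=10):
    z_in = layers.Input(shape=(latent_dim,))
    label_in = layers.Input(shape=(n_classes,))
    x = layers.Concatenate()([z_in, label_in])              # (N x 110) conditioned input
    x = layers.Dense(16384)(x)                              # match unconditional activations
    x = layers.Reshape((8, 8, 256))(x)
    x = layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="tanh")(x)
    return tf.keras.Model([z_in, label_in], out)
        </preformat>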
        <p>We then trained the GAN model with the latent vector and class label as input, producing a prediction of whether the input was genuine or counterfeit. We optimized the GAN by smoothing the real and fake labels of the discriminator so that the loss does not converge to 0. Both GAN models are evaluated using the Inception Score as a stopping criterion.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Protocol</title>
      <sec id="sec-5-1">
        <title>SC09 Dataset and the Effects of Background Noise</title>
        <p>
          Our research focuses on the Dataset of Spoken Commands [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Google Brain created this dataset by having several speakers record single words under unregulated recording environments. We analyzed a subset of the spoken commands
"zero" through "nine" and referred to this subset as the Speech Commands
Digits dataset (SC09) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Each recording is one second in duration with a different alignment in time. Although this dataset is deliberately similar to the famous MNIST written digit dataset, we note that SC09 examples (128x128) are of far higher dimension than MNIST examples (28x28). These ten words contain several phonemes, and two have multiple syllables. The training set includes 1850
utterances of every digit, resulting in 5.3 hours of speech [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          We calculated the amount of background noise in the underlying dataset,
using a speech quality measure called Signal to Noise Ratio (SNR) [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. However,
to compare the waveforms directly in the time domain, synchronization of the original and distorted signals was necessary. Since we converted the time-domain signal to the spectral domain, we can compute the SNR using speech segments, typically between 20 and 30 ms long [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. This method is known as Segmented Signal to
Noise Ratio (SegSNR) [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. This approach is more reliable than its predecessor and less sensitive to signal alignment. Certain digits like 'Nine', 'Six' and 'Five'
have negative SegSNR and are highly prone to errors in synthesis. Therefore,
the ambiguity of alignments, speakers, and recording environments makes this a
challenging modeling dataset [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
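        <p>For reference, a minimal segmental SNR computation over short frames might look like the sketch below; it assumes a time-aligned clean and degraded pair, and the frame length and the [-10, 35] dB clamping are common conventions rather than the exact procedure used here.</p>
        <preformat>
import numpy as np

def segmental_snr(clean, degraded, frame_len=480, eps=1e-12):
    # Average per-frame SNR in dB over short frames (~20-30 ms, e.g. 480 samples at 16 kHz).
    n = (min(len(clean), len(degraded)) // frame_len) * frame_len
    c = clean[:n].reshape(-1, frame_len)
    d = degraded[:n].reshape(-1, frame_len)
    noise = c - d
    snr = 10.0 * np.log10(((c ** 2).sum(axis=1) + eps) / ((noise ** 2).sum(axis=1) + eps))
    # Clamp frame SNRs to a common [-10, 35] dB range before averaging.
    return float(np.clip(snr, -10.0, 35.0).mean())
        </preformat>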
      </sec>
      <sec id="sec-5-2">
        <title>Pre-Processing and Experimental Setup</title>
        <p>
          We parallelized the pre-processing of the audio signals while converting them into Mel-Spectrogram images. Functions from the Librosa library were utilized to load the WAV-formatted audio files. In addition, an audio utility package was constructed using core Librosa functions to compute the Mel-Spectrogram and its inverse. To calculate the STFT, we sampled the audio signals with windows of size n_fft=2048 and hops of size hop_length=512 to transform from the time domain to the frequency domain. We then took the entire frequency spectrum and separated it into n_mels=128 evenly spaced Mel bands. For each window, we decomposed the magnitude of the signal into frequencies and scaled the corresponding frequencies to a log scale. A uniform dimension (128x128) was maintained for the spectrogram by padding any shortfall with -80 decibels. Finally, we normalized the Mel-frequencies with the mean and standard deviation and scaled them to the range of [-1, 1]. The Mel-Spectrogram and its corresponding mean and standard deviation are saved as an NPZ file for the model to synthesize from.
        </p>
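        <p>The pre-processing described above can be sketched as follows with Librosa and NumPy; the padding to 128 frames and the exact scaling to [-1, 1] are reasonable assumptions about the utility code rather than a verbatim reproduction of it.</p>
        <preformat>
import numpy as np
import librosa

def wav_to_normalised_mel(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)                  # log (dB) scale
    # Keep a uniform 128x128 shape, filling any shortfall with -80 dB.
    padded = np.full((n_mels, 128), -80.0, dtype=np.float32)
    padded[:, :min(128, mel_db.shape[1])] = mel_db[:, :128]
    # Normalize with mean/std and rescale to [-1, 1] (assumed scaling scheme).
    mean, std = padded.mean(), padded.std()
    scaled = (padded - mean) / (std + 1e-8)
    scaled = np.clip(scaled / (np.abs(scaled).max() + 1e-8), -1.0, 1.0)
    return scaled, mean, std

# Hypothetical file names; the statistics are stored so audio can be rebuilt later.
spec, mean, std = wav_to_normalised_mel("zero_0001.wav")
np.savez("zero_0001.npz", spec=spec, mean=mean, std=std)
        </preformat>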
        <p>We trained our networks using batches of size 32 on an NVIDIA TESLA P100 GPU in Google Colab Pro. During our quantitative assessment of SC09, our Mel-SpecGAN networks converged by their evaluation criterion (Inception Score) within 5 hours of training (around 80K epochs) and produced speech-like audio after 30k epochs. Our Conditional Mel-SpecGAN networks converged more quickly, within 3 hours (about 50k epochs), and produced better results with a much higher Inception Score.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation Methodology</title>
      <sec id="sec-6-1">
        <title>Inception Score</title>
        <p>
          Salimans et al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] proposed the Inception Score, an empirical metric for measuring the quality and the semantic discriminability of the images generated by GAN models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The performance of the Generative Adversarial Networks is monitored by a pre-trained deep learning image classification model, which is incorporated to classify the generated images and uses the conditional probability as a basis to calculate the Inception Score [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. The Inception
Score has a minimum value of 1.0 and a maximum value of the number of classes
provided by the classification model.
        </p>
        <p>
          To measure the Inception Score, we trained an audio classifier model on Mel-Spectrogram features of the Speech Command Digit dataset. The pre-processed and normalized Mel-frequencies were used as the input to the network and the one-hot encoded labels as the output class vectors. We built our classifier network with four 2D convolutional and pooling layers, followed by two dense layers, projecting the result to a softmax layer with ten classes [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The network was compiled with categorical cross-entropy along with the Ada-Delta optimizer. We ran up to 50 epochs with early stopping on the minimum negative log-likelihood of the test dataset, achieved an accuracy of 99.82%, and saved the model for evaluating the GANs. To calculate the Inception Score, we first used our pre-trained deep learning Mel-Spectrogram classifier model to estimate the conditional probability of the generated audio spectrograms (p(y|x)). After that, the marginal probability was calculated as the average of the conditional probabilities for the spectrograms in the group (p(y)).
        </p>
        <p>KL divergence = p(y|x) * (log(p(y|x)) - log(p(y)))   (5)</p>
        <p>
          We combined these metrics and calculated the Kullback-Leibler divergence (KL divergence) for each spectrogram as the product of the conditional probability with the log of the same minus the log of the marginal likelihood [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] as exhibited in formula (5). Finally, the final Inception Score is computed by taking the exponent of the KL divergence summed over classes and averaged over all generated samples.
        </p>
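        <p>Given the classifier's softmax outputs for a batch of generated spectrograms, this computation reduces to a few lines; the sketch below is a generic Inception Score implementation consistent with formula (5), not the exact evaluation script.</p>
        <preformat>
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, 10) array of classifier softmax outputs p(y|x) for generated samples.
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # per-sample, per-class KL terms
    return float(np.exp(kl.sum(axis=1).mean()))               # exp of mean summed KL
        </preformat>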
      </sec>
      <sec id="sec-6-2">
        <title>Quality Human Evaluation</title>
        <p>
          To support the algorithmic evaluation of the model, we incorporated another
evaluation metric called the Mean Opinion Score (MOS). The International Telecommunication Union (ITU) defines the Mean Opinion Score (MOS) as a numerical ranking of the human-judged overall performance of a system's quality (voice
or video) [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. MOS is calculated on a scale of 1 (lowest perceived quality) to
5 (highest perceived quality), which is the arithmetic mean of individual values
of human-scored parameters.
        </p>
        <p>We measured perceived quality with human annotators by creating a survey on Amazon Mechanical Turk to rate the generated speech. We identified our best Mel-Spec GAN (MSGAN) and Conditional Mel-Spec GAN (CMSGAN) models by their core evaluation metrics and produced random samples. We created 20 batches of each digit (labelled by the classifier model) for the CMSGAN model and 200 random digit samples for the MSGAN model. We customized a form-based layout using Crowd-HTML to develop the UI for the survey and linked it with MTurk services for the human evaluation. We asked 100 annotators to assign subjective values of 1-5 for sound quality and reported the scores in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results and Discussion</title>
      <p>
        We implemented the MSGAN and CMSGAN models with the Inception Score as the evaluation and stopping criterion. The Mel-Spec GAN achieved an Inception Score of 5.76, while the GAN with label conditions improved on this with a substantial score of 7.64. We compared the scores with other similar implementations of Adversarial Audio Synthesis Networks, like WaveGAN and SpecGAN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], as shown in Table 1. To endorse the assessment, we validated the models using a crowd-sourced experiment. While the MSGAN averaged a MOS of 3.01, CMSGAN achieved a higher MOS of 3.74. By leveraging the Mel-Spectrograms and label embedding, the speech quality of the proposed model improved by 63.6%.
      </p>
      <p>
        In this paper, we synthesized the speech signals using various GAN models and introduced a novel methodology for generating audio waveforms. We used a frequency-domain representation of the audio signal and converted wave files to Mel-Spectrogram image files using the short-time Fourier transform. By incorporating ideas from the image-synthesizing DCGAN, we customized the Mel-Spec GAN architecture by leveraging the Mel-Spectrogram of the audio signal. The generated Mel-Spectrograms are converted back to the actual waveform using the fast Griffin-Lim algorithm to solve the phase-recovery problem. We then improved the proposed architecture by embedding the label as an input, so that the audio tags condition the output. Not only did the Inception Score of the model improve from 5.7 to 7.6, but the model also converged quickly to achieve extraordinary results.
      </p>
      <p>In its current form, the Mel-Spec GAN can be used for real-time speech synthesis. In our future work, we plan to extend the potential of GANs to operate on variable-length and longer audio and multiple accents. By providing a template for speech synthesis and Mel-spectrogram generation models that operate on speech signals, we hope that this research will catalyze future audio-synthesis experiments with GANs.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work is supported in part by Science Foundation Ireland (Grant Nos.
SFI/12/RC/2289 P2 and 17/RC-PhD/3482). We gratefully acknowledge the
support of NVIDIA Corporation with the donation of the Titan Xp used for this
research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Saito</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takamichi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saruwatari</surname>
          </string-name>
          , H.:
          <article-title>Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          .
          <volume>26</volume>
          ,
          <fpage>84</fpage>
          -
          <lpage>96</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pasini</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms</article-title>
          , https://arxiv.org/abs/1910.03713.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAuley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puckette</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Synthesizing Audio with GANs</article-title>
          . ICLR, https://openreview.net/forum?id=r1RwYIJPM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solanki</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Speaker recognition: an enhanced approach to identify singer voice using neural network</article-title>
          .
          <source>International Journal of Speech Technology</source>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>WaveNet: A Generative Model for Raw Audio</article-title>
          , https://arxiv.org/abs/1609.03499.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Saito</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takamichi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saruwatari</surname>
          </string-name>
          , H.:
          <article-title>Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          .
          <volume>26</volume>
          ,
          <fpage>84</fpage>
          -
          <lpage>96</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Awad</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
          </string-name>
          , R.: Hidden Markov Model. Efficient Learning Machines.
          <volume>81</volume>
          -
          <fpage>104</fpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Engel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulrajani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : GANSynth: Adversarial Neural Audio Synthesis, https://arxiv.org/abs/1902.08710.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Jaitly</surname>
            ,
            <given-names>N .</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Skerrv-Ryan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Saurous</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          <string-name>
            <surname>Agiomvrgiannakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions</article-title>
          .
          <source>IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pp
          <fpage>4779</fpage>
          -
          <lpage>4783</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Skerry-Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Battenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Shor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Saurous</surname>
          </string-name>
          , R.: Style Tokens:
          <article-title>Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis</article-title>
          , https://arxiv.org/abs/1803.09017.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis</article-title>
          , https://arxiv.org/abs/1806.04558.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAuley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puckette</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Adversarial Audio Synthesis, https://arxiv.org/abs/1802.04208.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Creswell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>White</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumoulin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arulkumaran</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sengupta</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharath</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Generative Adversarial Networks: An Overview</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          .
          <volume>35</volume>
          ,
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouget-Abadie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warde-Farley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozair</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Generative Adversarial Networks</article-title>
          , https://arxiv.org/abs/1406.2661.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oerlemans</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep learning for visual understanding: A review</article-title>
          .
          <source>Neurocomputing</source>
          .
          <volume>187</volume>
          ,
          <fpage>27</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.: The</given-names>
          </string-name>
          <string-name>
            <surname>Cross-Entropy Method</surname>
          </string-name>
          : A Unified Approach to Combinatorial Optimization,
          <string-name>
            <surname>Monte-Carlo Simulation</surname>
          </string-name>
          , and
          <article-title>Machine Learning</article-title>
          .
          <source>Technometrics</source>
          .
          <volume>48</volume>
          ,
          <fpage>147</fpage>
          -
          <lpage>148</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>You</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Generative Adversarial Networks and Its Applications in Biomedical Informatics</article-title>
          .
          <source>Frontiers in Public Health. 8</source>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Allen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Chapter 1. Sounds and Signals</article-title>
          . In Think DSP:
          <article-title>Digital Signal Processing in Python, First edition</article-title>
          ., pp
          <volume>1</volume>
          {
          <fpage>11</fpage>
          , Sebastopol, CA:
          O'Reilly Media, Inc.
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kehtarnavaz</surname>
          </string-name>
          ,N.:
          <source>Digital Signal Processing System Design. Amsterdam</source>
          . Academic Press, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. van den Bogaert, B.:
          <article-title>When Frequencies Change in Time; Towards the Wavelet Transform</article-title>
          .
          <source>Data Handling in Science and Technology</source>
          .
          <volume>33</volume>
          -
          <fpage>55</fpage>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Moorer</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A note on the implementation of audio processing by short-term Fourier transform</article-title>
          .
          <source>2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)</source>
          .
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Vetterling</surname>
            ,
            <given-names>W. T.</given-names>
          </string-name>
          , &amp; Press, W. H.:
          <article-title>Numerical recipes in Fortran: the art of scientific computing</article-title>
          (Vol.
          <volume>1</volume>
          ). Cambridge University Press (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Understanding the Mel Spectrogram</article-title>
          , https://medium.com/analyticsvidhya/understanding-the-mel-spectrogram-fca2afa2ce53.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volkmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Scale for the Measurement of the Psychological Magnitude Pitch</article-title>
          .
          <source>The Journal of the Acoustical Society of America. 8</source>
          ,
          <fpage>185</fpage>
          -
          <lpage>190</lpage>
          (
          <year>1937</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Griffin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Signal estimation from modified short-time Fourier transform</article-title>
          .
          <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>
          .
          <volume>32</volume>
          ,
          <fpage>236</fpage>
          -
          <lpage>243</lpage>
          (
          <year>1984</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks</article-title>
          , https://arxiv.org/abs/1511.06434.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . Cambridge, MA: MIT Press, pp.
          <fpage>175</fpage>
          -
          <lpage>250</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Can We Generate Good Samples for Hyperspectral Classification? A Generative Adversarial Network Based Method</article-title>
          .
          <source>IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Ramachandran</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Searching for Activation Functions</article-title>
          . (
          <year>2017</year>
          ), https://arxiv.org/abs/1710.05941.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Conditional Generative Adversarial Nets</article-title>
          , https://arxiv.org/abs/1411.1784.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Can We Generate Good Samples for Hyperspectral Classification? A Generative Adversarial Network Based Method</article-title>
          .
          <source>IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Denton</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks</article-title>
          , https://arxiv.org/abs/1506.05751.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Warden</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition</article-title>
          , https://arxiv.org/abs/1804.03209.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Prodeus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Didkovskyi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Didkovska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotvytskyi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motorniuk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khrapachevskyi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Objective and Subjective Assessment of the Quality and Intelligibility of Noised Speech</article-title>
          .
          <source>2018 International Scientific-Practical Conference Problems of Infocommunications, Science and Technology (PIC S&amp;T)</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Objective Speech Quality Measures</article-title>
          , http://www.irisa.fr/armor/lesmembres/Mohamed/Thesis/node94.html.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Salimans</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Improved Techniques for Training GANs</article-title>
          , https://arxiv.org/abs/1606.03498.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibler</surname>
          </string-name>
          , R.:
          <article-title>On Information and Sufficiency</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          .
          <volume>22</volume>
          ,
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          (
          <year>1951</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          ITU-T Rec. P.10/G.
          <volume>100</volume>
          (
          <issue>11</issue>
          /
          <year>2017</year>
          ):
          <article-title>Vocabulary for performance, quality of service and quality of experience</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>