Improved Speech Synthesis using Generative Adversarial Networks

Dineshraj Gunasekaran1[0000−0003−3479−960X], Gautham Venkatraj1[0000−0002−3553−1608], Eoin Brophy2[0000−0002−6486−5746], and Tomas Ward1[0000−0002−6173−6607]

1 Insight SFI Research Centre for Data Analytics, Dublin City University, Dublin
2 Infant Research Centre, Cork University Maternity Hospital, Ireland
{dineshraj.gunasekaran2,gautham.venkatraj2,eoin.brophy7}@mail.dcu.ie
tomas.ward@dcu.ie

Abstract. Artificially generated audio and speech have been a major area of machine learning research in recent years, and Generative Adversarial Networks (GANs) have been at the forefront of progress in this domain. WaveGAN and SpecGAN are among the current state-of-the-art methods for speech synthesis that use generative modelling to produce synthetic speech signals. Of particular interest, SpecGAN uses spectrograms of speech as training data for the GAN. Currently available models like WaveGAN do not capture the structure and linguistic information of the audio, which leads to synthesized audio with high noise. In this paper, we propose a method we call Mel-Spectrogram GAN (MSGAN) that instead trains on the Mel-spectrogram of the audio signal, a spectrogram whose frequency axis is warped onto the mel scale, which approximates the response of the human auditory system more closely than linearly spaced frequency bands. Using this approach, we demonstrate a 23% higher Inception Score than WaveGAN. Further, we establish an improved Conditional MSGAN that quickly learns the data distribution of each class to produce better quality speech and increases the score by a further 32%. These results suggest that our Conditional MSGAN architecture is a promising approach for improved speech synthesis using GANs.
Keywords: speech synthesis · GAN · neural network · deep learning · mel-spectrogram

1 Introduction

The primary motive of this study is to improve audio synthesis by training Generative Adversarial Networks (GANs) more effectively and efficiently. Raw time-series speech signals, such as number narrations and word utterances, are used as input in this investigation, which allows us to capture the phonetics and intonations of the speech directly. Natural speech data of the kind relevant in real-world scenarios is used in this analysis to produce speech with a natural-sounding quality [1]. Among the genres of speech synthesis, text-to-speech (TTS) is one of the key areas where many effective algorithms have already been established, so the focus is now shifting towards voice conversion and style transfer [2], both of which can be addressed through GAN approaches.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

One way to obtain more natural-sounding audio is to model the probability distribution of the training data, parametrically or non-parametrically, and relate the synthetic speech to these distributions. The existing WaveGAN [3] is used as a benchmark, and improved results have been obtained using our proposed Mel-Spec GAN (MSGAN) and Conditional Mel-Spec GAN (CMSGAN), as detailed in this report.

For audio files, the conversion of the time series to spectrograms provides the input to these models. A Mel-spectrogram is a spectrogram constructed by converting the frequencies to the mel scale. This reduces the loss of information when translating an audio file into an image, or pictorial, representation. Humans are better at detecting differences between sounds at lower frequencies than at higher frequencies.
The mel scale was constructed so that equal distances in pitch sound equally distinct to listeners [4]. Using this Mel-spectrogram representation, a Mel-Spec GAN architecture is implemented in this study to perform speech synthesis, building on architectures that have proven efficient in the domain of computer vision. An improved version of this architecture, conditioned on class labels, is also implemented and further improves the quality of the synthetic speech.

2 Related Work

Van den Oord et al., 2016 [5] proposed WaveNet, a novel generative model capable of successfully generating raw audio in the form of speech and music. WaveNet's text-to-speech experiments were able to synthesize human-like speech by conditioning on linguistic features and fundamental frequency contours. To deal with long-range temporal dependencies in audio production, a dilated causal convolution model was developed, with the advantage of preserving the input resolution throughout the network. Saito et al., 2017 [6] performed statistical parametric speech synthesis using a GAN. This method successfully alleviates the over-smoothing effect (the inability to capture the nuances of the audio wave) and produces better results than conventional models such as the hidden Markov model and the Gaussian mixture model [7]. Engel et al., 2019 [8] demonstrated that GANs can produce audio orders of magnitude faster than their autoregressive counterparts. Several key findings were reported in that paper; for instance, more coherent waveforms are produced when the GAN generates log-magnitude spectrograms and phases directly. We have harnessed the power of the GAN architecture in our research to synthesize audio signals and evaluated them using the methodology proposed by Engel et al., 2019 [8]. Shen et al., 2017 [9] proposed an entirely neural approach to speech synthesis using the recurrent sequence-to-sequence Tacotron model.
Even though features such as linguistic features, phonemes and log fundamental frequency are primary components of speech, this approach incorporates the Mel-spectrogram, which had a considerable advantage in reducing the size of the WaveNet architecture. Wang et al., 2018 [10] proposed an architecture that uses a method similar to that of Shen et al. [9], with an extra layer called the "Style Token" layer for measuring the similarities between the embeddings created by a reference encoder. Style control involves identifying a token for each attribute, such as pitch, speaking rate and emotion, and changing its magnitude to animate the speech. Building on this notion in our proposed research, the network was able to generate speech in a human-like voice when trained on Mel-spectrograms. Jia et al., 2019 [11] proposed a neural-network-based system for multi-speaker speech synthesis and TTS. The synthesizer generated high-quality speech not only for learned speakers but also for speakers never seen before. As the diversity of the training set increased, the synthesizer was able to learn the variation between speakers and produce realistic audio. Donahue et al., 2018 [12] applied WaveGAN to generate raw audio through unsupervised training. WaveGAN is capable of synthesizing one-second slices of audio waveforms with global coherence, ideal for the generation of sound effects. Their analysis finds that, without labels, WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We took inspiration from this literature and built a network more efficient than these models.
3 Generative Adversarial Networks

GANs are deep learning models that comprise two neural networks competing against one another to generate realistic synthetic samples from a data distribution P_data(x) [13, 14]. A typical GAN architecture contains a generator and a discriminator. The generator neural network G(z) accepts a random variable z (Gaussian noise) drawn from a prior distribution and maps it through its hidden layers to a complex generated distribution P_g. The ultimate goal of the generator is that the generated distribution P_g and the actual data distribution P_data should be as similar as possible [15]. The target of the generator is therefore to find the optimal G*, as shown in equation (1).

    G* = arg min_G Div(P_g, P_data)    (1)

To measure the difference between the two distributions, the original Generative Adversarial Network employs a binary classifier [16] called the discriminator, D(x). During training, the output of the discriminator should be 1 if the input is a real sample x, and 0 otherwise. Goodfellow et al. [14] used the binary cross-entropy function, which is widely used for binary classification problems, to define the discriminator loss. The sample given to the discriminator can come either from the actual distribution P_data or from the model distribution P_g. Therefore, the complete objective function for the discriminator is given in equation (2).

    V(G, D) = E_{x∼P_data}[log D(x)] + E_{x∼P_g}[log(1 − D(x))]    (2)

By combining equations (1) and (2), we obtain the min-max optimization function (3) of the Generative Adversarial Network. In this min-max game, the generator tries to delude the discriminator: it attempts to maximize the output of the discriminator when a fake sample is presented.
Conversely, the discriminator tries to minimize its loss by differentiating between true and fake samples. Specifically, the discriminator maximizes V(G, D) while the generator aims to minimize V(G, D), establishing the min-max relationship [17].

    min_G max_D V(G, D) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]    (3)

When the generator is training, the discriminator's parameters are fixed. The data produced by the generator is labelled as a fake sample and given as input to the discriminator. The error is determined by the output D(G(z)) of the discriminator, which classifies between positive samples x from the real dataset and negative samples from the generator G(z). Finally, the calculated error is used to update the generator parameters via backpropagation.

4 Methodology

4.1 Benefits of the Short-time Fourier Transform

A signal is the variation of some quantity over time; for audio, the quantity that varies is air pressure [18]. Typically, the amplitude is measured as a function of the pressure variation around the microphone or receiver unit that originally picked up the audio, so the amplitude is a function of time. A waveform is a visual representation of an audio signal, typically depicted as a time series in which the y-axis value is the amplitude of the waveform. However, it is merely a two-dimensional representation of a dynamic and rich audio signal. The waveform itself does not convey clear class information, i.e., it is difficult to distinguish between the digits, even though there are GAN models that use the waveform (both as an image and as a 1-D sequence) to generate data. In the waveform we only observe the resulting amplitudes of the signal measured over time, whereas an audio signal is composed of multiple single-frequency sound waves.
Therefore we used the Fourier transform (FT), a mathematical tool from signal processing, to extract useful information. The FT is a numerical method that decomposes a signal into its constituent frequencies and their magnitudes: the original audio signal is broken down into a series of sine and cosine waves that add up to the original signal [20]. In other words, it converts a time-domain signal into a frequency-domain signal. The resulting frequency-domain representation is known as a spectrum.

Audio signals such as music and speech are non-periodic signals, because their frequency content varies over time. To represent such signals as a spectrum, several small Fourier transforms are calculated on multiple windowed fragments of the signal. This is called the short-time Fourier transform (STFT). The STFT provides time-localized frequency information for situations in which the frequency components of a signal vary over time [19]. In contrast, the standard Fourier transform provides frequency information averaged over the entire signal interval [21].

4.2 Reasons for Using the Mel-Spectrogram

A spectrogram is a visual illustration of a signal's frequency spectrum as it changes over time. The STFT is applied to each fragment of the audio signal to obtain the power spectrum of that fragment. The power spectrum of a time series describes how power (energy per unit time) is distributed among the frequency components that compose the signal [22]. These power spectra are stacked next to each other to form the spectrogram of the audio signal.

However, humans can only perceive a small, concentrated range of frequencies and amplitudes, and the raw spectrogram does not discriminate between human-perceivable frequencies. Therefore the y-axis (frequency f) is converted to a log scale and the colour dimension to decibels [23] (a log scale of amplitude).
This frequency warping is the mel scale, which approximates the response of the human auditory system more closely than linearly spaced frequency bands. Stevens, Volkmann and Newman proposed the mel scale as a unit of pitch such that equal distances in pitch sound equally distant to the listener [24]. A spectrogram whose y-axis frequencies are converted to the mel scale is therefore called a Mel-spectrogram.

In our approach, the audio wave files of the speech command digit dataset are represented as Mel-spectrogram images and synthesized by GAN models. Finally, the original audio is retrieved from an STFT sequence by taking the inverse transform of each frame and overlap-adding the results iteratively. This algorithm, which reconstructs an audio signal from a spectrogram by solving the phase-recovery problem, is known as Griffin-Lim [25]. The reconstructed signal recovers its phase progressively as the number of iterations increases; in our application, we used 60 iterations to recover audio from the Mel-spectrogram.

4.3 Mel-Spec GAN (MSGAN)

In the field of computer vision, Generative Adversarial Networks have seen tremendous success in the past years: we can now produce incredibly realistic images that are indistinguishable from real ones, showing how far GAN technology has progressed. Since audio signals can be represented as Mel-spectrograms, they can be fed into GANs to synthesize new Mel-spectrograms that in turn can be converted back to audio signals.

The core of the proposed Mel-Spec GAN model takes inspiration from the Deep Convolutional GAN (DCGAN) [26]. Like a grayscale image, each audio signal is represented as a single-channel tensor of frequency bins with shape (128, 128, 1), at a fixed sample rate of 22050 Hz. Each frequency bin is then normalized with its mean and standard deviation and rescaled to [-1, 1].
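The path from waveform to normalized Mel-spectrogram can be sketched in plain NumPy. Our actual implementation uses Librosa; the triangular filterbank below is a simplified, illustrative approximation of Librosa's, with the parameter values we report later (n_fft=2048, hop_length=512, n_mels=128, sample rate 22050):

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula for the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters centred at points evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    # Frame the signal, apply a Hann window, take the magnitude STFT
    frames = np.array([y[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(y) - n_fft + 1, hop_length)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (T, n_fft//2+1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T     # (n_mels, T)
    db = 10.0 * np.log10(np.maximum(mel, 1e-10))
    return np.maximum(db, db.max() - 80.0)                # clip to a -80 dB floor

# One second of a 440 Hz tone, rescaled to [-1, 1] as described above
S = mel_spectrogram(np.sin(2 * np.pi * 440 * np.arange(22050) / 22050))
S_norm = 2 * (S - S.min()) / (S.max() - S.min()) - 1
```

Note the rescaling here is a min-max mapping for illustration; in the actual pipeline the per-spectrogram mean and standard deviation are stored alongside the data so the normalization can be inverted before Griffin-Lim reconstruction.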
We used convolutional layers in the design of both the generator and the discriminator. The discriminator takes as input one 128x128x1 Mel-spectrogram and outputs a binary prediction of whether the audio is real or fake. Its hidden layers are downsampling blocks consisting of stride-2 2-D convolutional layers with 5x5 filters. Inspired by the WaveGAN architecture [3], the generator is built from 2-D transposed convolutional layers, achieving upsampling without max-pooling or nearest-neighbour interpolation.

The Rectified Linear Unit (ReLU) activation function f(x) = max(0, x) is used in the generator network [29], and the weights are initialized from a truncated normal distribution with zero mean and standard deviation 0.02, with a slightly positive initial bias to avoid dead neurons [27]. The LeakyReLU activation function f(x) = max(0, x) + a·min(0, x) with slope a = 0.2 is used in every layer of the discriminator except the dense output layer [28]. We regularized the activations and accelerated learning with batch normalization in all layers except the input and output layers. Finally, the Adam variant of stochastic gradient descent with a learning rate of 0.0002 and momentum 0.5 is used in both models.

We used a sigmoid activation f(x) = 1/(1 + e^{-x}) with binary cross-entropy loss for the output layer of the discriminator, and a tanh activation f(x) = (e^x − e^{-x})/(e^x + e^{-x}) for the generator so that it outputs a 128x128x1 Mel-spectrogram in the range [-1, 1].

The Mel-Spec GAN is trained on the pre-processed input audio wave files, and the generator is used to synthesize new plausible spectrograms, which are reconstructed to audio waveforms using the Griffin-Lim algorithm.
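As a sanity check on this design, the spatial shapes produced by a stack of stride-2, 'same'-padded convolutions can be computed directly. The specific depth (five blocks) and the 4x4x1024 dense projection below are our illustrative assumptions in the DCGAN style, not figures reported above:

```python
import math

def conv_out(size, stride=2):
    # With 'same' padding, a strided conv's output size is ceil(in / stride)
    return math.ceil(size / stride)

# Discriminator: stride-2 5x5 convolutions shrink the 128x128 Mel-spectrogram
sizes, s = [128], 128
for _ in range(5):              # five downsampling blocks (illustrative depth)
    s = conv_out(s)
    sizes.append(s)
print(sizes)                    # [128, 64, 32, 16, 8, 4]

# Generator: a dense layer projects the latent vector to a small map,
# e.g. 4 x 4 x 1024 = 16384 units, then stride-2 transposed convolutions
# mirror the path back up to 128x128
assert 4 * 4 * 1024 == 16384
up = [4 * (2 ** i) for i in range(6)]
print(up)                       # [4, 8, 16, 32, 64, 128]
```

This mirrored down/up path is why transposed convolutions alone suffice for upsampling: each stride-2 layer exactly doubles the spatial resolution, reaching 128x128 without any pooling or interpolation.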
4.4 Conditional Mel-Spec GAN (CMSGAN)

Although the Mel-SpecGAN model can fabricate new, random, plausible audio from a given domain, there is no way to control which type of audio is generated other than trying to discover correlations between the generator's latent-space input and the audio produced. Therefore, we took the digit labels (0-9) into consideration to build the Conditional Mel-SpecGAN (CMSGAN). These labels are transformed into one-hot encoded vectors y of dimension (data size x 10). By feeding the class-label vector y into both the generator and the discriminator, the original Mel-SpecGAN is extended to a conditional model [30]. In this way, the improved Mel-SpecGAN can quickly learn the data distribution of each class independently and generate samples in accordance with the given condition label y [31]. The loss function of the modified GAN is given in equation (4).

    min_G max_D V(G, D) = E_{x∼P_data}[log D(x | y)] + E_{z∼P_z}[log(1 − D(G(z | y)))]    (4)

To encode the class labels into the discriminator model [32], we embedded the input class label and passed it through a fully connected dense layer with a linear activation that scales the embedding to the size of the spectrogram (128x128). The embedded input is then reshaped into a single activation map (128x128x1) and concatenated with the spectrogram input as an additional feature map. In the generator model, we concatenated the latent vector and the input class vector before the fully connected layer of 16384 units, matching the activations of the unconditional generator model. The ten label features are appended to the existing (N x 100) latent vector, resulting in an (N x 110) input, which is up-sampled as in the previous model. We then trained the GAN with the latent vector and class label as input, producing a prediction of whether each input was genuine or counterfeit.
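The conditioning described above amounts to two concatenations, which can be sketched with NumPy shapes. The random matrix standing in for the learned label embedding is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, latent_dim, n_classes = 32, 100, 10

# One-hot encode the digit labels (0-9)
labels = rng.integers(0, n_classes, size=N)
y = np.eye(n_classes)[labels]                         # (N, 10)

# Generator side: concatenate latent vector and class vector
z = rng.normal(size=(N, latent_dim))                  # (N, 100)
gen_in = np.concatenate([z, y], axis=1)               # (N, 110)

# Discriminator side: map each label to a 128x128 activation map and
# append it to the spectrogram as a second channel
embed = rng.normal(size=(n_classes, 128 * 128))       # stand-in for the learned embedding
label_map = embed[labels].reshape(N, 128, 128, 1)
spec = rng.normal(size=(N, 128, 128, 1))              # batch of Mel-spectrograms
disc_in = np.concatenate([spec, label_map], axis=-1)  # (N, 128, 128, 2)
print(gen_in.shape, disc_in.shape)
```

In the real model these concatenated tensors feed the dense projection of the generator and the first convolutional block of the discriminator, respectively.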
We optimized the GAN by smoothing the real and fake labels given to the discriminator so that its loss does not converge to 0. Both GAN models are evaluated using the Inception Score as a stopping criterion.

5 Experimental Protocol

5.1 SC09 Dataset and the Effects of Background Noise

Our research focuses on the Speech Commands dataset [33], created by Google Brain from recordings of several speakers uttering single words in unregulated recording environments. We analyzed the subset of the spoken commands "zero" through "nine" and refer to this subset as the Speech Commands Digits dataset (SC09) [12]. Each recording is one second in duration with varying alignment in time. Although this dataset is deliberately analogous to the famous MNIST handwritten-digit dataset, we note that SC09 examples (128x128) are far higher in dimension than MNIST examples (28x28). The ten words contain several phonemes, and two have multiple syllables. The training set includes 1850 utterances of every digit, amounting to 5.3 hours of speech [12].

We quantified the amount of background noise in the underlying dataset using a speech-quality measure called the Signal-to-Noise Ratio (SNR) [34]. Comparing waveforms directly in the time domain requires synchronization of the original and distorted signals; since we converted the time-domain signal to the spectral domain, we instead computed the SNR over short speech segments, typically between 20 and 30 ms long [35]. This method is known as the Segmented Signal-to-Noise Ratio (SegSNR) [35], and it is more reliable and less sensitive to signal alignment than the plain SNR. Certain digits, such as "nine", "six" and "five", have negative SegSNR and are highly prone to errors in synthesis. The ambiguity of alignments, speakers, and recording environments therefore makes this a challenging dataset to model [12].
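A minimal SegSNR sketch: frame both signals into short segments, compute the per-frame SNR in decibels, and average. This assumes a clean reference is available; the 25 ms frame length and the synthetic test signal are our choices for illustration:

```python
import numpy as np

def seg_snr(clean, degraded, sr=16000, frame_ms=25):
    """Segmented SNR: average the per-frame SNR (in dB) over short
    20-30 ms frames rather than comparing whole waveforms at once."""
    n = int(sr * frame_ms / 1000)
    snrs = []
    for s in range(0, len(clean) - n + 1, n):
        sig = clean[s:s + n]
        err = sig - degraded[s:s + n]
        p_sig, p_err = np.sum(sig ** 2), np.sum(err ** 2)
        if p_sig > 0 and p_err > 0:
            snrs.append(10 * np.log10(p_sig / p_err))
    return float(np.mean(snrs))

# One second of a 220 Hz tone with light additive Gaussian noise
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.05 * np.random.default_rng(1).normal(size=sr)
print(round(seg_snr(clean, noisy), 1))   # roughly 23 dB for this noise level
```

Because the average is taken over frames, one badly misaligned or noisy region lowers the score only in proportion to its duration, which is what makes SegSNR less sensitive to alignment than a global SNR.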
5.2 Pre-Processing and Experimental Setup

We parallelized the pre-processing of the audio signals while converting them into Mel-spectrogram images. Functions from the Librosa library were used to load the WAV-formatted audio files, and an audio utility package was built on core Librosa functions to compute the Mel-spectrogram and its inverse. To calculate the STFT, we sampled the audio signals with windows of size n_fft=2048 and hops of size hop_length=512 to transform from the time domain to the frequency domain. We then divided the full frequency spectrum into n_mels=128 evenly spaced mel bands. For each window, we decomposed the magnitude of the signal into frequencies and converted the corresponding magnitudes to a log scale. A uniform spectrogram dimension of (128x128) was maintained by padding the difference with a value of -80 decibels. Finally, we normalized the mel frequencies with their mean and standard deviation and scaled them to the range [-1, 1]. The Mel-spectrogram and its corresponding mean and standard deviation are saved as an NPZ file so that the model's output can later be inverted.

We trained our networks using batches of size 32 on an NVIDIA TESLA P100 GPU in Google Colab Pro. During our quantitative assessment on SC09, our Mel-SpecGAN networks converged by their evaluation criterion (Inception Score) within 5 hours of training (around 80k epochs) and produced speech-like audio after 30k epochs. Our Conditional Mel-SpecGAN networks converged more quickly, within 3 hours (about 50k epochs), and produced better results with a much higher Inception Score.

6 Evaluation Methodology

6.1 Inception Score

Salimans et al. [36] proposed the Inception Score, an empirical metric for measuring the quality and the semantic discriminability of images generated by GAN models [12].
The performance of the GANs is monitored by a pre-trained deep learning classification model, which classifies the generated images and uses the conditional class probabilities to calculate the Inception Score [37]. The Inception Score has a minimum value of 1.0 and a maximum value equal to the number of classes recognised by the classification model.

To measure the Inception Score, we trained an audio classifier on Mel-spectrogram features of the Speech Commands Digits dataset. The pre-processed and normalized mel frequencies were used as input to the network, with the one-hot encoded labels as the output class vectors. We built the classifier with four 2-D convolution-and-pooling layers followed by two dense layers, projecting the result onto a softmax layer with ten classes [12]. The network was compiled with categorical cross-entropy loss and the AdaDelta optimizer. We ran up to 50 epochs with early stopping on the minimum negative log-likelihood of the test set, achieved an accuracy of 99.82%, and saved the model for evaluating the GANs.

To calculate the Inception Score, we first used our pre-trained Mel-spectrogram classifier to estimate the conditional class probabilities of the generated audio spectrograms, p(y|x). The marginal probability p(y) was then calculated as the average of the conditional probabilities over the spectrograms in the group.

    KL(p(y|x) || p(y)) = Σ_y p(y | x) (log p(y | x) − log p(y))    (5)

Combining these quantities, we calculated the Kullback-Leibler divergence (KL divergence) for each spectrogram as the sum, over classes, of the conditional probability multiplied by the log of that probability minus the log of the marginal probability [36][38], as shown in equation (5). The final Inception Score is then computed by taking the exponent of the KL divergence averaged over all generated samples.
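Equation (5) and the exponentiated average can be written compactly. The toy classifier outputs below are fabricated solely to show the metric's two extremes:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) softmax outputs of the pre-trained classifier for N
    generated spectrograms. Returns exp(mean over x of KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)      # marginal p(y), shape (1, C)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident, evenly balanced predictions score near the class count ...
confident = np.eye(10)[np.arange(100) % 10]
print(round(inception_score(confident), 2))      # -> 10.0
# ... while a collapsed generator that always yields one class scores 1
collapsed = np.tile(np.eye(10)[0], (100, 1))
print(round(inception_score(collapsed), 2))      # -> 1.0
```

This is why the score ranges from 1.0 up to the number of classes: it rewards samples that are individually classified with confidence while being diverse across classes as a group.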
6.2 Human Evaluation of Quality

To complement the algorithmic evaluation of the model, we incorporated another evaluation metric, the Mean Opinion Score (MOS). The International Telecommunication Union (ITU) defines the MOS as a numerical rating of the human-judged overall quality of a system (voice or video) [39]. MOS is calculated on a scale of 1 (lowest perceived quality) to 5 (highest perceived quality) as the arithmetic mean of the individual human-assigned scores.

We gathered human judgments by creating a survey on Amazon Mechanical Turk to rate the generated speech. We identified our best Mel-Spec GAN (MSGAN) and Conditional Mel-Spec GAN (CMSGAN) models by their core evaluation metrics and produced random samples: 20 batches of each digit (labelled by the classifier model) for the CMSGAN model and 200 random digit samples for the MSGAN model. We customized a form-based layout using Crowd-HTML to develop the UI for the survey and linked it to the MTurk services for human evaluation. We asked 100 annotators to assign subjective values of 1-5 for sound quality, and we report the scores in Table 1.

7 Results and Discussion

We implemented the MSGAN and CMSGAN models with the Inception Score as the evaluation and stopping criterion. The Mel-Spec GAN achieved an Inception Score of 5.76, while the label-conditioned GAN improved on this with a substantial score of 7.64. We compared these scores with similar adversarial audio synthesis networks, WaveGAN and SpecGAN [12], as shown in Table 1. To corroborate this assessment, we validated the models using a crowd-sourced experiment: while the MSGAN averaged a MOS of 3.01, CMSGAN achieved a higher MOS of 3.74. By leveraging Mel-spectrograms and label embedding, the Inception Score of the proposed model improved by 63.6% relative to WaveGAN.

Table 1.
Comparing Inception Scores of the models

    S.No  Network                               Inception Score  MOS
    1     Real (train)                          9.18             -
    2     Real (test)                           8.01             4.02
    3     WaveGAN (phase shuffle n=2) [12]      4.67             -
    4     SpecGAN [12]                          6.03             -
    5     Mel-Spec GAN (proposed)               5.76             3.01
    6     Conditional Mel-Spec GAN (improved)   7.64             3.74

8 Conclusion

In this paper, we synthesized speech signals using GAN models and introduced a novel methodology for generating audio waveforms. We used a frequency-domain representation of the audio signal, converting wave files to Mel-spectrogram images using the short-time Fourier transform. Incorporating ideas from the image-synthesizing DCGAN, we customized the Mel-Spec GAN architecture to leverage the Mel-spectrogram of the audio signal. The generated Mel-spectrograms are converted back to actual waveforms using the fast Griffin-Lim algorithm, which recovers the missing phase. We then improved the proposed architecture by embedding the label as an input, so that the audio tags condition the output. Not only did the Inception Score of the model improve from 5.76 to 7.64, but the model also converged more quickly. In its current form, Mel-Spec GANs can be used for real-time speech synthesis. In future work, we plan to extend GANs to operate on longer, variable-length audio and multiple accents. By providing a template for speech synthesis and Mel-spectrogram generation models operating on speech signals, we hope that this research will catalyze future GAN audio-synthesis experiments.

9 Acknowledgements

This work is supported in part by Science Foundation Ireland (Grant Nos. SFI/12/RC/2289 P2 and 17/RC-PhD/3482). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp used for this research.

References

1. Saito, Y., Takamichi, S., Saruwatari, H.: Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
26, 84-96 (2018).
2. Pasini, M.: MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms, https://arxiv.org/abs/1910.03713.
3. Donahue, C., McAuley, J., Puckette, M.: Synthesizing Audio with GANs. ICLR, https://openreview.net/forum?id=r1RwYIJPM.
4. Biswas, S., Solanki, S.: Speaker recognition: an enhanced approach to identify singer voice using neural network. International Journal of Speech Technology. (2020).
5. Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A Generative Model for Raw Audio, https://arxiv.org/abs/1609.03499.
6. Saito, Y., Takamichi, S., Saruwatari, H.: Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26, 84-96 (2018).
7. Awad, M., Khanna, R.: Hidden Markov Model. Efficient Learning Machines. 81-104 (2015).
8. Engel, J., Agrawal, K., Chen, S., Gulrajani, I., Donahue, C., Roberts, A.: GANSynth: Adversarial Neural Audio Synthesis, https://arxiv.org/abs/1902.08710.
9. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R.A., Agiomvrgiannakis, Y., Wu, Y.: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783 (2018).
10. Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Ren, F., Jia, Y., Saurous, R.: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis, https://arxiv.org/abs/1803.09017.
11. Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., Moreno, I., Wu, Y.: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, https://arxiv.org/abs/1806.04558.
12.
Donahue, C., McAuley, J., Puckette, M.: Adversarial Audio Synthesis, https://arxiv.org/abs/1802.04208.
13. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.: Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine. 35, 53-65 (2018).
14. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks, https://arxiv.org/abs/1406.2661.
15. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., Lew, M.: Deep learning for visual understanding: A review. Neurocomputing. 187, 27-48 (2016).
16. Deng, L.: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Technometrics. 48, 147-148 (2006).
17. Lan, L., You, L., Zhang, Z., Fan, Z., Zhao, W., Zeng, N., Chen, Y., Zhou, X.: Generative Adversarial Networks and Its Applications in Biomedical Informatics. Frontiers in Public Health. 8, (2020).
18. Allen, D.: Chapter 1. Sounds and Signals. In Think DSP: Digital Signal Processing in Python, First edition, pp. 1-11, Sebastopol, CA: O'Reilly Media, Inc (2016).
19. Kehtarnavaz, N.: Digital Signal Processing System Design. Amsterdam: Academic Press (2008).
20. van den Bogaert, B.: When Frequencies Change in Time; Towards the Wavelet Transform. Data Handling in Science and Technology. 33-55 (2000).
21. Moorer, J.: A note on the implementation of audio processing by short-term fourier transform. 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). (2017).
22. Vetterling, W.T., Press, W.H.: Numerical Recipes in Fortran: The Art of Scientific Computing (Vol. 1). Cambridge University Press (1992).
23. Roberts, L.: Understanding the Mel Spectrogram, https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53.
24.
Stevens, S., Volkmann, J., Newman, E.: A Scale for the Measurement of the Psychological Magnitude Pitch. The Journal of the Acoustical Society of America. 8, 185-190 (1937).
25. Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing. 32, 236-243 (1984).
26. Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, https://arxiv.org/abs/1511.06434.
27. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. Cambridge, MA: MIT Press, pp. 175-250 (2017).
28. Xu, Y., Du, B., Zhang, L.: Can We Generate Good Samples for Hyperspectral Classification? A Generative Adversarial Network Based Method. IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium. (2018).
29. Ramachandran, P., Zoph, B., Le, Q.: Searching for Activation Functions, https://arxiv.org/abs/1710.05941 (2017).
30. Mirza, M., Osindero, S.: Conditional Generative Adversarial Nets, https://arxiv.org/abs/1411.1784.
31. Xu, Y., Du, B., Zhang, L.: Can We Generate Good Samples for Hyperspectral Classification? A Generative Adversarial Network Based Method. IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium. (2018).
32. Denton, E., Chintala, S., Szlam, A., Fergus, R.: Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, https://arxiv.org/abs/1506.05751.
33. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition, https://arxiv.org/abs/1804.03209.
34. Prodeus, A., Didkovskyi, V., Didkovska, M., Kotvytskyi, I., Motorniuk, D., Khrapachevskyi, A.: Objective and Subjective Assessment of the Quality and Intelligibility of Noised Speech. 2018 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology (PIC S&T). (2018).
35.
Mohamed, S.: Objective Speech Quality Measures, http://www.irisa.fr/armor/lesmembres/Mohamed/Thesis/node94.html.
36. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved Techniques for Training GANs, https://arxiv.org/abs/1606.03498.
37. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016).
38. Kullback, S., Leibler, R.: On Information and Sufficiency. The Annals of Mathematical Statistics. 22, 79-86 (1951).
39. ITU-T Rec. P.10/G.100 (11/2017): Vocabulary for performance, quality of service and quality of experience. (2017).