Deep Learning Models for Ukrainian Text to Speech Synthesis

Serhii Kondratiuk a,b, Danylo Hartvih c, Iurii Krak a,b, Olexander Barmak d, and Vladislav Kuznetsov a

a Glushkov Cybernetics Institute, Kyiv, 40, Glushkov ave., 03187, Ukraine
b Taras Shevchenko National University of Kyiv, Kyiv, 64/13, Volodymyrska str., 01601, Ukraine
c Technical University of Munich, Arcis str. 21, München D-80333, Germany
d Khmelnytskyi National University, 11, Institutes str., Khmelnytskyi, 29016, Ukraine

Abstract
The developed technology performs text-to-speech generation for the Ukrainian language. The implemented technology first predicts a mel-spectrogram from a character sequence and then reconstructs the final audio signal from the resulting acoustic features. As part of the research, an overview and comparison of existing text-to-speech technologies for the Ukrainian language was conducted. Deep learning models were used for both stages: Tacotron2 for mel-spectrogram generation and Parallel WaveGAN for voice audio generation. The research also contains a comparison with other architectures and vocoders, such as FastSpeech2 and WaveNet. In addition, grapheme-to-phoneme translation was improved in order to obtain better pronunciation quality for the Ukrainian language. The experiments include the use of models pre-trained on English and other languages in order to leverage transfer learning. The research also contains a review and analysis of existing open-source datasets for the Ukrainian language. A mean opinion score metric was adopted to evaluate the final solution on a statistically significant variety of generated test samples and participants, based on defined grading criteria. The experiments show a dependency on the training vocabulary of the dataset, and further development implies augmenting the dataset with different topics. The best configuration found for all deep learning models, based on the training and testing results, is also reported, together with the hardware requirements needed to reproduce the training. The experiments show satisfying quality of the generated voice, trained on a specially collected and processed dataset of a single male voice. Twenty people participated in the testing experiments; each person evaluated about 100 generated sound samples of 40 to 60 seconds in length, and a mean opinion score of 4.02 was achieved.

Keywords
Text to speech, deep learning synthesis, Tacotron2, mel-spectrogram, Parallel WaveGAN

1. Introduction
Text to speech has many applications; it is a widespread technology that can be used to help people with a wide range of disabilities. It can also be used in entertainment production to make voice acting cheaper. Multiple high-quality text-to-speech solutions exist, such as [1], [2]; however, they typically provide high quality only for very widespread languages. Speech generation for less widespread languages, such as Ukrainian, is a harder task, while the demand is only growing. It is important to provide an open-source framework that contains a pipeline to train new voice models for the Ukrainian language in a convenient way. It is also vital to show that the pipeline provides high-quality results on test voice samples.
Let us note that for more effective speech generation one needs to investigate cognitive linguistic analysis [3], including phonetic analysis and the grammatical structure of the language [4]-[6], and speech signal marking and segmentation [7].

There are only a few existing solutions for Ukrainian Text-to-Speech (TTS) generation. The majority of them [8]-[10] are formant synthesis models (so-called rule-based synthesizers). Those approaches have the advantage of low resource usage and high-speed synthesis, at the cost of generating artificial, robot-like speech. They are commonly used as screen readers for visually impaired people. There are also two deep learning solutions for the Ukrainian language: the WaveNet-based service from Google Cloud Text-to-Speech [11] and Nuance Vocalizer TTS [12]. However, they are proprietary, and it is not possible to train a new voice with those cloud providers.

Speech synthesis from text in a single end-to-end step is a very sophisticated task; such a model would be hard to train and interpret. Synthesis is therefore commonly split into two stages. The first stage is to train a synthesizer (also known as an encoder-decoder architecture) on character sequences in order to predict mel-spectrograms [13] (a low-level acoustic representation). The second stage is to train a neural vocoder (waveform generator) model, which uses the acoustic features from the previous step to reconstruct the audio signal (the final voice sample).

In order to obtain a reduced-dimensionality view, we opted to use the short-time Fourier transform (STFT). It applies to data such as audio and other signals. Using this transform, we can examine the underlying nature of the data in detail, focusing mostly on the low-frequency, more pronounced details of the signal rather than on high-frequency details, which may contain artifacts due to the discretization process or the window size.

Thus, the purpose of this work is to develop an information technology for text-to-speech generation for the Ukrainian language. A technology is proposed that, given a sequence of speech symbols, predicts mel-spectrograms and, based on the composed acoustic features, transforms them into a voice signal. To implement these stages of the research, it is proposed to use modern deep learning architectures: Tacotron2 for mel-spectrogram generation and Parallel WaveGAN for voice audio generation. Comparison with other architectures and vocoders, such as FastSpeech2 and WaveNet, showed better results for the proposed approach. The experiments show a dependence on the training dictionary of the dataset, and further development involves supplementing the dataset with various topics.
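To make the two-stage structure concrete, the sketch below outlines the inference flow with hypothetical wrapper classes (`Synthesizer`, `Vocoder`, a `g2p` front end); it is illustrative only and does not reproduce the authors' implementation or any real library API.

```python
# Illustrative sketch of the two-stage pipeline: text -> mel-spectrogram -> waveform.
# All classes and function names here are hypothetical placeholders.
import numpy as np

class Synthesizer:
    """Stage 1: phoneme sequence -> mel-spectrogram (stands in for Tacotron2)."""
    def infer(self, phoneme_ids: list) -> np.ndarray:
        # Only the expected output shape is illustrated: (frames, 80 mel bins).
        return np.zeros((len(phoneme_ids) * 5, 80), dtype=np.float32)

class Vocoder:
    """Stage 2: mel-spectrogram -> raw waveform (stands in for Parallel WaveGAN)."""
    def infer(self, mel: np.ndarray, sr: int = 22050) -> np.ndarray:
        hop = int(0.0125 * sr)                      # 12.5 ms hop, as used in the paper
        return np.zeros(mel.shape[0] * hop, dtype=np.float32)

def text_to_speech(text: str, g2p, synthesizer: Synthesizer, vocoder: Vocoder) -> np.ndarray:
    """Full pipeline: text -> phonemes -> mel-spectrogram -> waveform."""
    phoneme_ids = g2p(text)                         # grapheme-to-phoneme front end
    mel = synthesizer.infer(phoneme_ids)            # acoustic model
    return vocoder.infer(mel)                       # neural vocoder
```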
Grapheme-to-phoneme translation has also been improved for better pronunciation quality in Ukrainian.

The article is structured as follows: introduction; an overview of existing approaches to machine processing of textual information using different deep learning architectures, indicating their advantages and disadvantages; a detailed description of the proposed approach; a description of the datasets used for training; the experiments and an analysis of the obtained results; conclusions with an assessment of the obtained results and plans for further research.

2. Related Works
Predicting audio signals directly from text in one step is a complex and sophisticated task. That is why deep learning synthesis is split into two stages. In the first stage, the synthesizer uses preprocessed character sequences to predict mel-spectrograms (a low-level acoustic representation). In the second stage, the neural vocoder model uses the acoustic features (mel-spectrograms) to reconstruct audio signals.

We applied the STFT with a 50 ms window to the studied data. The parameters passed to the method are the following: the hop is 12.5 milliseconds, and the window function used to compute the spectrogram is the Hann window (the overall result is shown in Fig. 1). We also applied an additional transformation: the spectrogram was scaled logarithmically with 80 channels in order to obtain a proper mel-spectrogram representation of the input data.

It is a commonly known fact that humans perceive pitch (musical or otherwise) differently in the low and high ranges of the audible scale. A lower sound (or "note") is easier to distinguish from a neighbouring one than a higher sound: the lower the pitch, the greater the ability to distinguish two pitches. For instance, humans are likely to distinguish the tones of heavy electric machinery (which lie at multiples of the mains frequency of 50/60 Hz) from the sounds of pulse-modulation (switch-mode) AC-DC adapters, some of which range well beyond 10000 Hz and cannot be heard by elderly people or people with hearing problems.

Based on careful study, a solution was found for organizing the sound spectrum. Around 1937, a team of scientists including Stevens, Volkmann and Newman suggested a metric for measuring distances between different pitches of the sound spectrum. To do so, they studied how different pitches relate to one another for a human listener. This metric, now commonly known as the mel scale (see Fig. 2), plots frequencies in the way a human would perceive them; it also defines straightforward forward and backward transformations, so it is possible to convert an STFT diagram to a mel diagram and vice versa. In the proposed research we apply this mathematical operation on frequencies to convert them to the mel scale.

Figure 1: Fast Fourier transformation visualization

The logarithmic transformation of the spectrogram onto the mel scale is a well-known technique for dimensionality reduction of sound signals. Although it discards some information about the signal in the complex time-frequency domain, it is useful for denoising and therefore finds various areas of application. We must emphasize that, although an inverse transformation exists, the result of the backward transformation is not equal to the original signal, since the transformation is lossy.
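A minimal sketch of the STFT-to-log-mel transform described above is shown below, using the parameters stated in the text (50 ms window, 12.5 ms hop, Hann window, 80 mel channels, logarithmic compression) and the 22050 Hz sampling rate used for the datasets later in the paper. One common closed form of the mel mapping is m = 2595·log10(1 + f/700). The librosa library is assumed here as the tooling; the paper does not name the library actually used.

```python
# Sketch: audio -> STFT -> 80-channel log-mel spectrogram (parameters from the paper).
import librosa
import numpy as np

def log_mel_spectrogram(wav: np.ndarray, sr: int = 22050) -> np.ndarray:
    n_fft = int(0.050 * sr)         # 50 ms analysis window
    hop_length = int(0.0125 * sr)   # 12.5 ms hop
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=n_fft, hop_length=hop_length, window="hann",
        n_mels=80, power=2.0,
    )
    # Logarithmic compression (to decibels) yields the mel-spectrogram
    # that the synthesizer is trained to predict.
    return librosa.power_to_db(mel, ref=np.max)

# Usage:
# wav, sr = librosa.load("sample.wav", sr=22050)
# mel = log_mel_spectrogram(wav, sr)          # shape: (80, n_frames)
```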
To overcome this problem, different approaches have been proposed; for instance, in [14, 15] one can find approaches that use Griffin-Lim or neural vocoders to obtain a more accurate reconstruction of the raw signal. A mel spectrogram produced with the techniques discussed above is shown in Fig. 3.

Figure 2: Mel scale
Figure 3: Mel spectrogram

The two most popular state-of-the-art synthesizers are Tacotron2 [16] by Google and FastSpeech2 [17] by Facebook.

Tacotron2 is an encoder-attention-decoder neural network, where the encoder converts the input sequence into a hidden feature representation (the encoder has 3 convolutional layers and a bidirectional Long Short-Term Memory (LSTM) layer). The attention module then summarizes the encoded sequence into a fixed-length vector, which the decoder consumes to predict the audio features (the mel-spectrogram). (The decoder is an autoregressive recurrent neural network with 2 fully connected layers as a pre-net and 5 convolutional layers as a post-net.) Since the model is only capable of predicting fixed-length output, it also predicts a stop token in parallel, which allows the model to terminate generation earlier. The original Tacotron2 is a single-speaker model, meaning that one model is able to generate only one voice, but there is an improved version of Tacotron that takes a speaker voice embedding as a second input [21]. The Tacotron architecture is shown in Fig. 4.

Let us discuss the FastSpeech architecture in more detail. It consists of transformers as well as encoder-decoder blocks. This architecture has a variance adaptor block and 4 transformer sub-blocks within each encoder and decoder. The main purpose of the encoder is to transform the data from the original dimension into a hidden one. The variance module is used to process such a complex signal as speech through different parameters, for instance pitch, duration, energy density and others.

Figure 4: Tacotron architecture

In order to correctly reconstruct the original signal from the hidden representation, the decoder learns the hidden features and the inverse transformation function, also using the variance module. There are many implementations and modifications of this method, including FastPitch and MultiSpeech; while based on the FastSpeech architecture, they differ in purpose: the former is more suited to pitch prediction, while the latter targets utterances with multiple speakers. The FastSpeech architecture is shown in more detail in Fig. 5.

Figure 5: Fastspeech2 architecture

Let us emphasize the advantages of FastSpeech, based on Fig. 5. Firstly, it has a straightforward architecture, consisting of an encoder and decoder as well as transformers that learn the transformation of the data while minimizing the error. Although such an architecture, as found in autoencoders, may eliminate some information due to the denoising effect of autoencoders in general, it can be helpful and precise enough for sound and speech analysis tasks. We also want to emphasize that this architecture is used not only to transform the signal into a hidden feature space, but also to stack so-called FFT blocks (feed-forward Transformer blocks) along the input sequence. To perform this task, the transformer makes use of a self-attention layer as well as convolution. The arrangement of the blocks is dictated by the level of detail of the features at each layer of the network.
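For illustration, a sketch of a Tacotron2-style encoder as described above (character embedding, convolutional layers, bidirectional LSTM) is given below in PyTorch. The dimensions (512-dim embedding, 256 LSTM units per direction, 3 convolutional layers) follow the published Tacotron2 description and are assumptions, not the exact configuration trained in this work.

```python
# Sketch of a Tacotron2-style text encoder: embedding -> conv stack -> BiLSTM.
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    def __init__(self, n_symbols: int, emb_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(3)
        ])
        # Bidirectional LSTM with 256 units per direction -> 512-dim outputs
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        # symbol_ids: (batch, time) integer character/phoneme indices
        x = self.embedding(symbol_ids).transpose(1, 2)   # (batch, emb, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                            # (batch, time, emb)
        outputs, _ = self.lstm(x)                        # hidden representation
        return outputs                                   # consumed by attention/decoder

# Example: TacotronEncoder(n_symbols=80)(torch.randint(0, 80, (2, 50))).shape -> (2, 50, 512)
```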
While the layers closer to the input encode low-level pitch features, the later ones encode phonetic and spectrogram-level features. In order to adapt to speech, one can fine-tune the hyperparameters so as to better handle utterances and phonemes of different lengths.

To discuss the architecture further, let us focus on some of its parameters, depicted above. Since a phoneme can be encoded by multiple spectrogram frames, the phoneme sequence is generally shorter than the frame sequence itself. Hence, the length hyperparameter restricts the overall number of frames used to encode each phoneme. This parameter is also used to implement a dynamic transformation of the timescale of the hidden sequences so that they correspond to the desired length. One way to control this parameter is to estimate the speed of the speech per fixed amount of time, so that the length parameter can be controlled "on the fly". It can also be used to skip unnecessary parts of the signal that are unlikely to contain useful information. The key idea behind this technique is an automatic regulator that is fed the estimated length of each phoneme and hence controls the "speed" of the transformation, which obviously would not work if the length were a fixed value. The regulator relies on a block consisting of a convolutional and a shallow layer, as shown above. The duration predictor used for sequence regulation is trained on the prediction error and is connected to the FFT block (bottom left in the figure). Hence, we can say that the model, being controlled by the speaker tempo, relies strongly on the attention between the forward and inverse encoding parts of the model.

3. Proposed technique
We need to underscore that some problems may occur during the text conversion phase if the speaker is hard to hear, especially in a noisy environment. Sometimes phonemes in the sequence can be confused with others, for instance at the transition from one sound to another. We suggest treating this situation carefully, since sometimes the same written letters can correspond to similar sounds or phonemes. Hence, it is very important to capture not only the raw transcription but also to allow for a margin of error, such as mispronunciation. A phoneme does not have an independent lexical or grammatical meaning but serves to distinguish and identify significant units of the language (morphemes and words). The Ukrainian language has 33 graphemes, of which 21 are consonants and 12 are vowels. The phonetic system of the modern Ukrainian language has 38 basic phonemes: 6 vowels and 32 consonants; additionally, 10 double consonant phonemes are distinguished in the peripheral subsystem.

Figure 6: TTS system architecture

We used an encoder-decoder recurrent neural network (RNN) [19] with attention [20] to convert the grapheme representation of words into phonemes. Encoder-decoder RNNs are usually used for machine translation, and since the grapheme-to-phoneme conversion task is similar to translation, it is a natural fit. The encoder and decoder consist of one bidirectional LSTM layer each with a hidden dimension of 256, while the decoder also has an attention layer. The high-level architecture of our system is shown in Fig. 6.

Let us discuss the limitations of the model. Obviously, one should be aware of the fixed length of the input audio sequence.
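The following sketch mirrors the grapheme-to-phoneme configuration stated above (one bidirectional LSTM encoder layer, hidden size 256, and a decoder with additive Bahdanau-style attention [20]). It is an illustrative reimplementation, not the authors' code; a unidirectional decoder cell is used here for simplicity.

```python
# Sketch of a G2P encoder-decoder RNN with additive attention (hidden size 256).
import torch
import torch.nn as nn

class G2PSeq2Seq(nn.Module):
    def __init__(self, n_graphemes: int, n_phonemes: int, hidden: int = 256):
        super().__init__()
        self.hidden = hidden
        self.src_emb = nn.Embedding(n_graphemes, hidden)
        self.tgt_emb = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.LSTM(hidden, hidden, num_layers=1,
                               batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        # Additive (Bahdanau-style) attention over encoder states
        self.attn = nn.Linear(2 * hidden + hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1, bias=False)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        # graphemes: (B, S) grapheme ids; phonemes: (B, T) teacher-forced phoneme ids
        enc_out, _ = self.encoder(self.src_emb(graphemes))            # (B, S, 2H)
        B, S, _ = enc_out.shape
        h = enc_out.new_zeros(B, self.hidden)
        c = enc_out.new_zeros(B, self.hidden)
        logits = []
        for t in range(phonemes.size(1)):
            # Attention weights over source positions, conditioned on the decoder state
            query = h.unsqueeze(1).expand(-1, S, -1)                  # (B, S, H)
            scores = self.attn_v(torch.tanh(
                self.attn(torch.cat([enc_out, query], dim=-1))))      # (B, S, 1)
            weights = torch.softmax(scores.squeeze(-1), dim=-1)       # (B, S)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (B, 2H)
            step_in = torch.cat([self.tgt_emb(phonemes[:, t]), context], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                             # (B, T, n_phonemes)
```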
Since the model is affected by the attention between the forward and inverse encoding, as noted before, it is also important to carefully process utterances of different lengths. Since the regulator adapts not only to the timescale of the speaker but also affects the timescale of the utterance to some degree, this may be a difficult problem to overcome. Hence, we may encounter a situation where the same utterance has a completely different timescale, and some phonemes may be represented by a varying number of frames even though they belong to the same phrase. As a consequence, we suggest addressing this problem in detail in the sequence decoding task, with the same level of attention as is applied in more complex tasks such as text-to-text or audio-to-audio translation. Fig. 7 shows the attention layers used in the technology.

Figure 7: Attention layer

What are the main differences between Tacotron2 and the FastSpeech2 architecture discussed here in terms of how everything fits together? The key is what drives the generation: while Tacotron2 relies on autoregression to generate spectrograms, FastSpeech relies on the self-attention in the FFT blocks used in the forward and inverse encoding [18]. As a consequence, we must remember that FastSpeech2 is not a one-size-fits-all method: it has to be trained more carefully, since it must not only predict the duration of the segments but also the spectrogram frames computed from the input audio; if the duration is not predicted correctly, we can observe an overlapping effect, which may drastically decrease the accuracy of the model. As shown in [21], FastSpeech2 overcomes some controllability issues of Tacotron2; however, the reported metrics show comparable performance. In the implemented technology we settled on Tacotron2 because it was easier to integrate with the other parts of the pipeline.

As noted above, the original Tacotron2 is a single-speaker model; there is an enhanced version of Tacotron that takes the speaker's voice embedding as a second input, which leads to a minor drop in quality, so in this research we ended up using the original Tacotron2 implementation.

One of the problems with Tacotron2 is the instability of the gate layer, which is responsible for stopping generation; if it does not work properly, the decoder continues to generate mel-spectrogram frames until it reaches the limit (max length). To solve this problem, we add an End Of Sentence (EOS) symbol at the end of each input sequence. Another problem is the instability of the attention mechanism in the original Tacotron2 implementation. To overcome this issue we suggest the following approach: we first apply diagonal guided attention, so that the model can learn faster. To control the training process, we experimentally determined the desirable number of iterations (in the literature this number varies and is typically around 20,000). As a result, a penalty is applied to the attention matrix when it deviates from a diagonal. As part of the text-to-speech synthesis system, we used a voice coder-decoder (vocoder), which generates the audio from the spectrograms.
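One common formulation of the diagonal guided-attention penalty mentioned above (following Tachibana et al.) is sketched below: attention weights far from the diagonal are penalised, encouraging a monotonic text-to-frame alignment. The sharpness value g = 0.2 is an assumption for illustration, not a value reported in this paper.

```python
# Diagonal guided-attention penalty: punishes attention mass off the diagonal.
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """attn: (batch, decoder_steps, encoder_steps) attention matrix."""
    _, T, N = attn.shape
    t = torch.arange(T, device=attn.device).float() / max(T - 1, 1)
    n = torch.arange(N, device=attn.device).float() / max(N - 1, 1)
    # Weight grows with distance from the diagonal t/T == n/N
    w = 1.0 - torch.exp(-((n[None, :] - t[:, None]) ** 2) / (2 * g * g))
    return (attn * w.unsqueeze(0)).mean()

# During training this term is added to the reconstruction loss, e.g.:
# loss = mel_loss + guided_attention_loss(alignments)
```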
Using approaches such as the one proposed by Griffin and Lim, we can reconstruct raw audio from the inverse (mel) representation, which was not possible in early implementations of the logarithmic transformation due to the loss of the complex phase of the STFT spectrogram; a neural network can also learn this reconstruction from data. We also want to emphasize that, instead of using the decoder by itself, we can go through a latent feature-space representation and then to audio. We tried various types of vocoders, including MelGAN, HiFi-GAN and WaveNet [22-25]; however, none of them fully met our needs. As a result, we settled on Parallel WaveGAN. Its architecture is depicted in Fig. 8.

Figure 8: Vocoder schema used

To perform our task, as noted before, we used Parallel WaveGAN. The main benefit of this method is that it uses a relatively small GAN, which is always welcome when the system already relies on many time- and memory-consuming libraries. Using this method, we can avoid the drawbacks we noticed in the Tacotron2 architecture and benefit from loss functions that better represent the features of human speech, such as the distribution of voice information within the output signal, as well as the use of spectrograms at different resolutions. Another positive aspect is density distillation, which is not used in the other models discussed above.

4. Datasets
Tacotron2 training requires a dataset in a special format, as in LJSpeech [26]: audio segments from 2 to 20 seconds in length with a corresponding transcription for every audio file. There is only one open-source dataset for the Ukrainian language in the correct format for the text-to-speech task: the M-AILABS Speech Dataset. It consists of audio books split into short segments with transcriptions; it has several speakers, but for Tacotron2 training we chose only the one with the best audio quality and at least 15 hours of data. Moreover, we preprocessed all audio: noise reduction was performed with a Fourier transformation, followed by volume level normalization, conversion to a mono channel and resampling to a 22050 Hz sampling rate. The dataset was split into 3 parts: N samples for training with total duration N, M samples (M hours) for validation and L samples (L hours) for testing. We transformed the transcriptions into phoneme sequences using the model described above (grapheme-to-phoneme [27] translation). Since we predict mel-spectrograms, the raw audio was also transformed.

To train a universal vocoder we used the open-source Common Voice dataset [28], which also covers the Ukrainian language. It is a multi-speaker dataset created by Mozilla; it has 615 unique voices and 56 hours of validated data. For better results we mixed Common Voice with 10% of the audio from the dataset used for Tacotron2. All audio was converted to a 22050 Hz sampling rate. The dataset was split into 3 parts: N samples for training with total duration N, M samples (M hours) for validation and L samples (L hours) for testing. We transformed the raw waveforms into mel-spectrograms, which are the input data to the vocoder model.
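As a rough illustration of the preprocessing steps listed above, the sketch below converts a recording to mono, resamples it to 22050 Hz, applies spectral noise reduction and normalises the volume. The noisereduce library (spectral gating over an STFT) is an assumed stand-in for the Fourier-based denoising mentioned in the paper, not necessarily the tool actually used.

```python
# Sketch: mono conversion, resampling to 22050 Hz, denoising, peak normalisation.
import librosa
import numpy as np
import noisereduce as nr
import soundfile as sf

def preprocess(path_in: str, path_out: str, sr: int = 22050) -> None:
    wav, _ = librosa.load(path_in, sr=sr, mono=True)   # resample + mix down to mono
    wav = nr.reduce_noise(y=wav, sr=sr)                # STFT-based spectral gating
    peak = np.max(np.abs(wav))
    if peak > 0:
        wav = 0.95 * wav / peak                        # volume level normalisation
    sf.write(path_out, wav, sr)

# Usage: preprocess("raw/segment_001.wav", "clean/segment_001.wav")
```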
5. Experiments, results and discussions
Training and inference were performed in Google Colab [29], using the following hardware:
• GPU: Nvidia P100 16GB
• CPU: 2x Intel Xeon CPU @ 2.20GHz
• RAM: 13 GB

Tacotron2 was trained for 10 days, with 125,000 iterations and a batch size of 32, with the following architecture:
• emb_hidden: 512
• encoder: 5 conv layers; LSTM with 256 units
• decoder: 2 pre-net layers; 1 LSTM with 1024 units; attention dim 128; post-net with 5 conv layers

Fastspeech2 was trained for 35 hours, with 200,000 iterations and a batch size of 32, with the following architecture:
• emb_hidden: 384
• encoder: 4 hidden layers, 2 attention heads
• decoder: 4 hidden layers, 2 attention heads
• variance predictor: 2 conv layers, filter 256, kernel size 3, dropout 0.5

Parallel WaveGAN was trained for approximately 5 days, with 600,000 iterations and a batch size of 16. The training charts (total loss and regularization loss) are shown in Fig. 9.

In contrast to the training, validation and test metrics tracked during training, it is hard to evaluate a TTS engine with a strict metric, because only a human can truly evaluate whether the output sound meets a couple of requirements:
• natural sound / no "robotic" sounding
• no noise
• no interruptions

To score our solution, we used the commonly applied Mean Opinion Score (MOS) metric. This metric is used in various applications and, in simple words, can be compared to school grades from 1 to 5; we use it to compute the average over different parameters of the model. We must also underscore that this metric relies strongly on the opinions of experts in the area, yet it is a reasonable method for obtaining an approximate ranking. We use the common ratings ranging from 5 (Excellent) through 4 (Good), 3 (Fair) and 2 (Poor) to 1 (Bad), as in the absolute category rating (ACR) scale. Since the highest rank (5) can rarely be achieved on average due to rating variability, we consider every score above 4.3 good enough to be called excellent; we also want to exclude poor-quality examples, so we do not consider anything below 3.5 acceptable.

During our experiment, with a group of 20 participants, we asked each one to rate 100 sequences, described in more detail in Table 1; overall we achieved a score of 4.02, which is quite good for our task according to the metric. Based on feedback from the test group and the overall testing results, the following insights were obtained:
• Dependency on the training vocabulary: since the technology was trained on fiction texts, news texts were synthesized with generally worse quality;
• Pretrained weights had a positive impact on the training process: models pretrained on other languages converged faster for Ukrainian.

Figure 9: Training charts

Table 1
Details about testing statistics
Characteristic              Value
People participated         20
Evaluated samples           2000
Total duration, minutes     1500
Highest MOS per person      4.6
Lowest MOS per person       3.8

6. Conclusions
As a result of this work, a novel engine for text-to-speech synthesis for the Ukrainian language was proposed and implemented, and the accuracy of the method was assessed. The architecture consists of two main parts (encoder/decoder and vocoder) and a smaller preprocessing part for phonemes. A novel approach to Ukrainian grapheme-to-phoneme conversion was presented and implemented as part of the technology. During implementation, multiple approaches and models were considered, such as Tacotron2, FastSpeech2, Parallel WaveGAN, MelGAN, HiFi-GAN, WaveNet and others. Based on the experimental results, Tacotron2 and Parallel WaveGAN were selected. An appropriate metric (MOS) was adopted and measured as part of the work. Statistically significant testing was performed with 20 people, and a MOS of 4.02 was achieved. Among the limitations of the proposed approach, it should be noted that significant computing resources are required for implementation.
Further development of the text-to-speech engine will address improving its versatility by extending the dataset.

7. References
[1] Amazon AWS: Polly. URL: https://aws.amazon.com/en/polly/
[2] Microsoft Azure: Text to speech. URL: https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/
[3] V. Kovtun, O. Kovtun. System of methods of automated cognitive linguistic analysis of speech signals with noise. Multimedia Tools and Applications, Springer, 2022. https://doi.org/10.1007/s11042-022-13249-5
[4] V. Kovtun, O. Kovtun, A. Semenov. Entropy-Argumentative Concept of Computational Phonetic Analysis of Speech Taking into Account Dialect and Individuality of Phonation. Entropy, vol. 24, no. 7, 2022, 1006. https://doi.org/10.3390/e24071006
[5] Iu. Krak, O. Barmak, O. Mazurets. The practice implementation of the information technology for automated definition of semantic terms sets in the content of educational materials. CEUR Workshop Proceedings 2139, pp. 245-254, 2018. doi:10.15407/pp2018.02.245
[6] I.G. Kryvonos, Iu.V. Krak, O.V. Barmak, R.O. Bagrii. New Tools of Alternative Communication for Persons with Verbal Communication Disorders. Cybern. Syst. Anal. 52(5), 655-673, 2016. doi:10.1007/s10559-016-9869-3
[7] Y. Rashkevych, D. Peleshko, I. Pelekh, I. Izonin. Speech signal marking on the base of local magnitude and invariant segmentation. Mathematical Modeling and Computing, 2014, 1(2), pp. 234-244. URL: https://ena.lpnu.ua/handle/ntb/26455
[8] UkrVox. URL: https://biblprog.org.ua/ru/ukrvox/
[9] URL: https://gud.rv.ua/
[10] RHVoice. URL: https://rhvoice.su/
[11] Google Cloud: Text to speech. URL: https://cloud.google.com/text-to-speech
[12] Cerence/Nuance TTS Ukrainian. URL: https://nextup.com/cerence/
[13] Mel-spectrogram. URL: https://en.wikipedia.org/wiki/Mel_scale
[14] Griffin-Lim Algorithm. URL: https://paperswithcode.com/method/griffin-lim-algorithm
[15] J. Yu, et al. DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer. Entropy, 2023, 25(1):41. https://doi.org/10.3390/e25010041
[16] J. Shen, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. URL: https://arxiv.org/pdf/1712.05884.pdf
[17] Y. Ren, C. Hu, X. Tan, T. Qin. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. URL: https://arxiv.org/pdf/2006.04558.pdf
[18] Y. Ren, et al. FastSpeech: Fast, Robust and Controllable Text to Speech. Advances in Neural Information Processing Systems, 2019. URL: https://proceedings.neurips.cc/paper/2019/file/f63f65b503e22cb970527f23c9ad7db1-Paper.pdf
[19] K. Cho, et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. URL: https://arxiv.org/abs/1406.1078
[20] D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. URL: https://arxiv.org/abs/1409.0473
[21] Tacotron2 vs FastSpeech2. URL: https://towardsdatascience.com/text-to-speech-with-tacotron-2-and-fastspeech-using-espnet-3a711131e0fa
[22] K. Kumar, et al. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. URL: https://arxiv.org/abs/1910.06711
[23] J. Shen, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 4779-4783. doi:10.1109/ICASSP.2018.8461368
[24] J. Kong, J. Kim, J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High-Fidelity Speech Synthesis. URL: https://arxiv.org/abs/2010.05646
[25] A. van den Oord, et al. WaveNet: A Generative Model for Raw Audio. URL: https://arxiv.org/abs/1609.03499
[26] K. Park, T. Mulc. CSS10: A collection of single speaker speech datasets for 10 languages. URL: https://arxiv.org/pdf/1903.11269.pdf
[27] S.O. Arik, et al. Deep Voice: Real-time Neural Text-to-Speech. Proceedings of the 34th International Conference on Machine Learning, 70:195-204, 2017. https://proceedings.mlr.press/v70/arik17a.html
[28] Mozilla Foundation: Common Voice Dataset. URL: https://commonvoice.mozilla.org/en/datasets
[29] Google Research: Google Colab. URL: https://colab.research.google.com/