Parameterizing Human Speech Generation

Nazariy Perepichka

Ukrainian Catholic University, Lviv 79007, Ukraine
perepichka@ucu.edu.ua

Abstract. Nowadays, the synthesis of human images and videos is arguably one of the most popular topics in the data science community. The synthesis of human speech is less trendy but is closely tied to that topic. Since the publication of the WaveNet paper in 2016, the state of the art has shifted from parametric and concatenative systems to deep learning models. Every significant paper on the topic mentions a way to parameterize the output audio with different voices and sentiments, though parameterization is not the major focus of those works. Most established solutions require re-training the model to synthesize speech in a voice it has not seen before. In my master's project, I aim to implement a competitive text-to-speech solution, enhance its parameterization abilities, and improve the performance of current models.

Keywords: text-to-speech · deep learning · recurrent neural networks · audio generation

1 Introduction

The speech synthesis problem has a long research history. The desire to generate human speech from written text is easy to understand: the potential areas of application for such systems are enormous, from the generation of audiobooks and voice acting for films to making computer systems more socially accessible.

In the last five years, researchers have made significant progress. The big breakthrough came with the application of deep learning techniques to the task. Every year, new solutions narrow the gap between programmatically generated and human speech samples.

Although the quality of generated speech has increased, human speech possesses multiple parameters: tone of speech, the mood of the speaker, and a melodic component. Replication of these parameters is still not a fully mature research area. The ability to generate natural-sounding, emotional speech could become the next big breakthrough in the history of speech synthesis. In my master's thesis, I plan to research possibilities for speech parameterization and present working solutions for this task.

The rest of this paper is organized as follows. In Section 2, I describe the domain: evaluation metrics, the history of algorithm development, and the results of text-to-speech (TTS) systems. In Section 3, I define the problem for my master's thesis and describe the proposed solution, along with a description of the datasets and the timeline of the research.

2 Domain Description

2.1 Evaluation of Algorithms

The two main goals of a TTS system are intelligibility (the capability of being understood) and naturalness (the ability to mimic human speech). Both evaluation parameters are defined by human perception of the output; therefore, the evaluation of TTS systems requires subjective techniques for quality measurement. The most popular metric is the mean opinion score (MOS), the average grade given to an audio sample by respondents. MOS is the arithmetic mean of user-given ratings and depends on two parameters: the number of respondents N and the grading scale:

MOS = \frac{1}{N} \sum_{i=1}^{N} R_i,   (1)

where R_i is the rating given by the i-th respondent.
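To make equation (1) concrete, the snippet below computes a MOS from a set of listener ratings. It is a small illustrative sketch: the 1-5 grading scale and the sample ratings are assumptions for the example, not data from any of the cited papers.

```python
from statistics import mean

def mean_opinion_score(ratings, scale=(1, 5)):
    """Arithmetic mean of listener ratings R_1..R_N, as in equation (1)."""
    lo, hi = scale
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lo or r > hi for r in ratings):
        raise ValueError(f"ratings must lie within the {lo}-{hi} scale")
    return mean(ratings)

# Hypothetical ratings from N = 6 respondents on a 1-5 scale.
print(mean_opinion_score([4, 5, 4, 3, 5, 4]))  # -> 4.166...
```

In practice the same audio sample is graded by many respondents, so the reliability of the score grows with the number of ratings collected.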
2.2 Classical TTS Systems

Current state-of-the-art solutions have two predecessors, which are considered the classical speech synthesis approaches: concatenative and parametric TTS.

Concatenative systems implement the intuitive idea of composing the final audio out of small pre-recorded samples. Such a system builds its output by concatenating recording units (words, phonemes). This approach satisfies the intelligibility requirement, but it has multiple drawbacks: a large, hard-to-collect unit database; an artificial, so to say "robotic", sound; and hard-coded, rule-based programming.

Parametric systems (Fig. 1) exploit a statistical approach to speech generation. Such systems model synthesized speech based on acoustic and linguistic features; the mathematical model is called the vocoder. Parametric TTS requires feature engineering by hand, which is the main drawback of the approach. Hypothetically, with proper feature selection, such systems should work on the same level as deep learning models, but in practice they perform much worse.

Fig. 1. A generic workflow of a parametric TTS system

2.3 Deep Learning TTS

The deep learning approach to TTS is the natural step forward from parametric systems. The main difference is the replacement of features engineered by humans with features learned by machine learning models. A breakthrough in speech synthesis happened with the publication from Google DeepMind in 2016 [1]. The researchers presented a new architecture, called WaveNet, which operates directly on the raw audio waveform and functions as a vocoder. The joint probability of a waveform is factorized as a product of conditional probabilities, as follows:

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})   (2)

The authors modeled the conditional probability distribution with a stack of convolutional layers (Fig. 2). The output layer produces the conditional probability distribution for x_t via a softmax function.

Fig. 2. Visual representation of WaveNet convolutional layers [1]

In [1], the authors mention the ability to parameterize the output audio by incorporating additional parameters h into the joint probability, as follows:

p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, h)   (3)

For their research, they parameterized the narrator (by passing an encoding of the voice) and the text (by passing linguistic features of the text). The resulting TTS solution significantly outperformed all previous benchmarks (Fig. 3).

Fig. 3. MOS scores presented in the WaveNet paper [1]

Since then, the WaveNet vocoder has been used in most TTS research, which focuses mostly on the representation of parameters. In 2018, Google researchers presented [2], which described the Tacotron 2 architecture for speech synthesis (Fig. 4). The model projects text to mel-scale spectrograms and then uses a modified WaveNet vocoder to generate the audio output itself. This architecture combines the WaveNet and Tacotron 1 models. Tacotron 2 achieved a MOS of 4.53, compared to 4.58 for professionally recorded human speech.

Fig. 4. Tacotron 2 architecture [2]
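To make the spectrogram-as-intermediate-representation idea concrete, the snippet below computes a mel-scale spectrogram from a synthetic waveform and inverts it back to audio with the classical Griffin-Lim algorithm, a non-neural stand-in for the WaveNet vocoder. This is a minimal sketch, assuming numpy and librosa are installed; the sample rate, FFT size, hop length, and number of mel bands are illustrative values, not parameters taken from [1] or [2].

```python
import numpy as np
import librosa

sr = 22050                                   # sample rate in Hz (illustrative)
t = np.linspace(0, 1.0, sr, endpoint=False)
# A synthetic 440 Hz tone stands in for a recorded speech sample.
y = 0.5 * np.sin(2 * np.pi * 440 * t)

# Spectrogram-based models such as Tacotron 2 predict a mel spectrogram from text;
# here we simply compute one from the waveform to have something to invert.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Invert the mel spectrogram back to a waveform with Griffin-Lim,
# a classical (non-neural) vocoder baseline.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256)

print(mel.shape, y_hat.shape)
```

Replacing the Griffin-Lim inversion with a learned vocoder such as WaveNet is exactly the step that separates the classical pipelines from the current neural ones.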
This year, engineers from the Dessa company presented the results of their new TTS model named RealTalk [6]. They published only a YouTube video presenting the generated speech of Joe Rogan (a famous podcast host) and wrote two short articles for the general public with minimal technical details. The output audio mimics the emotions and intonations of the narrator and sounds highly realistic. They claim that their model was trained on a significantly smaller dataset than the state-of-the-art models (8 hours of audio compared to 20 hours) with no loss in the quality of the generated audio.

3 Proposal

3.1 Problem Statement

For my master's thesis, I plan to implement an end-to-end TTS system that generates speech specific to the speaker's voice and tone.

3.2 Proposed Solution

The system will have three main components. The first component will encode sentiment and voice and pass the representation to the second component, the sequential encoder. I want to experiment with possible implementations of sequential encoding; possible solutions and forms of representation are shown in Fig. 5.

Fig. 5. Proposed architecture

For the third component, the vocoder, I want to experiment with wave generation algorithms: from more classical algorithms, like Griffin-Lim, to those more frequently used in current research, like WaveNet. Overall, the model will consist of two parts: the first will represent the input features (voice, sentiment, and text); the second will be the vocoder for audio generation.

3.3 Datasets Description

In my research, I plan to use the following datasets:

VoxCeleb [3] – contains more than two thousand hours of speech from seven thousand different speakers.

Toronto emotional speech set (TESS) [4] – contains 2800 audio samples recorded by two female speakers aged 26 and 64 years. Each sample has a length of two seconds.

The LJ Speech dataset [5] – contains 13,100 audio clips of a single speaker with transcriptions of the text. The total length of the clips is approximately 24 hours.

Self-collected dataset – I plan to collect a transcribed dataset with different voices from YouTube videos. I implemented a script that downloads audio together with the subtitles generated by the platform; a sketch of this step is shown below. By downloading data from specific channels, I can ensure speaker identity.
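A minimal sketch of such a download step is given below. It assumes the youtube-dl command-line tool (or its yt-dlp fork) is installed; the URL and output paths are placeholders, and flag names may vary between tool versions, so this is an illustration rather than the exact script used in the project.

```python
import subprocess
from pathlib import Path

def download_audio_with_subs(video_url: str, out_dir: str = "data/raw") -> None:
    """Fetch a video's audio track and its auto-generated subtitles."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "youtube-dl",                    # or "yt-dlp"
            "--extract-audio",               # keep only the audio stream
            "--audio-format", "wav",
            "--write-auto-sub",              # platform-generated subtitles
            "--sub-lang", "en",
            "--output", f"{out_dir}/%(id)s.%(ext)s",
            video_url,
        ],
        check=True,
    )

# Hypothetical usage: a single video from a channel with a known speaker.
# download_audio_with_subs("https://www.youtube.com/watch?v=<VIDEO_ID>")
```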
3.4 Timeline

Below I describe the timeline of the research:

Present state
─ researched the most significant papers on the topic;
─ built a script for extracting audio and descriptions from YouTube videos;
─ wrote general audio preprocessing scripts;
─ worked with pre-trained models.

October 2019
─ collecting the dataset;
─ modeling;
─ experiments with vocoder input representation.

November 2019
─ experimenting with the vocoder;
─ hyper-parameter tuning;
─ building the pipeline; packaging the model.

December 2019
─ analyzing the results;
─ evaluating the model using MOS;
─ working on the thesis text.

January 2020
─ bringing everything together.

4 Conclusion

The goal of my work is to build a working TTS system with the ability to parameterize the audio output. By incorporating current knowledge in the domain and experimenting with new approaches, I want to present a working solution for my thesis defense.

References

1. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
2. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R.A.: Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4779–4783. IEEE Press, New York (2018)
3. Nagrani, A., Chung, J.S., Zisserman, A.: VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017)
4. Dupuis, K., Pichora-Fuller, M.K.: Toronto Emotional Speech Set (TESS). University of Toronto, Psychology Department. https://tspace.library.utoronto.ca/handle/1807/24487
5. Ito, K.: The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/
6. RealTalk: We Recreated Joe Rogan's Voice with AI. https://dessa.com/realtalkwerecreated-joe-rogans-voice-with-ai/