Enhancing Controllability of Text Generation

Anton Shcherbyna1 and Kostiantyn Omelianchuk2

1 Ukrainian Catholic University, Faculty of Applied Sciences, Lviv, Ukraine
a.shcherbyna@ucu.edu.ua
2 Grammarly, Kyiv, Ukraine
komelianchuk@gmail.com

Abstract. Many models are used to generate text conditioned on some context. However, these approaches do not provide the ability to control various aspects of the generated text, such as style, tone, language, tense, sentiment, length, and grammaticality. In this work, we explore unsupervised ways to learn disentangled vector representations of sentences with interpretable components and attempt to generate text in a controllable manner based on the obtained representations.

Keywords: natural language processing · natural language understanding · representation learning · text generation · unsupervised learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of the 1st Masters Symposium on Advances in Data Mining, Machine Learning, and Computer Vision (MS-AMLV 2019), Lviv, Ukraine, November 15-16, 2019, pp. 98-105.

1 Introduction

1.1 Text Generation Overview

In recent years, there has been significant advancement in the field of text generation. In 2014, sequence-to-sequence models with an LSTM encoder and decoder were proposed [1]. This approach became the state of the art in the field and was successfully applied to various tasks, e.g., machine translation. However, LSTM networks tend to forget information from earlier parts of the sequence, so the next significant improvement, the attention mechanism, was proposed [2]. Its main idea is to give the decoder direct access to the information from each token of the source sequence and to score each piece of information by its usefulness for the decoder. Finally, a purely attentional model, called the Transformer, was proposed [3]. Since then, transformer-like models have become the state-of-the-art methods in text representation learning and text generation. For example, BERT, released by Google [4], became the standard for extracting representations from text, and GPT-2, developed by OpenAI [5], became the most powerful tool for text generation. In the case of GPT-2, the authors initially released weights only for a small model with limited capabilities, arguing that the full model produces texts of such high quality that it could be misused to produce fakes.

All these models have a similar structure. A typical text generation model consists of an encoder E_θ(x) and a decoder D_φ(h). Both the encoder and the decoder can be represented by a deep neural network: an LSTM [1], a CNN [6], or stacked feed-forward and attention layers, which form a transformer-like model [3]. The encoder extracts information from the source sequence {x_i} into hidden representations {h}, and the decoder then produces the target sequence based on those representations (Fig. 1). Such models are trained end-to-end and use various training signals. For example, we can force the encoder to encode one sentence and the decoder to produce the next sentence from the same text. Alternatively, we can use the so-called "masked language model" approach, in which the sequence-to-sequence model is forced to predict intentionally deleted tokens of the source sequence. In both cases, we use the standard categorical cross-entropy between the distribution predicted by the network and the true distribution as the loss function (a minimal code sketch of such a model is given below).

Fig. 1. A simple sequence-to-sequence model
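To make the encoder-decoder structure concrete, the following is a minimal sketch of an LSTM sequence-to-sequence model in PyTorch. The class name, layer sizes, and the usage snippet are our illustrative assumptions, not the architecture of any specific cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder sketch (illustrative, not a reference implementation)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # The encoder compresses the source sequence into hidden representations h.
        _, (h, c) = self.encoder(self.embed(src_tokens))
        # The decoder generates the target conditioned on those representations
        # (teacher forcing: ground-truth target tokens are fed as inputs).
        dec_out, _ = self.decoder(self.embed(tgt_tokens), (h, c))
        return self.out(dec_out)  # unnormalized token scores at each step

# Usage sketch with teacher forcing and categorical cross-entropy:
# model = Seq2Seq(vocab_size=10000)
# logits = model(src, tgt[:, :-1])                                  # predict next tokens
# loss = F.cross_entropy(logits.reshape(-1, 10000), tgt[:, 1:].reshape(-1))
```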
However, all these approaches lack one crucial property: controllability. By controllability we mean the ability to change attributes of the generated text such as sentiment, length, or complexity. The models described above are conditioned only on the text they have seen previously, which is neither an interpretable nor a predictable control parameter. Furthermore, the space of hidden representations of such models is not smooth [7], which means that we cannot interpolate in the latent space to discover dependencies between hidden representations and the generated text. Another problem is that such representations capture information about text attributes together with the content, whereas we want to manipulate the attributes alone. These problems limit the usage of such models in modern applications like dialog systems or question-answering systems.

It is also worth noting that transformer-like models currently outperform older LSTM-based models, but they are harder to train and require much more training data and computational resources. Hence, we focus on LSTM-based models. Moreover, with respect to our problem there is no fundamental difference between LSTM-based and transformer-based models, so all the methods developed for LSTMs can be transferred to transformer-like models.

1.2 Useful Approaches from the Vision Domain

There has been considerable progress toward controllable generation in the vision domain. The VAE [8] extends a classical auto-encoder with a probabilistic formulation and gives the ability to control generation by exploring the latent space. For this purpose, we define a latent variable z ∼ p(z) with some prior distribution (typically Gaussian). Then we define a conditional likelihood x ∼ p_θ(x|z) (typically a Gaussian whose mean and variance are produced by a neural network with parameters θ). The data likelihood is

    p_θ(x) = ∫ p_θ(x|z) p(z) dz,                                                (1)

which is intractable, so we cannot optimize it directly. However, there is a solution: we introduce an approximate posterior distribution q_φ(z|x), parameterized by a neural network with parameters φ (Fig. 2). This allows us to derive a tractable lower bound on the data likelihood, which we can optimize with gradient descent:

    L = E_{z∼q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)).                   (2)

Now we can encode a source sample into the latent space, tweak the latents, and decode the result (a minimal code sketch of this objective is given at the end of Section 2).

Fig. 2. VAE architecture with discriminator

2 Problem Setting

To make text generation more controllable, we want to incorporate the VAE-like approach from the vision domain into text generation. At first glance this looks straightforward, but we face two problems:

1. The expressive power of the VAE is limited by the restriction we put on the posterior distribution [9].
2. The VAE often suffers from posterior collapse: a strong decoder tends to ignore the latent codes during generation [7].

Furthermore, even if we solve these two problems and successfully incorporate the VAE, we still face another crucial problem: we have extended the sequence-to-sequence model with a meaningful latent space, but the latent codes are highly entangled, so it is hard to change each attribute separately. Moreover, such latents also capture content information, which is undesirable. Therefore, we need to find a way to make those representations disentangled.
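Before turning to related work, Eq. (2) can be made concrete with a short sketch of a Gaussian-posterior VAE loss. The layer sizes, module names, and the MSE reconstruction term are illustrative assumptions; for text, the reconstruction term would be a token-level cross-entropy instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Sketch of the standard VAE objective from Eq. (2), assuming a Gaussian posterior."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ q_phi(z|x) in a differentiable way.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_hat = self.dec(z)
        # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); we minimize its negative.
        rec = F.mse_loss(x_hat, x, reduction="sum")        # -log p(x|z) up to constants
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl
```

After training, controllable generation amounts to encoding a sample, shifting its latent z along an interpretable direction, and decoding the modified latent.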
3 Related Work

3.1 Dealing with Entanglement in a Supervised Way

To address the entanglement of latent codes, we can extend the VAE model with an additional discriminator network [10]. In that work, the authors augment the latent code z with an additional part c: z is responsible for encoding content information, as in the classic approach, while c is forcefully disentangled so that each of its components captures attribute information. It works as follows: the encoder produces a latent pair (z, c), the decoder generates a sample x̂, and this sample is encoded again to obtain ĉ. The discriminator is used to distinguish between c and ĉ, and its signal is used to update the weights of the decoder (Fig. 3).

Fig. 3. VAE architecture with a discriminator

The authors also propose a method to deal with the discrete nature of text. At each step, the decoder produces a probability distribution over tokens parameterized by a softmax, and the token with the highest probability is selected. For discriminator training, we can instead keep this continuous distribution and control it with a temperature parameter τ:

    x̂_t = softmax(h_t / τ).                                                     (3)

There are three big problems with this approach. First, we need a separate discriminator for each attribute, so the complexity of the model grows significantly as we add new attributes. Second, we need data to pre-train the discriminators, and for some attributes, such as complexity, such data may be hard to obtain. Third, it offers no solution to the limited expressive capability of the Gaussian posterior or to the posterior collapse problem.

3.2 Dealing with VAE Problems in an Unsupervised Way

In another work [11], the authors present a fully unsupervised approach that attacks all three problems. They propose sample-based representations, which are more expressive than a Gaussian posterior, and call their approach Implicit VAE (iVAE). Instead of an explicit Gaussian, they define a sampling mechanism and represent the distribution produced by the encoder through its samples:

    z = q_θ(x, ε),  ε ∼ q(ε),                                                   (4)

where q(ε) is a Gaussian noise distribution and q_θ is applied to the concatenation of the encoder hidden state and ε. In this case the KL divergence KL(q_φ(z|x) || p(z)) becomes intractable, but it can be represented in a dual form:

    E_{z∼q_φ(z|x)} v_ψ(x, z) − E_{z∼p(z)} exp(v_ψ(x, z)).                        (5)

The final loss function then looks as follows:

    L = E_{z∼q_φ(z|x)} log p_θ(x, z) − E_{z∼q_φ(z|x)} v_ψ(x, z) + E_{z∼p(z)} exp(v_ψ(x, z)).   (6)

The authors also describe a solution to the posterior collapse problem. Posterior collapse means that a strong decoder ignores the dependency on the latent codes; as a result, the distribution produced by the encoder, q_φ(z|x), exactly matches p(z). To overcome this problem, the authors propose a stronger regularization of the latent space, replacing the KL divergence used previously with a new one:

    L_MI = KL(q_φ(z) || p(z)),                                                  (7)

where q_φ(z) = ∫ q_φ(z|x) q(x) dx is the aggregated posterior. This approach is called Implicit VAE with mutual information. These improvements solve the VAE-specific problems and allow latent codes to be learned in a fully unsupervised way, but the resulting representations are still entangled.
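As a rough illustration of the implicit posterior of Eq. (4) and the sample-based dual form of the KL term in Eqs. (5)-(6), the sketch below draws z by passing the encoder state concatenated with Gaussian noise through a small network and scores (x, z) pairs with an auxiliary network v_ψ. All module shapes and names are our illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

h_dim, z_dim, eps_dim = 256, 32, 32

# Implicit posterior: z = q_theta(x, eps), eps ~ N(0, I)  (Eq. 4)
sampler = nn.Sequential(nn.Linear(h_dim + eps_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
# Auxiliary network v_psi(x, z) used in the dual form of the KL term (Eq. 5)
v_psi = nn.Sequential(nn.Linear(h_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))

def sample_z(h_x):
    """Draw a sample from the implicit posterior given the encoder state h_x."""
    eps = torch.randn(h_x.size(0), eps_dim)
    return sampler(torch.cat([h_x, eps], dim=-1))

def kl_dual_estimate(h_x):
    """Monte Carlo estimate of the dual form of KL(q(z|x) || p(z)) from Eq. (5)."""
    z_q = sample_z(h_x)                        # z ~ q_phi(z|x), implicit posterior samples
    z_p = torch.randn(h_x.size(0), z_dim)      # z ~ p(z), standard Gaussian prior
    return (v_psi(torch.cat([h_x, z_q], dim=-1))
            - torch.exp(v_psi(torch.cat([h_x, z_p], dim=-1)))).mean()
```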
3.3 Dealing with Entanglement: Partially Solving VAE Problems in an Unsupervised Way

The problem of entanglement is attacked in [12]. The authors propose to use two different encoders and to split the latent code into two parts, z_1 and z_2. The first encoder is forced to capture the global variations in the data, which correspond to the attributes we want to control, while the second part captures content information useful for reconstruction. They then constrain the latent space of z_1 to have the following structure:

    z_1 = Σ_{i=1}^{K} p_i e_i,  Σ_{i=1}^{K} p_i = 1,                            (8)

where the e_i are learnable vectors and the weights p_i are obtained through a scoring procedure:

    p = softmax(W ẑ_1 + b),                                                     (9)

where ẑ_1 is a classic posterior sample obtained from q_{ψ1}(z|x). In other words, we learn a set of basis vectors and obtain the latent code as a linear combination of them. In such a setting, the basis vectors e_i tend to capture global variations in the data, and it becomes easier for the decoder to generate sentences because the latent codes are simply combinations of these basis vectors.

This model is trained as a typical VAE, but to train the parameters W, b, and e_i an additional term is introduced:

    L_reg = E_{z_1∼q_{ψ1}(z_1|x)} [ (1/m) Σ_{i=1}^{m} max(0, 1 − ẑ_1ᵀ z_1 + ẑ_1ᵀ u_i) ],   (10)

where m is the number of samples from the data and the u_i are the latent codes of those samples. However, this loss alone cannot enforce orthogonality of the basis vectors, so one more term is introduced:

    L_orth = ‖Eᵀ E − I‖.                                                        (11)

The authors show that with such a structural constraint posterior collapse is unlikely. However, the problem of the limited expressive capability of the VAE remains, and the additional constraint may limit this capability even further.

4 Research Goal and Evaluation

The main goal of the master thesis is to empirically evaluate the approaches described above and to combine them into a model capable of solving all the problems defined in the problem setting. We then want to explore the latent space and discover which attributes of the text the model was able to capture.

The proposed model will consist of an LSTM encoder and decoder with a VAE mechanism between them, an implicit sample-based posterior, and the structural constraint on the resulting latent. Let us break down the whole process into the following steps (a code sketch of the latent construction is given after this list):

1. First, we take a source sequence and encode it into two hidden vectors h_1 and h_2.
2. Then we add noise to those vectors to obtain the pairs (h_1, ε_1) and (h_2, ε_2), where ε is Gaussian noise (as described in Section 3.2).
3. Next, we propagate these vectors through MLPs to obtain ẑ_1 and z_2 (as described in Section 3.2).
4. Finally, we use ẑ_1 to calculate the scores p_i for the final latent:

    z_1 = Σ_{i=1}^{K} p_i e_i.                                                  (12)

5. Now we can use the concatenation (z_1, z_2) for further text generation.
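As a sketch of step 4 and of the structural constraint in Eqs. (8), (9), (11), and (12), the module below builds z_1 as a softmax-weighted combination of K learnable basis vectors and exposes the orthogonality penalty. The dimensions, names, and the convention that basis vectors are stored as rows are our illustrative assumptions.

```python
import torch
import torch.nn as nn

class BasisLatent(nn.Module):
    """z_1 as a convex combination of K learnable basis vectors (Eqs. 8, 9, 12)."""
    def __init__(self, z_dim=32, num_bases=10):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, z_dim))  # rows are e_1, ..., e_K
        self.score = nn.Linear(z_dim, num_bases)                  # W and b from Eq. (9)

    def forward(self, z1_hat):
        p = torch.softmax(self.score(z1_hat), dim=-1)  # mixture weights p_i, summing to 1
        z1 = p @ self.bases                            # z_1 = sum_i p_i * e_i
        return z1, p

    def orthogonality_penalty(self):
        # Frobenius norm of (E E^T - I): pushes the basis vectors toward orthonormality (Eq. 11).
        gram = self.bases @ self.bases.t()
        return torch.norm(gram - torch.eye(self.bases.size(0)))
```

In the full pipeline, ẑ_1 produced by the first encoder branch would be fed to this module, and the concatenation (z_1, z_2) would condition the decoder.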
Evaluation of this approach will be done by solving the style transfer problem on the Yelp dataset. We will measure:

1. Content preservation (BLEU);
2. Style transfer strength (supervised classifiers);
3. Fluency and grammatical correctness (perplexity under the GPT-2 language model).

5 Research Plan

We plan to organize further work in the following way:

1. Implement and test the unsupervised approach based on Implicit VAE with MI regularization.
2. Implement and test the unsupervised approach based on latents formed as linear combinations of basis vectors, which capture global variation in the data.
3. Add implicit latent learning to the second approach.
4. Explore the latent space and find the attributes the model has captured.
5. In case of success, extend these models to transformer-like architectures with more powerful encoders and decoders.

6 Conclusion

In this master's thesis proposal, we gave an overview of the current state of the field of text generation and described the VAE, which is used for controllable generation in the vision domain and is applicable in the text domain. We identified the most crucial problems: issues with the VAE itself (its limited expressiveness and posterior collapse) and the entanglement of latents. We then reviewed related work and proposed a potential solution based on the combination of an implicit posterior distribution and a constraint on the resulting latent in the form of a linear combination of basis vectors. We believe that this combination can increase the degree of controllability and the quality of the resulting samples.

References

1. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2015)
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8) (2019)
6. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122 (2017)
7. Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015)
8. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
9. Cremer, C., Li, X., Duvenaud, D.: Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558 (2018)
10. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Toward controlled generation of text. arXiv preprint arXiv:1703.00955 (2017)
11. Fang, L., Li, C., Gao, J., Dong, W., Chen, C.: Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527 (2019)
12. Xu, P., Cao, Y., Cheung, J.C.K.: Unsupervised controllable text generation with global variation discovery and disentanglement. arXiv preprint arXiv:1905.11975 (2019)