Enhancing Controllability of Text Generation

Anton Shcherbyna1 and Kostiantyn Omelianchuk2

1 Ukrainian Catholic University, Faculty of Applied Sciences, Lviv, Ukraine
a.shcherbyna@ucu.edu.ua
2 Grammarly, Kyiv, Ukraine
komelianchuk@gmail.com

Abstract. Many models are used to generate text conditioned on some context. However, these approaches do not provide the ability to control various aspects of the generated text, such as style, tone, language, tense, sentiment, length, and grammaticality. In this work, we explore unsupervised ways to learn disentangled vector representations of sentences with interpretable components and attempt to generate text in a controllable manner based on the obtained representations.

Keywords: natural language processing · natural language understanding · representation learning · text generation · unsupervised learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of the 1st Masters Symposium on Advances in Data Mining, Machine Learning, and Computer Vision (MS-AMLV 2019), Lviv, Ukraine, November 15-16, 2019, pp. 98-105.

1 Introduction

1.1 Text Generation Overview

In recent years, there has been significant advancement in the field of text generation. In 2014, sequence-to-sequence models with an LSTM encoder and decoder were proposed [1]. This approach became the state of the art in the field and was successfully applied to various tasks, e.g., machine translation. However, LSTM networks tend to forget information from earlier parts of the sequence, so the next significant improvement, the attention mechanism, was proposed [2]. Its main idea is to give the decoder direct access to the information from each token of the source sequence and to score each piece of information by its usefulness for the decoder. Finally, a purely attentional model, called the Transformer, was proposed [3]. Since then, transformer-like models have become the state-of-the-art methods in text representation learning and text generation. For example, BERT, released by Google [4], became the standard for extracting representations from text, and GPT-2, developed by OpenAI [5], became the most powerful tool for text generation. In the case of GPT-2, the authors initially released weights only for a small model with limited capabilities, arguing that the full model produces texts of such high quality that it could be misused to produce fakes.

All these models have a similar structure. A typical text generation model consists of an encoder E_θ(x) and a decoder D_φ(h). Both the encoder and the decoder can be represented by a deep neural network: an LSTM [1], a CNN [6], or stacked feed-forward and attention layers, which form a transformer-like model [3]. The encoder extracts information from the source sequence {x_i} into hidden representations {h}, and the decoder then produces the target sequence based on those representations (Fig. 1). Such models are trained end-to-end and use various training signals. For example, we can force the encoder to encode one sentence and the decoder to produce the next sentence from the same text. Alternatively, we can use the so-called "masked language model" approach, in which the sequence-to-sequence model is forced to predict intentionally deleted tokens of the source sequence. In both cases, we use the standard categorical cross-entropy between the distribution predicted by the network and the true distribution as the loss function (a minimal code sketch of such a model is given below).

Fig. 1. A simple sequence-to-sequence model
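To make the encoder-decoder structure concrete, the following is a minimal sketch of an LSTM sequence-to-sequence model in PyTorch. The class name, layer sizes, and the usage snippet are our illustrative assumptions, not the architecture of any specific cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder sketch (illustrative, not a reference implementation)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # The encoder compresses the source sequence into hidden representations h.
        _, (h, c) = self.encoder(self.embed(src_tokens))
        # The decoder generates the target conditioned on those representations
        # (teacher forcing: ground-truth target tokens are fed as inputs).
        dec_out, _ = self.decoder(self.embed(tgt_tokens), (h, c))
        return self.out(dec_out)  # unnormalized token scores at each step

# Usage sketch with teacher forcing and categorical cross-entropy:
# model = Seq2Seq(vocab_size=10000)
# logits = model(src, tgt[:, :-1])                                  # predict next tokens
# loss = F.cross_entropy(logits.reshape(-1, 10000), tgt[:, 1:].reshape(-1))
```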
However, all these approaches lack one crucial property: controllability. By controllability we mean the ability to change attributes of the generated text such as sentiment, length, or complexity. The models described above are conditioned only on the text they have seen previously, which is neither an interpretable nor a predictable control parameter. Furthermore, the space of hidden representations of such models is not smooth [7], which means that we cannot interpolate in the latent space to discover dependencies between hidden representations and the generated text. Another problem is that such representations capture information about text attributes together with the content, whereas we want to manipulate the attributes alone. These problems limit the usage of such models in modern applications like dialog systems or question-answering systems.

It is also worth noting that transformer-like models currently outperform older LSTM-based models, but they are harder to train and require much more training data and computational resources. Hence, we focus on LSTM-based models. Moreover, with respect to our problem there is no fundamental difference between LSTM-based and transformer-based models, so all the methods developed for LSTMs can be transferred to transformer-like models.

1.2 Useful Approaches from the Vision Domain

There has been considerable progress toward controllable generation in the vision domain. The VAE [8] extends a classical auto-encoder with a probabilistic formulation and gives the ability to control generation by exploring the latent space. For this purpose, we define a latent variable z ∼ p(z) with some prior distribution (typically Gaussian). Then we define a conditional likelihood x ∼ p_θ(x|z) (typically a Gaussian whose mean and variance are produced by a neural network with parameters θ). The data likelihood is

    p_θ(x) = ∫ p_θ(x|z) p(z) dz,                                                (1)

which is intractable, so we cannot optimize it directly. However, there is a solution: we introduce an approximate posterior distribution q_φ(z|x), parameterized by a neural network with parameters φ (Fig. 2). This allows us to derive a tractable lower bound on the data likelihood, which we can optimize with gradient descent:

    L = E_{z∼q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)).                   (2)

Now we can encode a source sample into the latent space, tweak the latents, and decode the result (a minimal code sketch of this objective is given at the end of Section 2).

Fig. 2. VAE architecture with discriminator

2 Problem Setting

To make text generation more controllable, we want to incorporate the VAE-like approach from the vision domain into text generation. At first glance this looks straightforward, but we face two problems:

1. The expressive power of the VAE is limited by the restriction we put on the posterior distribution [9].
2. The VAE often suffers from posterior collapse: a strong decoder tends to ignore the latent codes during generation [7].

Furthermore, even if we solve these two problems and successfully incorporate the VAE, we still face another crucial problem: we have extended the sequence-to-sequence model with a meaningful latent space, but the latent codes are highly entangled, so it is hard to change each attribute separately. Moreover, such latents also capture content information, which is undesirable. Therefore, we need to find a way to make those representations disentangled.
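Before turning to related work, Eq. (2) can be made concrete with a short sketch of a Gaussian-posterior VAE loss. The layer sizes, module names, and the MSE reconstruction term are illustrative assumptions; for text, the reconstruction term would be a token-level cross-entropy instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Sketch of the standard VAE objective from Eq. (2), assuming a Gaussian posterior."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ q_phi(z|x) in a differentiable way.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_hat = self.dec(z)
        # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); we minimize its negative.
        rec = F.mse_loss(x_hat, x, reduction="sum")        # -log p(x|z) up to constants
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl
```

After training, controllable generation amounts to encoding a sample, shifting its latent z along an interpretable direction, and decoding the modified latent.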
3 Related Work

3.1 Dealing with Entanglement in a Supervised Way

To address the entanglement of latent codes, we can extend the VAE model with an additional discriminator network [10]. In that work, the authors augment the latent code z with an additional part c: z is responsible for encoding content information, as in the classic approach, while c is forcefully disentangled so that each of its components captures attribute information. It works as follows: the encoder produces a latent pair (z, c), the decoder generates a sample x̂, and this sample is encoded again to obtain ĉ. The discriminator is used to distinguish between c and ĉ, and its signal is used to update the weights of the decoder (Fig. 3).

Fig. 3. VAE architecture with a discriminator

The authors also propose a method to deal with the discrete nature of text. At each step, the decoder produces a probability distribution over tokens parameterized by a softmax, and the token with the highest probability is selected. For discriminator training, we can instead keep this continuous distribution and control it with a temperature parameter τ:

    x̂_t = softmax(h_t / τ).                                                     (3)

There are three big problems with this approach. First, we need a separate discriminator for each attribute, so the complexity of the model grows significantly as we add new attributes. Second, we need data to pre-train the discriminators, and for some attributes, such as complexity, such data may be hard to obtain. Third, it offers no solution to the limited expressive capability of the Gaussian posterior or to the posterior collapse problem.

3.2 Dealing with VAE Problems in an Unsupervised Way

In another work [11], the authors present a fully unsupervised approach that attacks all three problems. They propose sample-based representations, which are more expressive than a Gaussian posterior, and call their approach Implicit VAE (iVAE). Instead of an explicit Gaussian, they define a sampling mechanism and represent the distribution produced by the encoder through its samples:

    z = q_θ(x, ε),  ε ∼ q(ε),                                                   (4)

where q(ε) is a Gaussian noise distribution and q_θ is applied to the concatenation of the encoder hidden state and ε. In this case the KL divergence KL(q_φ(z|x) || p(z)) becomes intractable, but it can be represented in a dual form:

    E_{z∼q_φ(z|x)} v_ψ(x, z) − E_{z∼p(z)} exp(v_ψ(x, z)).                        (5)

The final loss function then looks as follows:

    L = E_{z∼q_φ(z|x)} log p_θ(x, z) − E_{z∼q_φ(z|x)} v_ψ(x, z) + E_{z∼p(z)} exp(v_ψ(x, z)).   (6)

The authors also describe a solution to the posterior collapse problem. Posterior collapse means that a strong decoder ignores the dependency on the latent codes; as a result, the distribution produced by the encoder, q_φ(z|x), exactly matches p(z). To overcome this problem, the authors propose a stronger regularization of the latent space, replacing the KL divergence used previously with a new one:

    L_MI = KL(q_φ(z) || p(z)),                                                  (7)

where q_φ(z) = ∫ q_φ(z|x) q(x) dx is the aggregated posterior. This approach is called Implicit VAE with mutual information. These improvements solve the VAE-specific problems and allow latent codes to be learned in a fully unsupervised way, but the resulting representations are still entangled.
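As a rough illustration of the implicit posterior of Eq. (4) and the sample-based dual form of the KL term in Eqs. (5)-(6), the sketch below draws z by passing the encoder state concatenated with Gaussian noise through a small network and scores (x, z) pairs with an auxiliary network v_ψ. All module shapes and names are our illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

h_dim, z_dim, eps_dim = 256, 32, 32

# Implicit posterior: z = q_theta(x, eps), eps ~ N(0, I)  (Eq. 4)
sampler = nn.Sequential(nn.Linear(h_dim + eps_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
# Auxiliary network v_psi(x, z) used in the dual form of the KL term (Eq. 5)
v_psi = nn.Sequential(nn.Linear(h_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))

def sample_z(h_x):
    """Draw a sample from the implicit posterior given the encoder state h_x."""
    eps = torch.randn(h_x.size(0), eps_dim)
    return sampler(torch.cat([h_x, eps], dim=-1))

def kl_dual_estimate(h_x):
    """Monte Carlo estimate of the dual form of KL(q(z|x) || p(z)) from Eq. (5)."""
    z_q = sample_z(h_x)                        # z ~ q_phi(z|x), implicit posterior samples
    z_p = torch.randn(h_x.size(0), z_dim)      # z ~ p(z), standard Gaussian prior
    return (v_psi(torch.cat([h_x, z_q], dim=-1))
            - torch.exp(v_psi(torch.cat([h_x, z_p], dim=-1)))).mean()
```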
3.3 Dealing with Entanglement: Partially Solving VAE Problems in an Unsupervised Way

The problem of entanglement is attacked in [12]. The authors propose to use two different encoders and to split the latent code into two parts, z_1 and z_2. The first encoder is forced to capture the global variations in the data, which correspond to the attributes we want to control, while the second part captures content information useful for reconstruction. They then constrain the latent space of z_1 to have the following structure:

    z_1 = Σ_{i=1}^{K} p_i e_i,  Σ_{i=1}^{K} p_i = 1,                            (8)

where the e_i are learnable vectors and the weights p_i are obtained through a scoring procedure:

    p = softmax(W ẑ_1 + b),                                                     (9)

where ẑ_1 is a classic posterior sample obtained from q_{ψ1}(z|x). In other words, we learn a set of basis vectors and obtain the latent code as a linear combination of them. In such a setting, the basis vectors e_i tend to capture global variations in the data, and it becomes easier for the decoder to generate sentences because the latent codes are simply combinations of these basis vectors.

This model is trained as a typical VAE, but to train the parameters W, b, and e_i an additional term is introduced:

    L_reg = E_{z_1∼q_{ψ1}(z_1|x)} [ (1/m) Σ_{i=1}^{m} max(0, 1 − ẑ_1ᵀ z_1 + ẑ_1ᵀ u_i) ],   (10)

where m is the number of samples from the data and the u_i are the latent codes of those samples. However, this loss alone cannot enforce orthogonality of the basis vectors, so one more term is introduced:

    L_orth = ‖Eᵀ E − I‖.                                                        (11)

The authors show that with such a structural constraint posterior collapse is unlikely. However, the problem of the limited expressive capability of the VAE remains, and the additional constraint may limit this capability even further.

4 Research Goal and Evaluation

The main goal of the master thesis is to empirically evaluate the approaches described above and to combine them into a model capable of solving all the problems defined in the problem setting. We then want to explore the latent space and discover which attributes of the text the model was able to capture.

The proposed model will consist of an LSTM encoder and decoder with a VAE mechanism between them, an implicit sample-based posterior, and the structural constraint on the resulting latent. Let us break down the whole process into the following steps (a code sketch of the latent construction is given after this list):

1. First, we take a source sequence and encode it into two hidden vectors h_1 and h_2.
2. Then we add noise to those vectors to obtain the pairs (h_1, ε_1) and (h_2, ε_2), where ε is Gaussian noise (as described in Section 3.2).
3. Next, we propagate these vectors through MLPs to obtain ẑ_1 and z_2 (as described in Section 3.2).
4. Finally, we use ẑ_1 to calculate the scores p_i for the final latent:

    z_1 = Σ_{i=1}^{K} p_i e_i.                                                  (12)

5. Now we can use the concatenation (z_1, z_2) for further text generation.
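As a sketch of step 4 and of the structural constraint in Eqs. (8), (9), (11), and (12), the module below builds z_1 as a softmax-weighted combination of K learnable basis vectors and exposes the orthogonality penalty. The dimensions, names, and the convention that basis vectors are stored as rows are our illustrative assumptions.

```python
import torch
import torch.nn as nn

class BasisLatent(nn.Module):
    """z_1 as a convex combination of K learnable basis vectors (Eqs. 8, 9, 12)."""
    def __init__(self, z_dim=32, num_bases=10):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, z_dim))  # rows are e_1, ..., e_K
        self.score = nn.Linear(z_dim, num_bases)                  # W and b from Eq. (9)

    def forward(self, z1_hat):
        p = torch.softmax(self.score(z1_hat), dim=-1)  # mixture weights p_i, summing to 1
        z1 = p @ self.bases                            # z_1 = sum_i p_i * e_i
        return z1, p

    def orthogonality_penalty(self):
        # Frobenius norm of (E E^T - I): pushes the basis vectors toward orthonormality (Eq. 11).
        gram = self.bases @ self.bases.t()
        return torch.norm(gram - torch.eye(self.bases.size(0)))
```

In the full pipeline, ẑ_1 produced by the first encoder branch would be fed to this module, and the concatenation (z_1, z_2) would condition the decoder.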
Evaluation of this approach will be done by solving the style transfer problem on the Yelp dataset. We will measure:

1. Content preservation (BLEU);
2. Style transfer strength (supervised classifiers);
3. Fluency and grammatical correctness (perplexity under the GPT-2 language model).

5 Research Plan

We plan to organize further work in the following way:

1. Implement and test the unsupervised approach based on Implicit VAE with MI regularization.
2. Implement and test the unsupervised approach based on latents formed as linear combinations of basis vectors, which capture global variation in the data.
3. Add implicit latent learning to the second approach.
4. Explore the latent space and find the attributes the model has captured.
5. In case of success, extend these models to transformer-like architectures with more powerful encoders and decoders.

6 Conclusion

In this master's thesis proposal, we gave an overview of the current state of the field of text generation and described the VAE, which is used for controllable generation in the vision domain and is applicable in the text domain. We identified the most crucial problems: issues with the VAE itself (its limited expressiveness and posterior collapse) and the entanglement of latents. We then reviewed related work and proposed a potential solution based on the combination of an implicit posterior distribution and a constraint on the resulting latent in the form of a linear combination of basis vectors. We believe that this combination can increase the degree of controllability and the quality of the resulting samples.

References

1. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2015)
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8) (2019)
6. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122 (2017)
7. Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015)
8. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
9. Cremer, C., Li, X., Duvenaud, D.: Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558 (2018)
10. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Toward controlled generation of text. arXiv preprint arXiv:1703.00955 (2017)
11. Fang, L., Li, C., Gao, J., Dong, W., Chen, C.: Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527 (2019)
12. Xu, P., Cao, Y., Cheung, J.C.K.: Unsupervised controllable text generation with global variation discovery and disentanglement. arXiv preprint arXiv:1905.11975 (2019)