Correcting Linguistic Training Bias in an FAQ-bot using LSTM-VAE

Mayur Patidar, Puneet Agarwal, Lovekesh Vig, and Gautam Shroff
TCS Research, New Delhi.
{patidar.mayur,puneet.a,lovekesh.vig,gautam.shroff}@tcs.com

In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Skopje, Macedonia, 2017. Copyright © by the paper's authors. Copying only for private and academic purposes.

Abstract. We consider an automated assistant that has been deployed in a large multinational organization to answer employee FAQs. The system is based on an LSTM classifier that has been trained on a corpus of questions and answers carefully prepared by a small team of domain experts. We find that linguistic training bias creeps into the manually created training data because specific phrases are used with little or no variation, which biases the deep-learning classifier towards incorrect features. Further, the FAQs as envisaged by the trainers are often incomplete, and transferring linguistic variations across question-answer pairs can uncover new question classes for which answers are missing. In this paper we demonstrate that an LSTM-based variational autoencoder (VAE) can be used to automatically generate linguistically novel questions, which (a) correct classifier bias when added to the training set, (b) uncover incompleteness in the set of answers, and (c) improve the accuracy and generalization ability of the base LSTM classifier, enabling it to learn from smaller training sets.

Keywords: Variational Autoencoders, Classification, Language Modelling, FAQ-bot

1 Introduction

The recent successes of deep learning techniques for NLP have seemingly obviated the need for careful crafting of features based on linguistic properties. Deep models purportedly can learn the features required for a task directly from data, provided it comes in sufficient volume and variety, typically obtained in 'natural' settings such as the web. However, in the practical applications we describe here, training data is far more limited and is often created manually, in a curated manner, specifically for the task at hand. We show that linguistic training bias often creeps into such training data and degrades the performance of deep models. Further, manual curation can result in an incomplete task specification due to insufficient linguistic variation in the training data.

We have created and deployed a chatbot for the employees of our large organization, which answers human resource (HR) policy related questions in natural language. A deep-learning model in the form of a Long Short-Term Memory (LSTM) [11] classifier was used for mapping questions to classes, with each class having a manually curated answer. A small team of HR officers (5 members) manually created the training data used to train this classifier. We observed that the chosen deep-learning technique performed better than traditional approaches on a given test set (also created by the same HR officers). However, the deep model also sometimes classified a user query using clearly irrelevant features (i.e., words). For example, the question "when my sick leave gets credited ?" was classified into a category related to 'Adoption Leave', resulting in a completely irrelevant answer.
Upon analysis we discovered that this happens mainly because the words surrounding 'sick leave' (e.g., "gets") in the query occurred more often in the training data for 'Adoption Leave'. As a result, if such words occur in a user's query, the model ignores other important words (such as 'sick leave') and classifies the query into an incorrect class on the basis of these surrounding words. We refer to this phenomenon as linguistic training bias; it creeps in merely because the templates used for different classes are not exhaustive. Such examples led us to fear that the system would not generalize well when exposed to real users.

More generally, relying on human curation often results in such linguistic training biases creeping into the training data, since every individual has a specific style of writing natural language and uses some words only in specific contexts. Deep models end up learning these biases instead of the core concept words of the target classes.

In order to correct these biases we automatically generate meaningful sentences using a generative model, and then use them for training the classification model after suitable human annotation. We use a Variational Autoencoder (VAE) [15] as our generative model for generating novel sentences and a Language Model (LM) [18] for selecting sentences based on likelihood. We model the VAE using RNNs comprising LSTM units. As we demonstrate in Section 6, this approach gives us a gain of about 9% in classification accuracy when tested with core concept words only.

Further, the HR officers created the classes and the corresponding questions based on their experience of which questions are frequently asked by employees of the organization. However, when a chatbot is available, users tend to ask additional questions that they would not have asked otherwise. It is therefore imperative to have broader coverage of such classes. If novel classes of questions, generated automatically, can be shown to the HR officers, it becomes easier for them to accept or reject the proposed classes than to imagine all such possibilities themselves. As we demonstrate in Section 6.2, our approach generated many new classes that were accepted by the HR officers for deployment.

More specifically, the use of a deep generative model to augment training sentences resulted in the following benefits, which form the key contributions of this paper:

1. Augmenting the training data with automatically generated sentences corrects over-fitting due to linguistic training bias. To show this we present results on 'concept words', indicating potentially better generalization.
2. The newly generated sentences sometimes belonged to completely new classes not present in the original training data: 33 new classes were found in practice.
3. Augmenting the training data with automatically generated sentences improves the accuracy of the deep-learning classifier by 2%.

We also present an improved approach for training VAEs, weighted cost annealing, based on our experience of training an LSTM-VAE on a real-life dataset.

2 Related Work

VAEs have found widespread usage in image generation [9, 8], in text modeling [3, 17], and in recent work on sketch drawing [10]. Researchers have also used VAEs for semi-supervised learning [14, 26]. [3] used a VAE and reported positive results for novel sentence generation and missing-word imputation, and shared techniques for training VAEs efficiently.
They also reported negative results on the language modeling task. Dilated CNNs (Convolutional Neural Networks) and combinations of RNNs and CNNs have been used to model the VAE encoder and decoder in [27, 23]. We use a VAE architecture similar to [3] with some modification in the training procedure, referred to here as weighted cost annealing.

As shown in [4, 7], deep learning models misclassify perturbed input samples, including adversarial examples, with high confidence. Several methods have been proposed for training a robust model, for example dropout [24] and training with noisy samples [16]. In the text domain, [16] have shown that augmenting the training data with noisy training sentences leads to better classification performance. They used WordNet [19] and the counter-fitting method described in [20] to generate semantically noisy training sentences, and a deep linguistic parser [5] together with sentence compression techniques to generate syntactically noisy ones. Unlike [16], we augment the training data with sentences generated by an LSTM-VAE. We have observed that the sentences generated by the LSTM-VAE are not always semantically and syntactically correct, i.e., it also acts as a noisy sentence generator. Importantly, we observe an improvement in classification accuracy after adding the generated sentences.

3 Problem Description

We assume that a dataset of frequently asked questions for building a chatbot comprises sets of semantically similar questions $Q_i = \{q_1, \ldots, q_{n_i}\}$ and their corresponding answer $a_i$. A set of such questions $Q_i$ and the corresponding answer $a_i$ are collectively referred to as a query set $s_i = \{Q_i, a_i\}$. The questions of a query set $s_i$ are denoted $Q_i = Q(s_i)$. We assume that the dataset D comprises many such query sets, i.e., $D = \{s_1, \ldots, s_m\}$. In our chatbot implementation, given a user's query q, the objective is to select the corresponding query set s via a multi-class classification model, so that the corresponding answer a can be shown.

Given all the questions in the training data D, $Q = \bigcup_{s_i \in D} Q(s_i)$, we intend to generate new questions $Q'$ using the LSTM-VAE. Some of the questions in $Q'$ are semantically similar to one of the query sets of D, while the remaining questions do not belong to any of the existing classes. Using these newly generated queries $Q'$, we present the results of a study comprising: a) an analysis of the new query sets $s'$ and the results of their review by HR domain experts, b) a comparison of classifiers trained on $Q$ and on $Q_{new}$ ($= \bigcup \{Q, Q'\}$, where $\forall q \in Q_{new}\ \exists i$ s.t. $q \in s_i$), evaluated on core concept words rather than on the full natural language questions, and c) an analysis of how the accuracy of the classification model improves.
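For concreteness, the following is a minimal sketch of one possible in-memory representation of the dataset described above; the class and field names (QuerySet, questions, answer) and the sample entries are illustrative assumptions, not artifacts of the deployed system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuerySet:
    """One query set s_i = {Q_i, a_i}: semantically similar questions plus their shared answer."""
    questions: List[str]   # Q_i = {q_1, ..., q_{n_i}}
    answer: str            # a_i

# The dataset D = {s_1, ..., s_m}. Classifying a user query q amounts to
# predicting the index i of its query set, after which answer a_i is shown.
D: List[QuerySet] = [
    QuerySet(questions=["where can i apply earned leave ?",
                        "how do i apply for earned leave ?"],
             answer="Earned leave can be applied for on the leave portal ..."),
]

# Q: the union of all training questions, labelled with their query-set index.
Q = [(q, i) for i, s in enumerate(D) for q in s.questions]
```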
4 Background

Recurrent neural networks (RNNs) are well suited to modeling sequential dependencies in the input, e.g., in language modeling, where the prediction of a word depends on the previous words in the sentence. Vanilla RNNs generally do not handle long-term dependencies in the input due to the vanishing gradient problem [2], which is overcome by using RNNs with LSTM cells in the hidden layers. In this section, we describe the components of our approach: an LSTM-VAE and a language model (LM).

4.1 Sequence Autoencoder

Autoencoders are neural networks used for learning lower-dimensional representations of data, referred to as the encoding or latent representation z. Autoencoders comprise two components, an encoder ($z = f_{enc}(x)$) and a decoder ($\hat{x} = f_{dec}(z)$), where the encoder transforms the input x into z and the decoder tries to reconstruct x from z. Autoencoders are trained by minimizing the reconstruction error of the decoded data $\hat{x}$ with respect to the input x, i.e., $L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2$. The choice of encoder and decoder networks generally depends on the input data.

A sequence autoencoder uses RNNs to encode (as well as to decode) sequential input data, e.g., a sentence. Latent representations learned by a sequence autoencoder can also be used for classification; [6] use RNNs with LSTM units for both the encoder and the decoder of a sequence autoencoder. Sequence autoencoders cannot be used for the generation of novel samples, as they do not enforce a prior distribution on the latent representations. Also, as shown by [3], the encodings learned by a sequence autoencoder are unable to capture semantic features.

4.2 Variational Autoencoder

The VAE is a generative model which, unlike the sequence autoencoder, comprises a probabilistic encoder ($q_\phi(z|x)$, the recognition model) and a decoder ($p_\theta(x|z)$, the generative model). The posterior distribution $p_\theta(z|x)$ is known to be computationally intractable. The VAE approximates $p_\theta(z|x)$ with the encoder $q_\phi(z|x)$, which is assumed to be Gaussian and is parameterized by $\phi = \{\mu, \sigma\}$; the encoder learns to predict $\phi$. As a result, it becomes possible to draw samples from this distribution.

In order to decode a sample z drawn from $q_\phi(z|x)$ back to the input x, the reconstruction loss also needs to be minimized. The reconstruction loss is represented by $-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. If the VAE were trained like a sequence autoencoder, i.e., with the reconstruction loss only, it would not allow enough variance in $q_\phi(z|x)$; as a result the VAE would map the input x to a latent representation almost deterministically. This is avoided by minimizing the reconstruction loss together with the KL-divergence between $q_\phi(z|x)$ and the prior $p_\theta(z)$:

$$ L(\phi, \theta, x) = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) \;\ge\; -\log p_\theta(x) \qquad (1) $$

Here, $p_\theta(z)$ is assumed to be a multivariate Gaussian $\mathcal{N}(0, I)$. [15] showed that the negative of this loss is a variational lower bound on the log likelihood of x, i.e., $-L(\phi, \theta, x) \le \log p_\theta(x)$. Sampling z from $q_\phi(z|x)$ is a non-continuous operation, so the encoder cannot be trained via back-propagation directly. The re-parameterization trick introduced in [15] allows us to train the VAE using stochastic gradient descent via back-propagation. Novel samples can be generated by sampling z from $\mathcal{N}(0, I)$ and decoding the samples into the input space using the decoder model.

4.3 Language Model

An RNN language model (RNNLM) is a generative model which learns a conditional probability distribution over the vocabulary. It predicts the next word $w_{i+1}$ given the representation $h_i$ of the words seen so far and the current input $w_i$, via $p(w_{i+1} \mid h_i, w_i) = \mathrm{softmax}(W_s h_i + b_s)$, by maximizing the log likelihood of the next word averaged over the sequence length N. The cross-entropy loss used to train the language model is

$$ L_{CE\text{-}LM} = -\frac{1}{N} \sum_{i=1}^{N} \log p(w_{i+1} \mid h_i, w_i) \qquad (2) $$

The performance of an RNNLM is generally measured using perplexity (lower is better), $\mathrm{Perplexity} = \exp(L_{CE\text{-}LM})$. We use an RNNLM with LSTM units as described in [28].
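To make the objective in Eq. (1) and the re-parameterization trick concrete before they are used in Section 5, here is a minimal PyTorch sketch of a Gaussian-posterior VAE loss. It assumes an encoder that already produces $\mu$ and $\log \sigma^2$ for a batch, and it illustrates the standard formulation rather than the exact implementation used for the paper's experiments.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Draw z = mu + eps * sigma with eps ~ N(0, I), keeping the sampling step differentiable."""
    std = torch.exp(0.5 * logvar)   # sigma
    eps = torch.randn_like(std)     # eps ~ N(0, I)
    return mu + eps * std

def vae_loss(recon_logits, target_ids, mu, logvar, pad_id=0):
    """Negative ELBO of Eq. (1): token-level reconstruction loss + KL(q_phi(z|x) || N(0, I))."""
    # Reconstruction term: -E_q[log p_theta(x|z)], estimated with one sample of z.
    recon = F.cross_entropy(recon_logits.view(-1, recon_logits.size(-1)),
                            target_ids.view(-1),
                            ignore_index=pad_id, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```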
4.4 Classification

Classification can be considered a two-step process: the first step requires a representation of the data, and the second step uses this representation for classification. Data can be represented using a bag-of-words approach, which ignores word-order information, or using hand-crafted features, which fail to generalize across datasets and tasks. We instead learn a task-specific sentence representation using RNNs with LSTM units, representing a variable-length sentence as a fixed-length vector h obtained by passing the sentence through the RNN layer. We then apply a softmax over an affine transformation of h, i.e., $p(c \mid h) = \mathrm{softmax}(W_s h + b_s)$. To learn the weights of this model, we minimize the categorical cross-entropy loss

$$ L_{CE} = -\sum_{i=1}^{m} y_i \cdot \log p(c_i \mid h), $$

where $c_i$ is one of the m classes and $y_i$ is 1 for the target class and 0 otherwise.
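As an illustration of this classifier, the following is a minimal PyTorch sketch (the layer sizes shown are placeholders; the ranges actually tuned are listed in Table 3): a sentence is embedded, passed through a single LSTM layer, and the final hidden state h is mapped to class logits.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Single-layer LSTM sentence classifier: p(c | h) = softmax(W_s h + b_s)."""
    def __init__(self, vocab_size, num_classes, emb_dim=100, hidden_dim=128, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)     # affine transformation W_s h + b_s

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        emb = self.dropout(self.embedding(token_ids))     # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(emb)                      # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])                          # logits; softmax is applied inside the loss

# Training minimizes the categorical cross-entropy of Section 4.4:
# loss = nn.CrossEntropyLoss()(model(batch_token_ids), batch_labels)
```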
5 LSTM-VAE with LM

Fig. 1. LSTM-VAE Architecture

5.1 Variational Recurrent Autoencoder

Similar to [3], we use a single-layer RNN with LSTM units as both the encoder and the decoder of the VAE. We pass the variable-length input sentence to the encoder in reverse order, as shown in Figure 1. The words of a sentence are first converted into vector representations by the word embedding layer before being fed to the LSTM layer. The final hidden state of the LSTM, $h^e_{n+1}$, then passes through a feed-forward layer which predicts $\mu$ and $\sigma$ of the posterior distribution $q_\phi(z|x)$. After sampling z via the re-parameterization trick (explained later in this section), the sampled encoding z passes through a feed-forward layer to obtain $h^d_0$, the start state of the decoder RNN. We pass the encoding z as input to the decoder LSTM layer at every time-step. The vector representation of the word with the highest probability predicted at time t is also passed as input at the next time-step (t + 1), as shown in Figure 1.

Training the LSTM-VAE. It is not straightforward to train the LSTM-VAE using the loss function given in Eq. 1. As described in Section 4.2, if the KL-divergence loss term is not used, the model converges to a state that gives fixed latent representations. Similarly, if the reconstruction loss remains high, the model produces meaningless sentences. If equal weight is given to both loss terms, the model learns to encode the input sequence into the desired latent representation too well, i.e., the KL divergence (KLD) between the approximate posterior $q_\phi(z|x)$ and the prior $p_\theta(z)$ reaches zero. As a result, the decoder network reduces to merely a language model (RNNLM) and does not generate novel samples. The network should instead be trained such that both loss terms gradually decrease towards convergence. This problem was also reported by [3], who proposed cost annealing and word dropout at the decoder as a remedy. For cost annealing, after $r_1$ steps of training they gradually increase the weight of the KL-divergence loss term from 0 to 1, while retaining the full reconstruction loss term from beginning to end. This approach did not work for us, and we could not train the model on our data. We therefore propose a variant of their approach, which worked well for us.

– Weighted cost annealing: We use an improved version of the KL cost annealing presented in [3]. We use the weighted loss function of Eq. 3 and start training the model with $\lambda = 0$, keeping it fixed for the first e epochs, i.e., $\lambda_{(0\text{-}e)} = 0$. We then increase it by r after every e epochs, i.e., $\lambda_{(e\text{-}2e)} = \lambda_{(0\text{-}e)} + r$. Here, e and r are hyper-parameters.

$$ L(\phi, \theta, x) = \lambda \cdot \mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) - (1 - \lambda) \cdot \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \qquad (3) $$

– To avoid a discrepancy between the training and inference procedures, we use the special case of scheduled sampling mentioned in [1], i.e., we always take the word from the predicted distribution as input instead of the actual word from the input sentence. We pass z at every step of the LSTM decoder along with the highest-probability word taken from the predicted distribution, i.e., greedy decoding: $w_t = \arg\max_{w_t} p(w_t \mid w_{0,\ldots,t-1}, h^d_{t-1}, z)$.

– To make the decoder rely more on z during sentence decoding, we use word dropout similar to [3], passing the UNK token to the next step instead of the word predicted by the decoder using greedy decoding. During the decoding of a sentence using z, a fraction k of the words is replaced by UNK at random, where $k \in [0, 1]$ is also treated as a hyper-parameter.

Inference: Sentence Generation. To generate sentences similar to the input sentences, we use the recognition model to obtain the parameters ($\mu$, $\sigma$) of the distribution corresponding to a sentence. If we sampled z from this distribution directly, the sampling step would be non-continuous and we would not be able to learn the parameters of the neural network using back-propagation. The re-parameterization trick [15] comes to our rescue here: we sample $\epsilon$ and obtain z using Eq. 4, which is a continuous and therefore differentiable function. The sampled encodings are decoded by the generative model using greedy decoding to obtain the sentences. Table 1 shows a set of generated sentences and the corresponding source sentence.

$$ z = \mu + \epsilon \cdot \sigma, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \qquad (4) $$

Source Sent: where can i apply earned leave?

Generated Sentences
where can i apply earned leave ?
where can can check lwp guidelines ?
where can i apply sick leave ?
where can i apply earned leave ?
where can i apply earned leave ?
where can i find lwp guidelines ?
where can i apply earned leave ?
where can can apply casual leave ?
how can i apply earned leave ?
where can i find lwp leave ?

Table 1. Sentence generation: sentences generated by the LSTM-VAE for a source sentence. Sentences with consecutively repeated words (e.g., "where can can ...") are discarded later (Section 5.2).

5.2 Sentence Selection

It is difficult to use the sentences generated by the LSTM-VAE in a downstream task (classification, in our case) without preprocessing, because of the word-repetition problem in generated samples observed by various researchers [25]. We therefore discard all sentences containing consecutively repeated words, such as the repeated-word examples in Table 1. The remaining sentences may still not be syntactically and semantically correct, so we choose the top-K sentences sorted by likelihood under an LM (see Section 4.3) trained on the original training data.
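The selection step can be summarized by the short sketch below, which also includes the de-duplication against the training data described in Section 6.2. It assumes a placeholder callable lm_log_likelihood standing in for the RNNLM of Section 4.3; it is not the paper's actual scoring function.

```python
from typing import Callable, List

def has_consecutive_repeat(sentence: str) -> bool:
    """Discard rule of Section 5.2: reject any sentence that immediately repeats a word
    (e.g. 'where can can check lwp guidelines ?')."""
    tokens = sentence.split()
    return any(a == b for a, b in zip(tokens, tokens[1:]))

def select_sentences(generated: List[str],
                     training_set: set,
                     lm_log_likelihood: Callable[[str], float],
                     top_k: int) -> List[str]:
    """Keep novel, non-repeating sentences and return the top-K ranked by LM likelihood."""
    kept = [s for s in generated
            if s not in training_set and not has_consecutive_repeat(s)]
    return sorted(kept, key=lm_log_likelihood, reverse=True)[:top_k]
```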
5.3 LSTM based Sentence Classification

Base Classifier (M1): We use a single-layer recurrent neural network with LSTM units, trained on the original data, for classification. We use this as the baseline for the classification task.

Base Classifier augmented with generated sentences (M2): To obtain labels for the novel sentences generated by the VAE, we use M1 and choose the top N sentences, based on the entropy of the softmax distribution, as candidates for augmenting the training data. We manually verify the labels, correct those that M1 classified incorrectly, and remove the sentences that clearly correspond to new classes. We augment the training data with this new dataset and train another classification model (M2), as shown in Figure 2.

Base Classifier augmented with generated sentences based on word embeddings (M3): Here, we augment the training set with additional sentences generated by replacing some words in each training sentence. All the words in a sentence except English stopwords and leave types ({sick, casual, etc.}, excluded to avoid changing the class label) are selected as candidate words for replacement. We replace a candidate word with the word corresponding to the nearest neighbour of the candidate word's embedding, where the nearest neighbour is chosen by cosine similarity between word embeddings obtained from GloVe [21]. We generate one sentence per training sentence and augment the training data with these generated sentences to train another LSTM-based classification model (M3); a sketch of these two augmentation strategies follows below.
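The following sketch illustrates, under simplifying assumptions, the two augmentation steps just described: entropy-based selection of labelling candidates for M2 (classifier_probs stands in for M1's softmax output), and GloVe nearest-neighbour word replacement for M3. The GloVe file name, the stopword list and the leave-type list are illustrative assumptions, and replacing every candidate word (rather than a single one) is an implementation choice of this sketch.

```python
import math
from typing import Callable, Dict, List, Tuple
import numpy as np

def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def candidates_for_labelling(sentences: List[str],
                             classifier_probs: Callable[[str], List[float]],
                             top_n: int) -> List[Tuple[str, int]]:
    """M2: pick the top-N sentences by (low) softmax entropy of M1 as candidates for manual review."""
    scored = sorted(((s, classifier_probs(s)) for s in sentences),
                    key=lambda sp: entropy(sp[1]))                 # most confident first
    return [(s, int(np.argmax(p))) for s, p in scored[:top_n]]     # predicted label, to be verified by hand

def load_glove(path: str = "glove.6B.100d.txt") -> Dict[str, np.ndarray]:
    """Parse GloVe's plain-text format: one word followed by its vector per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

STOPWORDS = {"i", "to", "the", "a", "on", "can", "how", "where", "do"}                  # illustrative only
LEAVE_TYPES = {"sick", "casual", "earned", "adoption", "maternity", "flexi", "lwp"}     # illustrative only

def nearest_neighbour(word: str, vectors: Dict[str, np.ndarray]) -> str:
    """Most cosine-similar vocabulary word, excluding the word itself."""
    v = vectors[word] / np.linalg.norm(vectors[word])
    best, best_sim = word, -1.0
    for other, u in vectors.items():
        if other != word:
            sim = float(np.dot(v, u / np.linalg.norm(u)))
            if sim > best_sim:
                best, best_sim = other, sim
    return best

def perturb_sentence(sentence: str, vectors: Dict[str, np.ndarray]) -> str:
    """M3: replace each candidate word (non-stopword, non-leave-type, in vocabulary) by its nearest neighbour."""
    return " ".join(w if (w in STOPWORDS or w in LEAVE_TYPES or w not in vectors)
                    else nearest_neighbour(w, vectors)
                    for w in sentence.split())
```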
6 Results and Discussion

6.1 Experimental Setup and Training

Dataset Description: We use a dataset of leave-policy questions created by HR officers¹ for chatbot creation; see Table 2 for details. This dataset was divided into three parts (training, validation and test) in a 60:20:20 ratio. The training and validation datasets are used for training the LSTM-VAE model, and we select the best model based on the loss on the validation data, i.e., $L(\phi, \theta, x)$ of Eq. 3. New sentences are generated using only the training data as input. The test data is used for reporting the performance metrics of the classification model and is never exposed to the LSTM-VAE model. Another test set comprising only the core concept words was also used for testing (see Section 6.3).

Characteristic | Value
N: Number of sentences | 2916
c: Number of classes | 117
l: Average sentence length | 11
V: Vocabulary size | 1108

Table 2. Characteristics of the Leave dataset.

¹ A subset of this dataset will be shared on demand for research purposes.

Training Details: Word embeddings for tokens (delimited by space) are initialized randomly and learned during training. We use Adam [13] for optimization, and the learning rate is selected from the range [1e-2, 1e-3] for all models; the choice of the other hyper-parameters is given in Table 3. For regularization of the classification model, we use dropout [24, 22] on the word embedding and LSTM layers, along with batch normalization [12]. For the LM, we use truncated back-propagation for gradient calculation. Next, we present the details of a) the new classes generated by our approach, b) results of core concept word testing demonstrating linguistic training bias, and finally c) the improvement in accuracy obtained using the generated sentences.

Parameter | Range
Dimension of word embeddings | {50, 100, ..., 300}
Dimension of z (chose 30) | {20, 30, 50}
Weighted cost annealing: r (chose 0.05) | {0.1, 0.05, 0.025}
Weighted cost annealing: e (chose 10) | {5, 10, 15}
k (LSTM-VAE word drop) | {0.5, 0.75, 1.0}
Word drop rate (classification model and language model) | {0, 0.1, 0.2, ..., 0.5}

Table 3. Hyper-parameter tuning ranges.

6.2 New Classes generated by LSTM-VAE

When preparing the questions to train a chatbot as described in Section 3, it is hard to visualize all possible questions (i.e., query sets) that users may ask once the chatbot is made live. Using the LSTM-VAE we were able to generate many new query sets, which were accepted by the core HR team. Figure 2 presents the entire workflow followed to generate the novel queries. Out of the 175,000 queries generated using the LSTM-VAE, we removed the queries already present in the training data, as well as those in which the same word repeats more than once consecutively. After this stage we obtained about 5,700 sentences. These sentences were then scored using an LM (as described in Section 4.3), and we kept only the top 1500 sentences based on likelihood. Many of these sentences were found to be grammatically correct, while only some of them were semantically inconsistent. In this process it was discovered that 434 sentences did not belong to any of the existing classes. These sentences were given to the HR team for review, and they selected 120 sentences belonging to 33 new classes. We have made this chatbot live in our company and these 33 classes are in practical use (see Key Contribution 2).

Further, in Table 4 we show sample novel sentences generated by the LSTM-VAE belonging to 9 different classes (the groups are separated by blank lines in the table). For every novel sentence, we also show the most similar sentence taken from the training data, identified using Jaccard similarity.

Fig. 2. Sentence generation process flow, where s: VAE-generated sentences, c: classification of VAE-generated sentences into existing classes, ns: novel sentences found during manual labelling of the generated sentences, nc: classification of valid ns by HR officers into new classes

Novel sentences generated by LSTM-VAE | Most similar sentences from training data
Can adoption leave be applied in advance | In what all scenarios can adoption leave be applied
Can i use adoption leave in advance | Can i avail adoption leave in parts?

How do i see more on leave ? | Where do i get more information on adoption leave ?
Please help with know more information on leave ? | Pls help with more information on timesheet leave ?
Where do i get more about on leave ? | Where do i get more information on adoption leave ?
Where do i see faq on leave ? | Where do i get more information on adoption leave ?

Can lwp be encashed ? | Can casual leave be planned ?

What is the eligibilty to encash sick leaves ? | What is the method to encash adoption leave ?

Are sick leaves credited during to separation ? | Are sick leaves credited during lwp ?

Can casual leave combined with earned leave ? | Can al be combined with casual leave ?

How many days leaves can credited in a quarter ? | How many casual leaves are credited in a quarter ?

Can i cancel my leave in system ? | How can i cancel my availed casual leave ?
What is the procedure to cancel for leave ? | What is the procedure to cancel adoption leave ?

I have resigned recently . i go for lwp ? | I have resigned. can i go for casual leave ?

Table 4. Novel sentences generated using the LSTM-VAE and the corresponding most similar sentences present in the training data, based on Jaccard similarity.
6.3 Core Concept Word Testing

Next, our HR team tested the chatbot built using models M1 and M2 (see Section 5.3) using only core concept words as inputs, rather than complete sentences; see Table 5. To build the model M2, we added sentences generated by the process flow shown in Figure 2. Out of the 1500 sentences chosen by the LM, 1066 were manually found to belong to existing classes, and we ran the model M1 on these sentences. We chose the top 250 correctly classified sentences based on low entropy, together with all 130 wrongly classified sentences, and combined these with the training data to train the model M2. These 380 sentences belong to only 74 out of 117 classes. As shown in Table 6, a classifier built using only the original training dataset tends to have linguistic training bias. This bias is reduced when the generated sentences are also used (see Key Contribution 1). The difference of 9% can be attributed to new sentences generated via perturbation of the surrounding words and to the high-quality sentences generated by the LSTM-VAE.

Sentences from training data | Concept word query
Are business associates eligible to take al? | Adoption leave business associates
Are business associates eligible for al? |

Where can i find the policy on casual leave? | Casual leave policy path
Where can i read more about casual leave ? |
Where do i get more info on casual leaves? |
Where i can find cl guidelines ? |

How many flexi leaves will i get ? | Flexi Leave Entitlement
What is flexi leave and how can i avail them ? |

Can i apply ml if i am on lwp ? | Maternity leave entitlement while on LWP
How can i go about applying ml if on lwp ? |
Is it possible for me to apply ml when on lwp ? |

What happens to my unutilized sick leave during separation ? | Sick leave resign
In event of separation can i encash my sick leave ? |

Table 5. Training data queries and the corresponding concept-word queries.

6.4 Accuracy on Test set

We now analyze the impact of adding this training data on the hold-out set. Here, a small gain of about 2% was observed in classification accuracy, as shown in Table 6 (see Key Contribution 3). While this gain is small, it is important that the classifier uses the right words, especially since the test set, being a subset of the curated questions, also carries the linguistic training bias. It was observed that the model M1 often classified queries using non-relevant words, as reflected in Table 6. Even if such a classifier had performed better on the test data, we believe it would perform worse on actual user queries; of course, this can only be confirmed by a controlled A/B test on real users.

Dataset | M1 | M2 | M3
Concept Word Testing | 55.00% | 64.28% | 62.14%
Test Set | 79.41% | 81.13% | 80.78%

Table 6. Accuracy of the classifiers on the test set and on queries containing concept words (CW) only; the number of classes is 117 for M1, M2 and M3.

6.5 Analysis: Why Linguistic Training Bias Gets Reduced

During user testing of the FAQ-bot, it was observed that the model's decision to classify a user query into a certain class was often based on non-concept words. We suspected that this phenomenon occurs due to linguistic training bias, as explained in Section 1. New sentences generated by the LSTM-VAE that are wrongly classified by M1 are likely to be of this type, and are therefore more value-adding for the model. For example, many generated sentences contain the same surrounding words (e.g., 'gets') but belong to different classes, e.g., "When my casual leaves gets credited ?" and "When my maternity leaves gets credited ?". When such sentences are added to the training set, the model is indirectly forced to learn to distinguish the classes based on words other than these non-concept words. Perhaps this is one of the reasons why, with the addition of merely 130 sentences (less than 10% of the original training data), an accuracy gain of almost 10% was observed during concept-word based testing of the model, as shown in Table 6. It would otherwise have required many more new sentences to achieve the same gain in classification accuracy.
7 Conclusion

Based on our experience of building and deploying an FAQ-bot in practice, we have noted that linguistic training bias leads to over-fitting of the deep-learning model. Such bias is more likely to occur when the training data is created by domain experts, who tend to use templates when writing natural language sentences. We have shown that such bias can be corrected to a certain extent by generating novel sentences using a generative model. We described an approach for such a generative model, which uses an LSTM-VAE followed by sentence selection using an LM. We presented weighted cost annealing, an improved approach for training the LSTM-VAE, and showed that the accuracy of a classification model can be increased significantly by using the generated sentences. Most importantly, we have shown that our approach was able to generate new classes of questions for our FAQ-chatbot, not present in the original training data, which were reviewed and accepted by the domain experts for deployment.

References

1. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, 2015.
2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
3. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 2016.
4. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
5. Ann Copestake and Dan Flickinger. An open source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second International Conference on Language Resources and Evaluation, 2000.
6. Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28, 2015.
7. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
8. Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances in Neural Information Processing Systems 29, 2016.
9. Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
10. David Ha and Douglas Eck. A neural representation of sketch drawings. CoRR, 2017.
11. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
12. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
13. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, 2014.
14. Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, 2014.
15. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
16. Yitong Li, Trevor Cohn, and Timothy Baldwin. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017.
17. Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
18. Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In 11th Annual Conference of the International Speech Communication Association, 2010.
19. George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 1990.
20. Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Counter-fitting word vectors to linguistic constraints. In Proceedings of HLT-NAACL, 2016.
21. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2014.
22. Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves recurrent neural networks for handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.
23. Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. CoRR, 2017.
24. Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
25. Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
26. Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
27. Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. CoRR, 2017.
28. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. CoRR, 2014.

Appendix: Effect of different training procedures on sentence generation

– Weighted cost annealing: As mentioned in Section 5.1, training the LSTM-VAE by minimizing the loss according to Equation 5,

$$ L(\phi, \theta, x) = \mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) - \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \qquad (5) $$

causes the KL-divergence loss to increase for a few time-steps initially and then drop to zero, as shown in Figure 3. This makes the decoder behave like an RNNLM. To overcome this issue we use weighted cost annealing, i.e., we increase the weight of the KL-divergence loss linearly after every e epochs and simultaneously reduce the weight of the reconstruction loss (a sketch of this schedule is given at the end of this appendix). With this schedule, even though the KL-divergence loss increases initially for a few time-steps, it then decreases over the training steps but remains non-zero, as shown in Figure 3.
– As described in Section 5.1, the input to the LSTM-VAE decoder for predicting the word at time t + 1 is the output of the decoder at time t, rather than the actual word from the input sentence. In our case we found that this helps the LSTM-VAE generate more semantically and syntactically correct sentences than those generated by feeding the actual word to the decoder, as can be observed in Table 7.

Fig. 3. KL-divergence loss over the training steps (horizontal axis)

Source Sent: path to apply sick leave?

Predicted word as decoder input | Actual word as decoder input
path to apply casual leave ? | path sick sick sick cl cl cl ? ? ? ?
path to apply sl ? | path cl cl cl cl cl ? ? ? ? ?
where to apply casual leave ? | path sick sick sick cl cl cl ? ? ? ?
path to apply earned leave ? | who sick sick cl cl cl cl ? ? ? ?
how to apply adoption leave ? | path sick sick sick cl cl cl ? ? ? ?
can to apply adoption leave ? | path sick sick cl cl ? ? ? ?
link to apply sick leave | path sick sick sick cl cl cl ? ? ? ?
path to apply leave ? | path path sick sick cl cl cl el ? ? ? ?
path path apply adoption leave ? | path sick sick cl cl cl cl ? ? ? ?
path to apply adoption leave ? | path sick sick cl cl cl ? ? ? ?

Table 7. Sentences generated using the VAE under the two different training scenarios.
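To make the contrast between the fixed-weight loss of Eq. (5) and weighted cost annealing (Eq. (3)) concrete, here is a small sketch of the epoch-dependent weight schedule. The symbol lambda and the helper names are our own; the default e and r correspond to the values chosen in Table 3, and capping the weight at 1 is an assumption of this sketch.

```python
def annealing_weight(epoch: int, e: int = 10, r: float = 0.05) -> float:
    """Weighted cost annealing: lambda = 0 for the first e epochs, then increased by r after every e epochs."""
    return min(1.0, (epoch // e) * r)    # capped at 1 (assumption of this sketch)

def weighted_vae_loss(kl_term: float, recon_term: float, epoch: int,
                      e: int = 10, r: float = 0.05) -> float:
    """Eq. (3): lambda * KL + (1 - lambda) * reconstruction loss,
    where recon_term is the reconstruction loss, i.e. -E_q[log p_theta(x|z)].
    With fixed, equal weights this reduces (up to scale) to the loss of Eq. (5)."""
    lam = annealing_weight(epoch, e, r)
    return lam * kl_term + (1.0 - lam) * recon_term

# Example schedule with e = 10 and r = 0.05:
# epochs 0-9 -> lambda = 0.00, epochs 10-19 -> 0.05, epochs 20-29 -> 0.10, ...
```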