<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Correcting Linguistic Training Bias in an FAQ-bot using LSTM-VAE</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mayur Patidar</string-name>
          <email>patidar.mayur@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Puneet Agarwal</string-name>
          <email>puneet.a@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lovekesh Vig</string-name>
          <email>lovekesh.vig@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautam Shroff</string-name>
          <email>gautam.shroff@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Research</institution>
          ,
          <addr-line>New Delhi</addr-line>
        </aff>
      </contrib-group>
      <fpage>65</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>We consider an automated assistant that has been deployed in a large multinational organization to answer employee FAQs. The system is based on an LSTM classifier that has been trained on a corpus of questions and answers carefully prepared by a small team of domain experts. We find that linguistic training bias creeps into the manually created training data because specific phrases are used with little or no variation, which biases the deep-learning classifier towards incorrect features. Further, the FAQs as envisaged by the trainers are often incomplete, and transferring linguistic variations across question-answer pairs can uncover new question classes for which answers are missing. In this paper we demonstrate that an LSTM-based variational auto-encoder (VAE) can be used to automatically generate linguistically novel questions, which (a) correct classifier bias when added to the training set, (b) uncover incompleteness in the set of answers, and (c) improve the accuracy and generalization abilities of the base LSTM classifier, enabling it to learn from smaller training sets.</p>
      </abstract>
      <kwd-group>
        <kwd>Variational Autoencoders</kwd>
        <kwd>Classification</kwd>
        <kwd>Language Modelling</kwd>
        <kwd>FAQ-bot</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The recent successes of deep learning techniques for NLP have seemingly obviated the need for careful crafting of features based on linguistic properties. Deep models purportedly can learn the features required for a task directly from data, provided it comes in sufficient volume and variety, typically obtained in 'natural' settings such as the web. However, in practical applications such as the one we describe here, training data is far more limited and is often created manually, in a curated manner, specifically for the task at hand. We show that linguistic training bias often creeps into such training data and degrades the performance of deep models. Further, manual curation can often result in incomplete task specification due to insufficient linguistic variation in the training data.</p>
      <p>
        We have created and deployed a chatbot for the use of employees in our large organization, which answers human resource (HR) policy related questions in natural language. A deep-learning model in the form of a Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] classifier was used for mapping questions to classes, with each class having a manually curated answer. A small team of HR officers (5 members) manually created the training data used to train this classifier.
      </p>
      <p>We observed that the chosen deep-learning technique performed better than traditionally known approaches on a given test set (also created by the same HR officers). However, the deep model also sometimes classified a user query using clearly irrelevant features (i.e., words). For example, the question “when my sick leave gets credited ?” was classified into a category related to ‘Adoption Leave’, resulting in a completely irrelevant answer. Upon analysis we discovered that this happens mainly because the words surrounding ‘sick leave’ (e.g., “gets”) in the query occurred more often in the training data for ‘Adoption Leave’. As a result, if such words occur in a user’s query, the model ignores other important words (such as ‘sick leave’) and classifies the query into an incorrect class based on such words. We refer to this phenomenon as linguistic training bias, which creeps in merely because the templates used for different classes are not exhaustive. Such examples led us to fear that the system would not generalize well when exposed to real users.</p>
      <p>More generally, relying on human curation often results in such linguistic training biases creeping into the training data, since every individual has a specific style of writing natural language and uses certain words only in specific contexts. Deep models end up learning these biases instead of the core concept words of the target classes.</p>
      <p>
        In order to correct these biases we automatically generate meaningful sentences using a generative model and then, after suitable human annotation, use them to train the classification model. We use a Variational Autoencoder (VAE) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] as our generative model for generating novel sentences and utilize a Language Model (LM) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] for selecting sentences based on likelihood. We model the VAE using RNNs comprising LSTM units. As we shall demonstrate in Section 6, this approach gives us a gain of about 9% in classification accuracy when tested with core concept words only.
      </p>
      <p>Further, the HR officers created the classes and corresponding questions based on their experience of which questions are frequently asked by employees of the organization. However, when a chatbot is available, users tend to ask additional questions that are not asked otherwise. It is therefore imperative to have broader coverage of such classes. If we can show automatically generated novel classes of questions to the HR officers, it becomes easier for them to accept or reject the proposed classes rather than having to imagine all such possibilities. As we demonstrate in Section 6.2, our approach generated many new classes that were accepted by the HR officers for deployment.</p>
      <p>More specifically, the use of a deep generative model to augment training sentences resulted in the following important benefits, which form the key contributions of this paper:
1. Augmenting the training data with automatically generated sentences corrects over-fitting due to linguistic training bias. To show this we present results on ‘concept words’, indicating potentially better generalization.
2. The newly generated sentences sometimes belonged to completely new classes not present in the original training data: 33 new classes were found in practice.
3. Augmenting the training data with automatically generated sentences results in an improved accuracy (2%) of the deep-learning classifier.</p>
      <p>We also present an improved approach for training VAEs, weighted cost annealing, based on our experience of training an LSTM-VAE on a real-life dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        VAEs have found widespread usage in image generation [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ], in text modeling [
        <xref ref-type="bibr" rid="ref17 ref3">3, 17</xref>
        ] and in recent work on sketch drawing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Researchers have also used VAEs for semi-supervised learning [
        <xref ref-type="bibr" rid="ref14 ref26">14, 26</xref>
        ]. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used a VAE and showed positive results for novel sentence generation and missing-word imputation, and shared techniques for efficiently training VAEs. They also observed negative results on the language modeling task. Dilated CNNs (Convolutional Neural Networks) and combinations of RNNs and CNNs have been used for modeling the VAE encoder and decoder in [
        <xref ref-type="bibr" rid="ref23 ref27">27, 23</xref>
        ]. We have used a VAE architecture similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with some modification in the training procedure, referred to here as weighted cost annealing.
      </p>
      <p>
        As shown in [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ], deep learning models misclassify perturbed input samples, including adversarial examples, with high confidence. Several methods have been proposed for training a robust model, for example, dropout [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and training with noisy samples [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In the text domain, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] have shown that training data augmented with noisy training sentences leads to better classification performance. They used WordNet [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the counter-fitting method described in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and a deep linguistic parser [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] together with sentence compression to generate semantically and syntactically noisy training sentences. Unlike [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we augment the training data with sentences generated by an LSTM-VAE. We have observed that the sentences generated by the LSTM-VAE are not always semantically and syntactically correct, i.e., the LSTM-VAE also acts as a noisy sentence generator. Importantly, we observed an improvement in classification accuracy after adding the generated sentences.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem Description</title>
      <p>We assume that a dataset of frequently asked questions for building a chatbot comprises sets of semantically similar questions Qi = {q1, ..., qni} and their corresponding answer ai. A set of such questions Qi and the corresponding answer ai are collectively referred to as a query set si = {Qi, ai}. The questions of a query set si are denoted Qi = Q(si). We assume that the dataset D comprises many such query sets, i.e., D = {s1, ..., sm}. In our chatbot implementation, given a user’s query q, the objective is to select the corresponding query set s via a multi-class classification model, so that the corresponding answer a can be shown.</p>
      <p>Given all the questions in the training data D, Q = ∪ Q(si), ∀ si ∈ D, we intend to generate new questions Q′ using the LSTM-VAE. Some of the questions in Q′ are semantically similar to one of the query sets of D, while the remaining questions do not belong to any of the existing classes. Using these newly generated queries Q′, we present the results of our study, comprising a) an analysis of the new query sets s′ and the results of their review by HR domain experts, b) a comparison of classifiers trained on Q and on Qnew (= ∪{Q, Q′}, where ∀ q ∈ Qnew ∃ i s.t. q ∈ si) using core concept words rather than full natural language questions, and c) an analysis of how the accuracy of the classification model improves.</p>
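      <p>A minimal sketch (ours, not the authors' code) of the data structures implied by this formulation may make the notation concrete; the field names and the lookup flow below are assumptions for illustration only.</p>
      <preformat>
# Illustrative sketch: the query-set structure of Section 3.
# Field names and the answer_query flow are assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class QuerySet:
    """A query set s_i = {Q_i, a_i}: paraphrased questions sharing one answer."""
    questions: List[str]   # Q_i = {q_1, ..., q_ni}
    answer: str            # a_i

# The dataset D = {s_1, ..., s_m}; answering a user query q means predicting
# the index i of the matching query set and returning its answer a_i.
D = [
    QuerySet(questions=["where can i apply earned leave ?",
                        "how do i apply for earned leave ?"],
             answer="(answer text for this query set)"),
    QuerySet(questions=["when is sick leave credited ?"],
             answer="(answer text for this query set)"),
]

def answer_query(query, classifier):
    """classifier maps a question string to a query-set index (multi-class model)."""
    i = classifier(query)
    return D[i].answer
      </preformat>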
    </sec>
    <sec id="sec-4">
      <title>Background</title>
      <p>
        Recurrent neural networks (RNNs) are known for modeling sequential dependencies in the input, e.g., in language modeling tasks where the prediction of a word depends on the previous words in the sentence. Vanilla RNNs generally do not handle long-term dependencies in the input due to the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is overcome by using RNNs with LSTM cells in the hidden layers. In this section, we describe the details of the various components of our approach, comprising an LSTM-VAE and a language model (LM).
      </p>
      <sec id="sec-4-1">
        <title>Sequence Autoencoder</title>
        <p>Autoencoders are neural networks used for learning lower-dimensional representations of data, referred to as encodings or latent representations (z). An autoencoder comprises two components: an encoder (z = fenc(x)) and a decoder (x̂ = fdec(z)), where the encoder transforms the input x to z and the decoder tries to reconstruct x from z. Autoencoders are trained by minimizing the reconstruction error of the decoded data x̂ with respect to the input x, i.e., L(x, x̂) = ‖x − x̂‖².</p>
        <p>
          The choice of encoder and decoder networks generally depends on the input data. A sequence autoencoder uses RNNs to encode (as well as to decode) sequential input data, e.g., a sentence. Latent representations of data learned by a sequence autoencoder can also be used for classification. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] uses RNNs with LSTM units for both the encoder and the decoder of the sequence autoencoder. Sequence autoencoders cannot be used to generate novel samples as they do not enforce a prior distribution on the latent representations. Also, as shown by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the encodings learned by a sequence autoencoder are unable to capture semantic features.
        </p>
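        <p>The following is an illustrative PyTorch sketch of such a sequence autoencoder; module names and dimensions are our assumptions and are not taken from the paper.</p>
        <preformat>
# Minimal sketch (assumed, PyTorch): an LSTM sequence autoencoder in the spirit
# of Section 4.1. Dimensions and module names are illustrative, not the paper's.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        _, (h, c) = self.encoder(emb)           # final state acts as the encoding z
        dec_out, _ = self.decoder(emb, (h, c))  # decode with teacher forcing
        return self.out(dec_out)                # logits used for reconstruction

# Reconstruction loss: cross-entropy between predicted and input tokens,
# the discrete analogue of L(x, x̂) = ‖x − x̂‖² used for continuous inputs.
        </preformat>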
      </sec>
      <sec id="sec-4-2">
        <title>Variational Autoencoder</title>
        <p>The VAE is a generative model which, unlike the sequence autoencoder, is comprised of a probabilistic encoder (qφ(z|x), the recognition model) and a decoder (pθ(x|z), the generative model). The posterior distribution pθ(z|x) is known to be computationally intractable. The VAE approximates pθ(z|x) using the encoder qφ(z|x), which is assumed to be Gaussian and is parameterized by φ = {µ, σ}; the encoder learns to predict φ. As a result, it becomes possible to draw samples from this distribution.</p>
        <p>In order to decode a sample z drawn from qφ(z|x) back to the input x, the reconstruction loss also needs to be minimized. The reconstruction loss is represented by Eqφ(z|x)[log pθ(x|z)]. If the VAE were trained like a sequence autoencoder, i.e., with the reconstruction loss only, it would not allow enough variance in qφ(z|x); as a result the VAE would map the input x to a latent representation approximately deterministically. This is avoided by minimizing the reconstruction loss together with the KL-divergence between qφ(z|x) and the prior pθ(z).</p>
        <p>L(φ, θ; x) = −KL(qφ(z|x) ‖ pθ(z)) + Eqφ(z|x)[log pθ(x|z)]    (1)</p>
        <p>
          Here, pθ(z) is assumed to be a multi-variate Gaussian N(0, I). [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] showed that this loss term is a variational lower bound on the log likelihood of x, i.e., log pθ(x).
        </p>
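        <p>A hedged sketch of how the two terms of Eq. 1 can be computed in practice as a minimization objective (the negative of the lower bound): the closed-form KL term for a diagonal Gaussian encoder plus a token-level cross-entropy reconstruction term. This is an illustration, not the authors' implementation.</p>
        <preformat>
# Illustrative sketch (PyTorch) of the two terms of Eq. 1, written as a loss
# to be minimized: reconstruction cross-entropy plus KL(qφ(z|x) ‖ N(0, I)).
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target_tokens, mu, log_var):
    # Reconstruction term: −E_q[log pθ(x|z)] approximated by token cross-entropy.
    # recon_logits: (batch, seq_len, vocab), target_tokens: (batch, seq_len)
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_tokens,
                            reduction="sum")
    # Closed-form KL for a diagonal Gaussian: 0.5 * Σ (µ² + σ² − log σ² − 1)
    kld = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0)
    return recon + kld, recon, kld
        </preformat>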
        <p>
          Sampling z from qφ(z|x) is a non-continuous operation, therefore it is not possible to train the encoder via back-propagation. The re-parameterization trick introduced in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] allows us to train the VAE using stochastic gradient descent via back-propagation. Novel samples can be generated by sampling z from N(0, I) and decoding the samples into the input space using the decoder model.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Language Model</title>
        <p>
          An RNN language model (RNNLM) is a generative model which learns a conditional probability distribution over the vocabulary words. It predicts the next word (wi+1) given the representation of the words seen so far, hi, and the current input wi, where p(wi+1 | hi, wi) = softmax(Ws hi + bs), by maximizing the log likelihood of the next word averaged over the sequence length N. The cross-entropy loss is used to train the language model:
        </p>
        <p>LCE-LM = −(1/N) Σi=1..N log p(wi+1 | hi, wi)    (2)</p>
        <p>
          The performance of an RNNLM is generally measured using perplexity (lower is better), Perplexity = exp(LCE-LM). We have used an RNNLM with LSTM units as described in [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
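        <p>A minimal illustrative sketch of such an RNNLM and its perplexity computation (assumed PyTorch code, with placeholder sizes):</p>
        <preformat>
# Illustrative sketch (PyTorch) of the RNNLM: predict w_{i+1} from (h_i, w_i);
# the loss is the averaged cross-entropy of Eq. 2 and perplexity = exp(L_CE-LM).
# Module names and sizes are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)    # Ws hi + bs

    def forward(self, tokens):                        # tokens: (batch, N)
        h, _ = self.lstm(self.embed(tokens))
        return self.proj(h)                           # logits for the next word

def lm_loss_and_perplexity(model, tokens):
    logits = model(tokens[:, :-1])                    # predict w_{i+1}
    loss = F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
    return loss, torch.exp(loss)                      # perplexity = exp(loss)
        </preformat>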
      </sec>
      <sec id="sec-4-3">
        <title>Classification</title>
        <p>Classification can be considered a two-step process: the first step produces a representation of the data, and the second step uses this representation for classification. Data can be represented using a bag-of-words approach, which ignores word-order information, or using hand-crafted features, which fail to generalize to multiple datasets / tasks. We learn a task-specific sentence representation using RNNs with LSTM units by encoding the variable-length sentence into a fixed-length vector representation h, obtained after passing the sentence through the RNN layer. We then apply a softmax over the affine transformation of h, i.e., p(c | h) = softmax(Ws h + bs). To learn the weights of the above model, we minimize the categorical cross-entropy loss, i.e., LCE = −Σi=1..m yi · log(p(ci | h)), where ci is one of the m classes. Here, yi is 1 only for the target class and zero otherwise.</p>
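        <p>An illustrative sketch of this classifier (assumed PyTorch code; the hyper-parameters shown are placeholders, not those of Table 3):</p>
        <preformat>
# Minimal sketch (PyTorch) of the base LSTM classifier: an LSTM sentence
# encoder followed by a softmax over the affine map of h. Sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_dim=100, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, num_classes)   # Ws h + bs

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return self.proj(h[-1])                       # logits; p(c|h) = softmax(...)

# Training minimizes the categorical cross-entropy over the m classes, e.g.
#   loss = F.cross_entropy(model(batch_tokens), batch_labels)
        </preformat>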
      </sec>
    </sec>
    <sec id="sec-5">
      <title>LSTM-VAE with LM</title>
      <p>
        Similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we have used a single-layer RNN with LSTM units as both the encoder and the decoder of the VAE. We pass the variable-length input sentence to the encoder in reverse order, as shown in Figure 1. The words of a sentence are first converted into vector representations by the word embedding layer before being fed to the LSTM layer. The final hidden state of the encoder LSTM, he,n+1, then passes through a feed-forward layer, which predicts µ and σ of the posterior distribution qφ(z|x).
      </p>
      <p>After sampling z via the re-parameterization trick (explained later in this section), the encoding z passes through a feed-forward layer to obtain hd,0, which is the start state of the decoder RNN. We pass the encoding z as input to the LSTM layer at every time-step. The vector representation of the word with the highest probability predicted at time t is also passed as input at the next time-step (t + 1), as shown in Figure 1.</p>
      <p>Training the LSTM-VAE It is not straightforward to train the LSTM-VAE using the loss function given in Eq. 1. As described in Section 4.2, if the KL-divergence loss term is not used, the model will converge to a state that gives fixed latent representations. Similarly, if the reconstruction loss is high it would lead to meaningless sentences. If equal weightage is given to both loss terms, the model learns to encode the input sequence into a desired latent representation too well, i.e., the KL divergence (KLD) between the approximate posterior qφ(z|x) and the prior pθ(z) reaches zero. As a result, the decoder network reduces to merely a language model (RNNLM) and does not generate novel samples. The network should be trained such that both loss terms gradually decrease towards convergence.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Sentences generated by the LSTM-VAE for the source sentence “where can i apply earned leave?”. Sentences containing consecutively repeated words (later discarded, see Section 5.2) are marked with an asterisk.</p>
        </caption>
        <table>
          <tbody>
            <tr><td>where can i apply earned leave ?</td></tr>
            <tr><td>where can can check lwp guidelines ? *</td></tr>
            <tr><td>where can i apply sick leave ?</td></tr>
            <tr><td>where can i apply earned leave ?</td></tr>
            <tr><td>where can i apply earned leave ?</td></tr>
            <tr><td>where can i find lwp guidelines ?</td></tr>
            <tr><td>where can i apply earned leave ?</td></tr>
            <tr><td>where can can apply casual leave ? *</td></tr>
            <tr><td>how can i apply earned leave ?</td></tr>
            <tr><td>where can i find lwp leave ?</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        This problem was also reported by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], who proposed cost annealing and word dropout at the decoder as remedies. For cost annealing, during the learning phase, after r1 steps of training they increase the weight of the KL-divergence loss term gradually from 0 to 1, while retaining the full reconstruction loss term from beginning to end. This approach did not work for us, and we could not train the model on our data. We therefore propose a variant of their approach, which worked well for us.
      </p>
      <p>
        – Weighted cost annealing: We have used an improved version of the KL cost annealing presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We utilize a weighted loss function as mentioned in Eq. 3 and start training the model with the KL weight λ = 0, keeping it fixed for the first e epochs, i.e., λ(0–e) = 0. We then increase it by r after every e epochs, i.e., λ(e–2e) = λ(0–e) + r. Here, e and r are treated as hyper-parameters (see the sketch after this list).
– To avoid a discrepancy between the training and inference procedures, we have used the special case of scheduled sampling mentioned in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], i.e., we always take the word from the predicted distribution as input instead of the actual word from the input sentence. We pass z at every step of the LSTM decoder together with the highest-probability word taken from the predicted distribution, i.e., greedy decoding: wt = argmax_wt p(wt | w0,...,t−1, hd,t−1, z).
– To make the decoder rely more on z during sentence decoding, we have used word dropout similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by passing the UNK token to the next step instead of the word predicted by the decoder using greedy decoding. During decoding of a sentence using z, a fraction k of the words is replaced by UNK at random, where k ∈ [0, 1] is also taken as a hyper-parameter.
      </p>
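      <p>The sketch below illustrates the weighted cost annealing schedule and the decoder word dropout described above. The symbols λ, e, r and k follow the text; the exact weighted combination of the two loss terms is an assumption, since Eq. 3 is not reproduced here.</p>
      <preformat>
# Hedged sketch of the weighted cost annealing schedule and decoder word dropout.
# The weighting of the two loss terms is an assumed form, not a quote of Eq. 3.
import random

def kl_weight(epoch, e, r):
    """λ stays 0 for the first e epochs, then grows by r after every e epochs."""
    return min(1.0, r * (epoch // e))

def weighted_vae_loss(recon, kld, epoch, e, r):
    lam = kl_weight(epoch, e, r)
    # assumed form: down-weight reconstruction as the KL weight grows
    return (1.0 - lam) * recon + lam * kld

def maybe_unk(word_id, unk_id, k):
    """Word dropout at the decoder input: with probability k, feed UNK instead
    of the word predicted by greedy decoding, so the decoder relies more on z."""
    return random.choices([unk_id, word_id], weights=[k, 1.0 - k])[0]
      </preformat>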
      <p>
        Inference: Sentence Generation To generate sentences similar to the input sentences, we use the recognition model to obtain the parameters (µ, σ) of the distribution corresponding to a sentence. If we were to sample z directly from this distribution, the sampling step would be non-continuous and we would not be able to learn the parameters of the neural network using back-propagation. The re-parameterization trick [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] comes to our rescue in this situation: we sample ε and obtain z using Eq. 4, which is a continuous function and therefore differentiable. These sampled encodings are decoded by the generative model using greedy decoding to obtain the sentences. Table 1 contains a set of generated sentences and the corresponding source sentence.
      </p>
      <p>z = µ + σ · ε, where ε ∼ N(0, I)    (4)</p>
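      <p>An illustrative sketch of the generation step: encode a source sentence to (µ, σ), sample z via Eq. 4 and decode greedily, feeding z and the highest-probability word back at every step. The decoder interface (init_state, decode_step) is an assumed API, not the authors' code.</p>
      <preformat>
# Hedged sketch of sentence generation with the re-parameterization trick (Eq. 4)
# and greedy decoding. encoder/decoder methods and EOS handling are assumptions.
import torch

def sample_z(mu, sigma):
    eps = torch.randn_like(sigma)        # ε ~ N(0, I)
    return mu + eps * sigma              # differentiable in µ and σ

def generate(encoder, decoder, src_tokens, sos_id, eos_id, max_len=20):
    mu, sigma = encoder(src_tokens)      # recognition model qφ(z|x)
    z = sample_z(mu, sigma)
    words, prev, state = [], sos_id, decoder.init_state(z)
    for _ in range(max_len):
        logits, state = decoder.decode_step(prev, z, state)
        prev = int(torch.argmax(logits, dim=-1))   # greedy decoding
        if prev == eos_id:
            break
        words.append(prev)
    return words
      </preformat>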
      <sec id="sec-5-1">
        <title>Sentence Selection</title>
        <p>
          It is difficult to use the sentences generated by the LSTM-VAE in a downstream task (classification in our case) without preprocessing, because of the word repetition problem in the generated samples observed by various researchers [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Therefore, we discard all sentences containing consecutively repeating words, as highlighted in Table 1. The remaining sentences may still not be syntactically and semantically correct. We therefore choose the top-K good sentences sorted by likelihood under an LM (see Section 4.3) trained on the original training data.
        </p>
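        <p>A minimal sketch of this selection step, assuming a scoring function lm_log_likelihood provided by the LM of Section 4.3:</p>
        <preformat>
# Hedged sketch of sentence selection: drop sentences with consecutively
# repeated words, then keep the top-K sentences ranked by LM likelihood.
# lm_log_likelihood is an assumed scoring function (e.g., the negative of the
# per-sentence cross-entropy of the RNNLM).
def has_consecutive_repeat(sentence):
    words = sentence.split()
    return any(a == b for a, b in zip(words, words[1:]))

def select_sentences(generated, lm_log_likelihood, top_k=1500):
    kept = [s for s in set(generated) if not has_consecutive_repeat(s)]
    ranked = sorted(kept, key=lm_log_likelihood, reverse=True)
    return ranked[:top_k]
        </preformat>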
      </sec>
      <sec id="sec-5-2">
        <title>LSTM-based Sentence Classification</title>
        <p>Base Classifier (M1): We use a single-layer recurrent neural network with LSTM units for classification, trained on the original data. We use this as the baseline for the classification task.</p>
        <p>Base Classifier augmented with generated sentences (M2): To obtain labels for the novel sentences generated by the VAE, we use M1 and choose the top N sentences, based on the entropy of the softmax distribution, as candidates for augmenting the training data. We manually verify the labels and correct any label that is wrongly assigned by M1; we also remove the sentences that clearly correspond to new classes. We augment the training data with this new dataset and train another classification model (M2), as shown in Figure 2.</p>
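        <p>An illustrative sketch of the candidate selection for M2, assuming a hypothetical classifier_probs function that returns the softmax distribution of M1 for a sentence:</p>
        <preformat>
# Hedged sketch: label generated sentences with M1 and rank them by the entropy
# of the softmax distribution before manual verification. classifier_probs is a
# hypothetical helper returning M1's softmax vector for a sentence.
import math

def softmax_entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def rank_candidates(sentences, classifier_probs):
    scored = []
    for s in sentences:
        probs = classifier_probs(s)
        label = max(range(len(probs)), key=probs.__getitem__)   # argmax class
        scored.append((softmax_entropy(probs), label, s))
    scored.sort()              # low-entropy (confident) predictions come first
    return scored
        </preformat>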
        <p>
          Base Classifier augmented with generated sentences based on word embeddings (M3): Here, we augment the training set with additional sentences generated by replacing some words in each training sentence. All the words in a sentence except English stopwords and leave types {sick, casual, etc.} (to avoid changing the class label) are selected as candidate words for replacement. We replace a candidate word with the word corresponding to the nearest neighbour of the candidate word's embedding. The nearest neighbour is chosen based on the cosine similarity between the two word embeddings, obtained from GloVe [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We generate one sentence for each training sentence and augment the training data with these generated sentences to train another LSTM-based classification model (M3).
        </p>
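        <p>A hedged sketch of this augmentation, assuming GloVe vectors loaded into a plain dictionary; the choice of replacing a single candidate word per sentence is illustrative.</p>
        <preformat>
# Illustrative sketch of the M3 augmentation: replace a candidate word by the
# nearest neighbour of its GloVe embedding under cosine similarity, skipping
# stopwords and leave-type words so the class label is preserved.
# glove is assumed to be a dict mapping words to numpy vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def nearest_neighbour(word, glove):
    u = glove[word]
    others = (w for w in glove if w != word)
    return max(others, key=lambda w: cosine(u, glove[w]))

def augment(sentence, glove, stopwords, leave_types):
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w in glove and w not in stopwords and w not in leave_types]
    if not candidates:
        return None
    i = candidates[0]                    # one replacement per sentence (assumed)
    words[i] = nearest_neighbour(words[i], glove)
    return " ".join(words)
        </preformat>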
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <sec id="sec-6-1">
        <title>Experimental Setup and Training</title>
        <p>Dataset Description: We use a dataset of Leave policy questions, created by HR officers¹, for chatbot creation; see Table 2 for more details. This dataset was divided into three parts (Training, Validation and Test) in a 60:20:20 ratio. The Training and Validation datasets are used for training the LSTM-VAE model; here we select the best model based on the loss on the validation data, i.e., L(φ, θ; x), see Eq. 3. New sentences are generated using only the training data as input. The test data is used for reporting the performance metrics of the classification model and is never exposed to the LSTM-VAE model. Another test set comprising only the core concept words was also used for testing (see Section 6.3).</p>
        <p>[Table 2: dataset characteristics (N, c, l, V).]</p>
        <p>
          Training Details: Word embeddings for tokens (delimited by space) are
initialized randomly and learned during the training. We use Adam [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for optimization; the learning rate is selected from the range [1e-2, 1e-3] for all models, and the choice of other hyper-parameters is given in Table 3. For regularization of the classification model, we have used dropout [
          <xref ref-type="bibr" rid="ref22 ref24">24, 22</xref>
          ] on word embedding
and LSTM layers, along with batch normalization [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For the LM, we used truncated back-propagation for gradient calculation. Next, we present the details of a) the new classes generated by our approach, b) the results of core concept word testing, demonstrating linguistic training bias, and finally c) the improvement in accuracy obtained using the generated sentences.
¹ A subset of this dataset will be shared on demand for research purposes.
        </p>
        <p>[Table 3: hyper-parameter settings, including the dimension of the word embeddings, the dimension of z (30), and the weighted cost annealing parameters.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>New Classes Generated</title>
        <p>When preparing the questions to train a chatbot, as described in Section 3, it is hard to visualize all possible questions (i.e., query sets) that users may ask once the chatbot is made live. Using the LSTM-VAE we were able to generate many new query sets, which were accepted by the core HR team. Figure 2 presents the entire workflow followed to generate the novel queries. Out of the 175,000 queries generated using the LSTM-VAE, we removed the queries already present in the training data, as well as those in which the same word repeats more than once consecutively. After this stage we obtained about 5,700 sentences. These sentences were then scored using an LM (as described in Section 4.3), and we picked only the top 1,500 sentences based on likelihood. Many of these sentences were found to be grammatically correct, while only some of them were semantically inconsistent. In this process it was discovered that 434 sentences did not belong to any of the existing classes. These sentences were given to the HR team for review, and they selected 120 sentences belonging to 33 new classes. We have made this chatbot live in our company and these 33 classes are in practical use (see Key Contribution 2).</p>
        <p>Further, in Table 4 we show sample novel sentences generated by the LSTM-VAE belonging to 9 different classes (separated by horizontal lines in the table). For every novel sentence, we also show the most similar sentence from the training data, identified using Jaccard similarity.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Core Concept Word Testing</title>
        <p>Next, our HR team tested the chatbot built using models M1 and M2 (see Section 5.3) by using only core concept words as inputs, rather than complete sentences: see Table 5. To build the model M2, we added sentences generated by the process flow shown in Figure 2. Here, out of the 1,500 sentences chosen by the LM, 1,066 were manually found to belong to existing classes, and we ran the model M1 on these sentences. We chose the top 250 correctly classified sentences based on low entropy and all 130 wrongly classified sentences, and combined these with the training data to train the model M2. These 380 sentences belong to only 74 out of 117 classes. As shown in Table 6, a classifier built using only the original training dataset tends to exhibit linguistic training bias. This bias gets reduced when the generated sentences are also used (see Key Contribution 1). This difference of 9% can be attributed to new sentences generated via perturbation of surrounding words and to the high quality of the sentences generated by the LSTM-VAE.</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Examples of core concept word queries and the corresponding sentences from the training data.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Concept Word Query</th><th>Sentences from training data</th></tr>
            </thead>
            <tbody>
              <tr><td>Adoption leave business associates</td><td>Are business associates eligible to take al? Are business associates eligible for al?</td></tr>
              <tr><td>Casual leave policy path</td><td>Where can i find the policy on casual leave? Where can i read more about casual leave ? Where do i get more info on casual leaves? Where i can find cl guidelines ?</td></tr>
              <tr><td>Flexi Leave Entitlement</td><td>How many flexi leaves will i get ? What is flexi leave and how can i avail them ?</td></tr>
              <tr><td>Maternity leave entitlement while on LWP</td><td>Can i apply ml if i am on lwp ? How can i go about applying ml if on lwp ? Is it possible for me to apply ml when on lwp ?</td></tr>
              <tr><td>Sick leave resign during separation ?</td><td>What happens to my unutilized sick leave ? In event of separation can i encash my sick leave ?</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-6-2">
        <title>Accuracy on Test set</title>
        <p>We now analyze the impact of adding this training data to the model on a hold-out set. Here, a small gain of about 2% was observed in classification accuracy, as shown in Table 6 (see Key Contribution 3). While this gain is small, it is important that the classifier uses the right words, especially since the test set, being a subset of the curated questions, also carries the linguistic training bias. It was observed that the model M1 often classified the query using non-relevant words, as shown in Table 6. Even if such a classifier had worked better on the test data, we believe it would perform worse on actual user queries; of course this can only be confirmed by a controlled A/B test on real users.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Analysis: Why Linguistic Training Bias gets reduced?</title>
        <p>During user testing of the FAQ-bot, it was observed that the model's decision to classify a user query into a certain class was often based on non-concept words. It was suspected that this phenomenon occurs due to linguistic training bias, as explained in Section 1. New sentences generated by the LSTM-VAE which were wrongly classified by M1 are likely to be of this type, and are therefore more value-adding for the model. For example, many generated sentences contain the same surrounding words (e.g., ‘gets’) but belong to different classes, e.g., “When my casual leaves gets credited ?” and “When my maternity leaves gets credited ?”. When such sentences are added to the training set, the model is indirectly forced to learn to distinguish the classes based on words other than such non-concept words. Perhaps this is one of the reasons why, merely with the addition of 130 sentences (less than 10% of the original training data), an accuracy gain of almost 10% was observed during concept-word based testing of the model, as shown in Table 6. It would otherwise have required many more new sentences to obtain the same degree of gain in classification accuracy.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption>
            <p>Classification accuracy of models M1, M2 and M3 (test-set values were not preserved in the source).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>M1</th><th>M2</th><th>M3</th></tr>
            </thead>
            <tbody>
              <tr><td>Concept Word Testing</td><td>55.00%</td><td>64.28%</td><td>62.14%</td></tr>
              <tr><td>Test Set</td><td>–</td><td>–</td><td>–</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Based on our experience of building and deploying an FAQ-bot in practice, we have noted that linguistic training bias leads to over-fitting of the deep-learning model. Such bias is more likely to occur when the training data is created by domain experts, who tend to use templates when writing natural language sentences. We have shown that such bias can be corrected to a certain extent by generating novel sentences using a generative model. We also described an approach for such a generative model, which uses an LSTM-VAE followed by sentence selection using an LM. We presented weighted cost annealing, an improved approach for training the LSTM-VAE, and showed that the accuracy of a classification model can be increased significantly by using the generated sentences. Most importantly, we have shown that our approach was able to generate new classes of questions for our FAQ-chatbot, not present in the original training data, which were reviewed and accepted by the domain experts for deployment.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix: Effect of different training procedures on sentence generation</title>
      <p>– Weighted cost annealing: As mentioned in Section 5.1, training the LSTM-VAE by minimizing the loss according to Equation 5 causes the KL-divergence loss to increase for a few time-steps initially and then drop to zero, as shown in Figure 3. This makes the decoder behave like an RNNLM. To overcome this issue we use weighted cost annealing, i.e., we increase the weight of the KL-divergence loss linearly after every e epochs and simultaneously reduce the weight of the reconstruction loss. As a result, even though the KL-divergence loss increases initially for a few time-steps, it starts decreasing over the time-steps but remains non-zero, as shown in Figure 3.
L(φ, θ; x) = −KL(qφ(z|x) ‖ pθ(z)) + Eqφ(z|x)[log pθ(x|z)]    (5)
– As described in Section 5.1, the input to the LSTM-VAE decoder for predicting the word at time t + 1 is the output of the decoder at time t, rather than the actual word from the input sentence. In our case we found that this helps the LSTM-VAE generate more semantically and syntactically correct sentences compared to those generated by feeding the actual word to the decoder. This can be observed from Table 7.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Oriol Vinyals, Navdeep Jaitly, and
          <string-name>
            <given-names>Noam</given-names>
            <surname>Shazeer</surname>
          </string-name>
          .
          <article-title>Scheduled sampling for sequence prediction with recurrent neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Simard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <article-title>Learning long-term dependencies with gradient descent is difficult</article-title>
          .
          <source>Trans. Neur</source>
          . Netw.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Samuel</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          , Luke Vilnis, Oriol Vinyals,
          <string-name>
            <surname>Andrew M. Dai</surname>
          </string-name>
          , Rafal Józefowicz, and Samy Bengio.
          <article-title>Generating sentences from a continuous space</article-title>
          .
          <source>In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Wojciech</given-names>
            <surname>Zaremba Christian Szegedy</surname>
          </string-name>
          , Ilya Sutskever, Dumitru Erhan Joan Bruna, Ian Goodfellow, and
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          .
          <article-title>Intriguing properties of neural networks</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Ann Copestake and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Flickinger</surname>
          </string-name>
          .
          <article-title>An open source grammar development environment and broad-coverage english grammar using hpsg</article-title>
          .
          <source>In Proceedings of the Second International Conference on Language Resources and Evaluation</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Andrew M Dai and Quoc V Le</surname>
          </string-name>
          .
          <article-title>Semi-supervised sequence learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ian</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Goodfellow</surname>
            , Jonathon Shlens, and
            <given-names>Christian</given-names>
          </string-name>
          <string-name>
            <surname>Szegedy</surname>
          </string-name>
          .
          <article-title>Explaining and harnessing adversarial examples</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Karol</given-names>
            <surname>Gregor</surname>
          </string-name>
          , Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          .
          <article-title>Towards conceptual compression</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Karol</given-names>
            <surname>Gregor</surname>
          </string-name>
          , Ivo Danihelka, Alex Graves, Danilo Rezende, and
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          .
          <article-title>Draw: A recurrent neural network for image generation</article-title>
          .
          <source>In Proceedings of the 32nd International Conference on Machine Learning</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. David Ha and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <article-title>A neural representation of sketch drawings</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory</article-title>
          .
          <source>Neural Comput.</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Sergey Ioffe
          <string-name>
            <given-names>and Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>In Proceedings of the 32nd International Conference on Machine Learning</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Diederik P Kingma</surname>
            , Shakir Mohamed, Danilo Jimenez Rezende, and
            <given-names>Max</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
          </string-name>
          .
          <article-title>Semi-supervised learning with deep generative models</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Diederik P Kingma and Max Welling</surname>
          </string-name>
          .
          <article-title>Auto-encoding variational bayes</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yitong</surname>
            <given-names>Li</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Cohn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <article-title>Robust training under linguistic adversity</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume</source>
          <volume>2</volume>
          ,
          <string-name>
            <surname>Short</surname>
            <given-names>Papers</given-names>
          </string-name>
          ,
          <year>April 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Yishu</surname>
            <given-names>Miao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Lei</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>Neural variational inference for text processing</article-title>
          .
          <source>In Proceedings of The 33rd International Conference on Machine Learning</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Martin Karafiát, Lukáš Burget, Jan Černocký, and
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <article-title>Recurrent neural network based language model</article-title>
          .
          <source>In 11th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>George</surname>
            <given-names>A</given-names>
          </string-name>
          . Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and
          <string-name>
            <given-names>Katherine J.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Introduction to wordnet: an on-line lexical database</article-title>
          .
          <source>International Journal of Lexicography</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona,
          <string-name>
            <surname>Pei-Hao</surname>
            <given-names>Su</given-names>
          </string-name>
          , David Vandyke,
          <string-name>
            <surname>Tsung-Hsien Wen</surname>
            , and
            <given-names>Steve</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Counter-fitting word vectors to linguistic constraints</article-title>
          .
          <source>In Proceedings of HLTNAACL</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Jeffrey Pennington, Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Proceedings of Empirical Methods in Natural Language Processing</source>
          , EMNLP,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Vu</surname>
            <given-names>Pham</given-names>
          </string-name>
          , Théodore Bluche, Christopher Kermorvant, and
          <string-name>
            <surname>J</surname>
          </string-name>
          érôme Louradour.
          <article-title>Dropout improves recurrent neural networks for handwriting recognition</article-title>
          .
          <source>In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Stanislau</surname>
            <given-names>Semeniuta</given-names>
          </string-name>
          , Aliaksei Severyn, and
          <string-name>
            <given-names>Erhardt</given-names>
            <surname>Barth</surname>
          </string-name>
          .
          <article-title>A hybrid convolutional variational autoencoder for text generation</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Nitish</surname>
            <given-names>Srivastava</given-names>
          </string-name>
          , Geoffrey
          <string-name>
            <given-names>E</given-names>
            <surname>Hinton</surname>
          </string-name>
          , Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Zhaopeng</surname>
            <given-names>Tu</given-names>
          </string-name>
          , Zhengdong Lu, Yang Liu, Xiaohua Liu, and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Modeling coverage for neural machine translation</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Weidi</surname>
            <given-names>Xu</given-names>
          </string-name>
          , Haoze Sun, Chao Deng, and
          <string-name>
            <given-names>Ying</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Variational autoencoder for semi-supervised text classification</article-title>
          .
          <source>In Thirty-First AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zichao</surname>
            <given-names>Yang</given-names>
          </string-name>
          , Zhiting Hu, Ruslan Salakhutdinov, and
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          Berg-Kirkpatrick.
          <article-title>Improved variational autoencoders for text modeling using dilated convolutions</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wojciech</surname>
            <given-names>Zaremba</given-names>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          .
          <article-title>Recurrent neural network regularization</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>