<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig-Maximilians-Universität München</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, training text-generating agents efficiently is still a considerable challenge. Commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, which are referred to as Tempered Policy Gradients (TPGs). These methods encourage exploration with different temperature control strategies. We derive three variations of the TPGs and show their superior performance on a recently published AI testbed, i.e., the GuessWhat?! game. On the testbed, we achieve significant improvements with two innovations. The first one is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures that leads to an improvement of 9%. The second one is the application of our newly developed TPG methods, which further improves the performance by around 5% and, even more importantly, helps produce more convincing utterances. TPG can easily be applied to any goal-oriented dialogue system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, deep learning has shown convincing
performance in various areas such as image recognition, speech
recognition, and natural language processing (NLP). Deep
neural nets are capable of learning complex dependencies
from huge amounts of data and their human-generated
annotations in a supervised way. In contrast, reinforcement
learning agents [Sutton and Barto, 1998] can learn directly from
their interactions with the environment without any
supervision and surpass human performance in several domains,
for instance in the game of Go [Silver et al., 2016], as well
as many computer games [Mnih et al., 2015]. In this paper
we are concerned with the application of both approaches to
goal-oriented dialogue systems [Bordes and Weston, 2017;
de Vries et al., 2017; Das et al., 2017; Strub et al., 2017;
Das et al., 2017; Lewis et al., 2017; Dhingra et al., 2016],
a problem that has recently caught the attention of machine
learning researchers. De Vries et al. [2017] proposed a
visually grounded object-guessing game called GuessWhat?!
as an AI testbed. Das et al. [2017] formulated a visual dialogue
system which is about two chatbots asking and answering
questions to identify a specific image within a group of
images. More practically, dialogue agents have been applied to
negotiate a deal [Lewis et al., 2017] and access certain
information from knowledge bases [Dhingra et al., 2016]. The
essential idea in these systems is to train different dialogue
agents to accomplish the tasks. In those papers, the agents
have been trained with policy gradients, i.e. REINFORCE
[Williams, 1992].</p>
      <p>In order to improve the exploration quality of policy
gradients, we present three instances of temperature-based
methods. The first one is a single-temperature approach which
is very easy to apply. The second one is a parallel approach
with multiple temperature policies running concurrently. This
second approach is more demanding on computational
resources, but results in more stable solutions. The third one
is a temperature policy approach that dynamically adjusts the
temperature for each action at each time-step, based on action
frequencies. This dynamic method is more sophisticated and
proves more efficient in the experiments, where all three
methods demonstrate better exploration strategies than
the plain policy gradient.</p>
      <p>We demonstrate our approaches using a real-world dataset
called GuessWhat?!. The GuessWhat?! game [de Vries et al.,
2017] is a visual object discovery game between two players,
the Oracle and the Questioner. The Questioner tries to
identify an object by asking the Oracle questions. The original
works [de Vries et al., 2017; Strub et al., 2017] first
proposed supervised learning to simulate and optimize the game.
Strub et al. [2017] showed that the performance could be
improved by applying plain policy gradient reinforcement
learning, which maximizes the game success rate, as a second
processing step. Building on these previous works, we propose
two network architecture extensions. We utilize a Seq2Seq
model [Sutskever et al., 2014] to process the image along
with the historical dialogues for question generation. For the
guessing task, we develop a Memory Network [Sukhbaatar et
al., 2015] with Attention Mechanism [Bahdanau et al., 2014]
to process the generated question-answer pairs. We first train
these two models using the plain policy gradient and use them
as our baselines. Subsequently, we train the models with
our new TPG methods and compare the performances with
the baselines. We show that the TPG method is compatible
with state-of-the-art architectures such as Seq2Seq and
Memory Networks and contributes orthogonally to these advanced
neural architectures. To the best of our knowledge, the
presented work is the first to propose temperature-based policy
gradient methods to leverage exploration and exploitation in
the field of goal-oriented dialogue systems. We demonstrate
the superior performance of our TPG methods by applying them
to the GuessWhat?! game. Our contributions are:</p>
      <list list-type="bullet">
        <list-item>
          <p>We introduce Tempered Policy Gradients in the context
of goal-oriented dialogue systems, a novel class of
approaches to temperature control to better leverage
exploration and exploitation during training.</p>
        </list-item>
        <list-item>
          <p>We extend the state-of-the-art solutions for the
GuessWhat?! game by integrating Seq2Seq and Memory
Networks. We show that TPGs are compatible with these
advanced models and further improve the performance.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In our notation, we use $x$ to denote the input to a policy
network $\pi$, and $x_i$ to denote the $i$-th element of the input vector.
Similarly, $w$ denotes the weight vector of $\pi$, and $w_i$ denotes
the $i$-th element of that weight vector. The output
$y$ is a multinoulli random variable with $N$ states that follows
a probability mass function, $f(y = n \mid \pi(x \mid w))$, where
$\sum_{n=1}^{N} f(y = n \mid \pi(x \mid w)) = 1$ and $f(\cdot) \geq 0$. In a nutshell, a
policy network parametrizes a probabilistic unit that produces
the sampled output, mathematically, $y \sim f(\pi(x \mid w))$.</p>
      <p>
        At this point, we have defined the policy neural net and
now discuss performance measures commonly used for
optimization. Typically, the expected value of the
accumulated reward, i.e., the return, conditioned on the policy network
parameters, $E(r \mid w)$, is used. Here, $E$ denotes the
expectation operator, $r$ the accumulated reward signal, and $w$
the network weight vector. The objective of reinforcement
learning is to update the weights in a way that maximizes
the expected return at each trial. In particular, the
REINFORCE updating rule is $\Delta w_i = \alpha_i (r - b_i) e_i$, where $\Delta w_i$
denotes the weight adjustment of weight $w_i$, $\alpha_i$ is a
nonnegative learning rate factor, and $b_i$ is a reinforcement
baseline. The $e_i$ is the characteristic eligibility of $w_i$, defined as
$e_i = (\partial f / \partial w_i) / f = \partial \ln f / \partial w_i$. Williams [
        <xref ref-type="bibr" rid="ref26">1992</xref>
        ] has proved
that the updating quantity, $(r - b_i)\, \partial \ln f / \partial w_i$, represents an
unbiased estimate of $\partial E(r \mid w) / \partial w_i$.
      </p>
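      <p>The REINFORCE rule can be illustrated with a minimal sketch. It assumes a toy policy whose weights are directly the logits of a softmax output unit, so that the eligibility is the one-hot vector of the sampled action minus the probabilities. The function names, the toy reward, the baseline and the learning rate are illustrative assumptions, not part of the original work.</p>
      <preformat><![CDATA[
```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability mass function f."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def eligibility(logits, action):
    """e_i = d ln f(y = action) / d w_i for a softmax unit: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, reward, baseline, lr):
    """Apply delta_w_i = lr * (reward - baseline) * e_i to every weight."""
    e = eligibility(logits, action)
    return [w + lr * (reward - baseline) * e_i for w, e_i in zip(logits, e)]


rng = random.Random(0)
weights = [0.0, 0.0, 0.0]  # toy assumption: the weights are the logits
probs = softmax(weights)
action = rng.choices(range(3), weights=probs)[0]  # y ~ f(pi(x | w))
new_weights = reinforce_step(weights, action, reward=1.0, baseline=0.5, lr=0.1)
```
]]></preformat>
      <p>Because the reward exceeds the baseline in this toy call, the update raises the probability of the sampled action; a reward below the baseline would lower it.</p>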
    </sec>
    <sec id="sec-3">
      <title>Tempered Policy Gradient</title>
      <p>In order to improve the exploration quality of REINFORCE
in the task of optimizing policy-based dialogue agents, we
attempt to find the optimal compromise between exploration
and exploitation. In TPGs we introduce a parameter $\tau$, the
sampling temperature of the probabilistic output unit, which
allows us to explicitly control the strength of the exploration.</p>
      <sec id="sec-3-1">
        <title>Exploration and Exploitation</title>
        <p>The trade-off between exploration and exploitation is one of
the great challenges in reinforcement learning [Sutton and
Barto, 1998]. To obtain a high reward, an agent must
exploit the actions that have already proved effective in
getting more rewards. However, to discover such actions, the
agent must try actions that appear suboptimal in order to explore
the action space. In a stochastic task like text generation,
each action, i.e., a word, must be tried many times to find
out whether it is a reliable choice or not. The
exploration-exploitation dilemma has been intensively studied over many
decades [Carmel and Markovitch, 1999; Nachum et al., 2016;
Liu et al., 2017]. Finding the balance between exploration
and exploitation is considered crucial for the success of
reinforcement learning [Thrun, 1992].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Temperature Sampling</title>
        <p>In text generation, it is well known that the simple trick
of temperature adjustment is sufficient to shift the language
model to be more conservative or more diversified [Karpathy
and Fei-Fei, 2015]. In order to control the trade-off between
exploration and exploitation, we borrow the strength of the
temperature parameter $\tau \geq 0$ to control the sampling. The
output probability of each word is transformed by a
temperature function as:
<disp-formula><tex-math>f_{\tau}(y = n \mid \pi(x \mid w)) = \frac{f(y = n \mid \pi(x \mid w))^{1/\tau}}{\sum_{m=1}^{N} f(y = m \mid \pi(x \mid w))^{1/\tau}}.</tex-math></disp-formula>
We use the notation $f_{\tau}$ to denote a probability mass function $f$
that is transformed by a temperature function with temperature
$\tau$. When the temperature is high, $\tau &gt; 1$, the distribution
becomes more uniform; when the temperature is low, $\tau &lt; 1$,
the distribution appears more spiky. TPG is defined as an
extended algorithm of the Monte Carlo Policy Gradient
approach. We use a higher temperature, $\tau &gt; 1$, to encourage
the model to explore in the action space, and conversely, a
lower temperature, $\tau &lt; 1$, to encourage exploitation. In the
extreme case, as $\tau \to 0$, we obtain greedy search.</p>
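        <p>The temperature transform can be sketched in a few lines. This is a minimal illustration, assuming the word distribution is given as a plain list of probabilities; the function name temper and the example distribution are hypothetical.</p>
        <preformat><![CDATA[
```python
def temper(probs, tau):
    """Return f_tau: each probability raised to 1 / tau, then renormalised."""
    if tau <= 0:
        # tau -> 0 approaches greedy search but cannot be evaluated directly.
        raise ValueError("tau must be positive")
    powered = [p ** (1.0 / tau) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


probs = [0.7, 0.2, 0.1]
hot = temper(probs, 1.5)   # tau > 1: flatter, encourages exploration
cold = temper(probs, 0.5)  # tau < 1: spikier, encourages exploitation
```
]]></preformat>
        <p>With a temperature of 1 the distribution is returned unchanged, which recovers plain sampling from the policy.</p>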
      </sec>
      <sec id="sec-3-3">
        <title>Tempered Policy Gradient Methods</title>
        <p>Here, we introduce three instances of TPGs in the domain
of goal-oriented dialogues, including single, parallel, and
dynamic tempered policy gradient methods.</p>
        <p>Single-TPG: The Single-TPG method simply uses a global
temperature $\tau_{\mathrm{global}}$ during the whole training process, i.e., we
use $\tau_{\mathrm{global}} &gt; 1$ to encourage exploration. The forward pass
is represented mathematically as $y^{\tau_{\mathrm{global}}} \sim f_{\tau_{\mathrm{global}}}(\pi(x \mid w))$,
where $\pi(x \mid w)$ represents a policy neural network that
parametrizes a distribution $f_{\tau_{\mathrm{global}}}$ over the vocabulary, and
$y^{\tau_{\mathrm{global}}}$ means the word sampled from this tempered word
distribution. After sampling, the weight of the neural net is
updated using
<disp-formula><tex-math>\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y^{\tau_{\mathrm{global}}} \mid \pi(x \mid w)) / \partial w_i.</tex-math></disp-formula>
Noteworthy is that the actual gradient, $\partial \ln f(y^{\tau_{\mathrm{global}}} \mid \pi(x \mid w)) / \partial w_i$,
depends on the sampled word, $y^{\tau_{\mathrm{global}}}$, but
does not depend directly on the temperature, $\tau$. We prefer
to find the words that lead to a reward, so that the model
can learn quickly from these actions; otherwise, the neural
network only learns to avoid current failure actions. With
Single-TPG and $\tau &gt; 1$, the entire vocabulary of a dialogue
agent is explored more efficiently than by REINFORCE,
because nonpreferred words have a higher probability of
being explored. This Single-TPG method is very easy to use
and could yield a performance improvement after training
because the goal-oriented dialogue optimization could
benefit from increased exploration. The temperature is initialized
with $\tau = 1$, then fine-tuned based on the learning curve on
the validation sets, and subsequently left fixed.</p>
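        <p>A Single-TPG step can be sketched as follows: the action is sampled from the tempered distribution, while the gradient is taken with respect to the log-probability of the untempered policy. The sketch assumes a toy softmax policy whose weights are its logits and a toy reward that favours one action; all names and constants are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temper(probs, tau):
    powered = [p ** (1.0 / tau) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def single_tpg_step(logits, reward_fn, baseline, lr, tau_global, rng):
    probs = softmax(logits)
    # Forward pass: sample the action from the tempered distribution f_tau.
    action = rng.choices(range(len(probs)), weights=temper(probs, tau_global))[0]
    reward = reward_fn(action)
    # Backward pass: gradient of ln f (untempered policy) at the sampled action.
    grad = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [w + lr * (reward - baseline) * g for w, g in zip(logits, grad)]


rng = random.Random(0)
logits = [2.0, 0.0, 0.0]  # toy policy that initially prefers action 0
initial_p2 = softmax(logits)[2]
for _ in range(200):
    # Toy reward (assumption): only action 2 leads to success.
    logits = single_tpg_step(logits, lambda a: 1.0 if a == 2 else 0.0,
                             baseline=0.0, lr=0.5, tau_global=1.5, rng=rng)
final_p2 = softmax(logits)[2]
```
]]></preformat>
        <p>In this toy setting the rewarded but initially unlikely action is sampled more often under the tempered distribution than under the original one, so its probability starts to grow earlier.</p>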
        <p>Parallel-TPG: A more advanced version of Single-TPG
is the Parallel-TPG that deploys several Single-TPGs
concurrently with different temperatures, $\tau_1, \ldots, \tau_n$, and updates
the weights based on all generated samples. During the
forward pass, multiple copies of the neural nets parameterize
multiple word distributions. The words are sampled in
parallel at various temperatures, mathematically,
$y^{\tau_1}, \ldots, y^{\tau_n} \sim f_{\tau_1, \ldots, \tau_n}(\pi(x \mid w))$. After sampling, in the backward pass
the weights are updated with the sum of gradients. The
formula is given by
<disp-formula><tex-math>\Delta w_i = \sum_{k} \alpha_i (r - b_i)\, \partial \ln f(y^{\tau_k} \mid \pi(x \mid w)) / \partial w_i,</tex-math></disp-formula>
where $k \in \{1, \ldots, n\}$. The combined use of higher and
lower temperatures ensures both exploration and exploitation
at the same time. The sum over weight updates of
parallel policies gives a more accurate Monte Carlo estimate of
$\partial E(r \mid w) / \partial w_i$, due to the nature of Monte Carlo methods
[Robert, 2004]. Thus, compared to Single-TPG, we would
argue that Parallel-TPG is more robust and stable, although
Parallel-TPG needs more computational power. However,
these computations can be easily distributed in a parallel
fashion using state-of-the-art graphics processing units.</p>
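        <p>The summed update of Parallel-TPG can be sketched as below. The sketch assumes that one action has already been sampled per temperature and that each sample comes with its own reward; sampling itself is omitted. The function name and the toy numbers are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_tpg_update(logits, samples, baseline, lr):
    """samples holds one (action, reward) pair per temperature tau_k."""
    probs = softmax(logits)
    total_grad = [0.0] * len(logits)
    for action, reward in samples:
        for i, p in enumerate(probs):
            indicator = 1.0 if i == action else 0.0
            # Accumulate (r - b) * d ln f(y^{tau_k}) / d w_i over all samples.
            total_grad[i] += (reward - baseline) * (indicator - p)
    return [w + lr * g for w, g in zip(logits, total_grad)]


# Two temperatures: the first sample picked action 0 and succeeded,
# the second picked action 1 and failed.
new_logits = parallel_tpg_update([0.0, 0.0], samples=[(0, 1.0), (1, 0.0)],
                                 baseline=0.5, lr=1.0)
```
]]></preformat>
        <p>Both samples push the weights in the same direction here: the success reinforces action 0 and the failure discourages action 1.</p>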
        <p>Dynamic-TPG: As a third variant, we introduce the
Dynamic-TPG, which is the most sophisticated approach in
the current TPG family. The essential idea is that we use a
heuristic function $h$ to assign the temperature to the word
distribution at each time step, $t$. The temperature is bounded
in a predefined range $[\tau_{\min}, \tau_{\max}]$. The heuristic function
we used here is based upon the term frequency inverse
document frequency, tf-idf [Leskovec et al., 2014]. In the
context of goal-oriented dialogues, we use the counted
number of each word as term frequency tf and the total
number of generated dialogues during training as document
frequency df. We use the word that has the highest
probability to be sampled at the current time step, $y^*_t$, as the input to
the heuristic function $h$. Here, $y^*_t$ is the maximizer of the
probability mass function $f$. Mathematically, it is defined as
$y^*_t = \operatorname{argmax}(f(\pi(x \mid w)))$. We propose that $\text{tf-idf}(y^*_t)$
approximates the concentration level of the distribution, which
means that if the same word is always sampled from a
distribution, then the distribution is very concentrated. Too much
concentration prevents the model from exploration, so that a
higher temperature is needed. In order to achieve this effect,
the heuristic function is defined as
<disp-formula><tex-math>\tau^h_t = h(\text{tf-idf}(y^*_t)) = \tau_{\min} + (\tau_{\max} - \tau_{\min}) \frac{\text{tf-idf}(y^*_t) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}}.</tex-math></disp-formula>
With this heuristic, words that occur very often are depressed
by applying a higher temperature to those words, making
them less likely to be selected in the near future. In the
forward pass, a word is sampled using $y^{\tau^h_t}_t \sim f_{\tau^h_t}(\pi(x \mid w))$. In
the backward pass, the weights are updated correspondingly,
using
<disp-formula><tex-math>\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y^{\tau^h_t}_t \mid \pi(x \mid w)) / \partial w_i,</tex-math></disp-formula>
where $\tau^h_t$ is the temperature calculated from the
heuristic function. Compared to Parallel-TPG, the advantage of
Dynamic-TPG is that it assigns temperature more
appropriately, without increasing the computational load.</p>
        <p>[Figure 1: Oracle model (LSTM question encoder for "Is it a person?"; spatial and category inputs; MLP; outputs No, Yes, N.A.).]</p>
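        <p>The temperature heuristic can be sketched as a bounded linear map from a tf-idf score to a temperature. This is a minimal sketch, assuming that the heuristic interpolates linearly between the minimum and maximum temperature; the default bounds (0.5, 1.5, 0 and 8) are the settings reported for the experiments, whereas the clipping of out-of-range scores and the function name are assumptions. How the tf-idf score itself is computed is left to the caller.</p>
        <preformat><![CDATA[
```python
def dynamic_temperature(score, tau_min=0.5, tau_max=1.5,
                        score_min=0.0, score_max=8.0):
    """Map the tf-idf score of the most likely word to a temperature."""
    # Assumption: clip the score so the temperature stays in [tau_min, tau_max].
    clipped = min(max(score, score_min), score_max)
    fraction = (clipped - score_min) / (score_max - score_min)
    return tau_min + (tau_max - tau_min) * fraction


rare_word_tau = dynamic_temperature(0.0)      # low score: exploit
frequent_word_tau = dynamic_temperature(8.0)  # high score: explore
```
]]></preformat>
        <p>A score in the middle of the range yields a temperature of 1, which corresponds to sampling from the unmodified policy.</p>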
      </sec>
    </sec>
    <sec id="sec-4">
      <title>GuessWhat?! Game</title>
      <p>We evaluate our concepts using a recent testbed for AI, called
the GuessWhat?! game [de Vries et al., 2017], available at
https://guesswhat.ai. The dataset consists of 155k
dialogues, including 822k question-answer pairs with a
vocabulary of around 5k words, about 67k images [Lin et al.,
2014] and 134k objects. The game is about visual object
discovery through multi-round question answering among different players.</p>
      <p>Formally, a GuessWhat?! game is represented by a tuple
$(I, D, O, o^*)$, where $I \in \mathbb{R}^{H \times W}$ denotes an image of height
$H$ and width $W$; $D$ represents a dialogue composed of $M$
rounds of question-answer pairs (QAs), $D = (q_m, a_m)_{m=1}^{M}$;
$O$ stands for a list of $K$ objects, $O = (o_k)_{k=1}^{K}$; and $o^*$ is
the target object. Each question is a sequence of words,
$q_m = \{y_{m,1}, \ldots, y_{m,N_m}\}$, with length $N_m$. The words are
taken from a defined vocabulary $V$, which consists of the
words and a start token and an end token. Each answer is
either yes, no, or not applicable, i.e., $a_m \in \{\text{yes}, \text{no}, \text{n.a.}\}$.
For each object $o_k$, there is a corresponding object
category $c_k \in \{1, \ldots, C\}$ and a pixel-wise segmentation mask
$S_k \in \{0, 1\}^{H \times W}$. Finally, we use colon notation (:) to select
a subset of a sequence, for instance, $(q, a)_{1:m}$ refers to the
first $m$ rounds of QAs in a dialogue.</p>
      <sec id="sec-4-1">
        <title>Models and Pretraining</title>
        <p>Following [Strub et al., 2017], we first train all three models
in a supervised fashion.</p>
        <p>Oracle: The task of the Oracle is to answer questions
regarding the target object. We outline here the
simple neural network architecture that achieved the best
performance in the study of [de Vries et al., 2017], and which
we also used in our experiments. The input information
used here is of three modalities, namely the question $q$, the
spatial information $x_{\mathrm{spatial}}$ and the category $c^*$ of the
target object. For encoding the question, de Vries et al. first
use a lookup table to learn the embedding of words, then
use a one-layer long short-term memory (LSTM)
[Hochreiter and Schmidhuber, 1997] to encode the whole
question. For spatial information, de Vries et al. extract an
8-dimensional vector of the location of the bounding box,
$[x_{\min}, y_{\min}, x_{\max}, y_{\max}, x_{\mathrm{center}}, y_{\mathrm{center}}, w_{\mathrm{box}}, h_{\mathrm{box}}]$,
where $x$, $y$ denote the coordinates and $w_{\mathrm{box}}$, $h_{\mathrm{box}}$ denote
the width and height of the bounding box, respectively. De
Vries et al. normalize the image width and height so that the
coordinates range from -1 to 1. The origin is at the image
center. The category embedding of the object is also learned
with a lookup table during training. At the last step, de Vries
et al. concatenate all three embeddings into one feature
vector and feed it into a one-hidden-layer multilayer perceptron
(MLP). The softmax output layer predicts the distribution,
$\mathrm{Oracle} := p(a \mid q, c^*, x_{\mathrm{spatial}})$, over the three classes,
including no, yes, and not applicable. The model is trained
using the negative log-likelihood criterion. The Oracle
structure is shown in Fig. 1.</p>
        <p>Question-Generator: The goal of the Question-Generator
(QGen) is to ask the Oracle meaningful questions, $q_{m+1}$,
given the whole image, $I$, and the historical question-answer
pairs, $(q, a)_{1:m}$. In previous work [Strub et al., 2017], the
state transition function was modelled as an LSTM, which
was trained using whole dialogues so that the model
memorizes the historical QAs. We refer to this as dialogue-level
training. We develop a novel QGen architecture using a
modified version of the Seq2Seq model [Sutskever et al., 2014].
The modified Seq2Seq model enables question-level training,
which means that the model is fed with historical QAs, and
then learns to produce a new question. Following [Strub et
al., 2017], we first encode the whole image into a fixed-size
feature vector using the VGG-net [Simonyan and Zisserman,
2014]. The features come from the fc-8 layer of the
VGG-net. For processing historical QAs, we use a lookup table to
learn the word embeddings, then again use an LSTM encoder
to encode the history information into a fixed-size latent
representation, and concatenate it with the image representation:
<disp-formula><tex-math>s^{\mathrm{enc}}_{m,N_m} = \mathrm{encoder}(\mathrm{LSTM}((q, a)_{1:m}), \mathrm{VGG}(I)).</tex-math></disp-formula>
The encoder and decoder are coupled by initializing the
decoder state with the last encoder state, mathematically,
$s^{\mathrm{dec}}_{m+1,0} = s^{\mathrm{enc}}_{m,N_m}$. The LSTM decoder generates each word
based on the concatenated representation and the previously
generated word (note the first word is a start token):
<disp-formula><tex-math>y_{m+1,n} = \mathrm{decoder}(\mathrm{LSTM}(y_{m+1,n-1}, s^{\mathrm{dec}}_{m+1,n-1})).</tex-math></disp-formula>
The decoder shares the same lookup table weights as the
encoder. The Seq2Seq model, consisting of the encoder and
the decoder, is trained end-to-end to minimize the negative
log-likelihood cost. During testing, the decoder gets a start
token and the representation from the encoder, and then
generates each word at each time step until it encounters a
question mark token, $\mathrm{QGen} := p(y_{m+1,n} \mid (q, a)_{1:m}, I)$. The
output is a complete question. After several question-answer
rounds, the QGen outputs an end-of-dialogue token, and stops
asking questions. The overall structure of the QGen model is
illustrated in Fig. 2.</p>
        <p>[Figure 2: QGen model (image CNN; history LSTM encoding "Is it a person? Yes"; LSTM decoder generating "Is this person ...").]</p>
        <p>Guesser: The goal of the Guesser model is to find out
which object the Oracle model is referring to, given the
complete history of the dialogue and a list of objects in the
image, $p(o^* \mid (q, a)_{1:M}, x^{O}_{\mathrm{spatial}}, c^{O})$. The Guesser model has
access to the spatial, $x^{O}_{\mathrm{spatial}}$, and category information, $c^{O}$,
of the objects in the list. The task of the Guesser model is
challenging because it needs to understand the dialogue and
to focus on the important content, and then guess the object.
To accomplish this task, we decided to integrate the
Memory [Sukhbaatar et al., 2015] and Attention [Bahdanau et al.,
2014] modules into the Guesser architecture used in the
previous work [Strub et al., 2017]. First, we use an LSTM header
to process the varying lengths of question-answer pairs in
parallel into multiple fixed-size vectors. Here, each QA-pair has
been encoded into some facts, $\mathrm{Fact}_m = \mathrm{LSTM}((q, a)_m)$,
and stored into a memory base. Later, we use the sum of
the spatial and category embeddings of all objects as a key,
$\mathrm{Key}_1 = \mathrm{MLP}(x^{O}_{\mathrm{spatial}}, c^{O})$, to query the memory and
calculate an attention mask, $\mathrm{Attention}_1(\mathrm{Fact}_m) = \mathrm{Fact}_m \cdot \mathrm{Key}_1$,
over each fact. Next, we use the sum of attended facts and the
first key to calculate the second key. Further, we use the
second key to query the memory base again to have a more
accurate attention. These are the so-called "two hops" of attention
in the literature [Sukhbaatar et al., 2015]. Finally, we
compare the attended facts with each object embedding in the list
using a dot product. The most similar object to these facts is
the prediction, $\mathrm{Guesser} := p(o^* \mid (q, a)_{1:M}, x^{O}_{\mathrm{spatial}}, c^{O})$.
The intention of using the attention module here is to find out
the most relevant descriptions or facts concerning the
candidate objects. We train the whole Guesser network end-to-end
using the negative log-likelihood criterion. A more graphical
description of the Guesser model is shown in Fig. 3.</p>
        <p>[Figure 3: Guesser model (objects; memory; attention mechanism; weighted sum; softmax; object).]</p>
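        <p>The two hops of attention in the Guesser can be sketched with plain lists. This is a minimal sketch under several assumptions: the fact vectors, the first key and the object embeddings are given as inputs (the LSTM and MLP that produce them are omitted), the attention scores are normalised with a softmax, and the second key is the plain sum of the attended facts and the first key. All names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(facts, key):
    """Weight each fact by softmax(fact . key) and return the weighted sum."""
    weights = softmax([dot(fact, key) for fact in facts])
    return [sum(w * fact[d] for w, fact in zip(weights, facts))
            for d in range(len(key))]


def two_hop_guess(facts, object_embeddings, key1):
    read1 = attend(facts, key1)                  # first hop
    key2 = [r + k for r, k in zip(read1, key1)]  # second key (assumed plain sum)
    read2 = attend(facts, key2)                  # second hop
    # Compare the attended facts with every object embedding by dot product.
    return softmax([dot(obj, read2) for obj in object_embeddings])


facts = [[1.0, 0.0], [0.0, 1.0]]    # toy encodings of two QA pairs
objects = [[1.0, 0.0], [0.0, 1.0]]  # toy object embeddings
guess_probs = two_hop_guess(facts, objects, key1=[2.0, 0.0])
```
]]></preformat>
        <p>In this toy call the key points towards the first fact, so the first object receives the higher probability.</p>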
      </sec>
      <sec id="sec-4-2">
        <title>Reinforcement Learning</title>
        <p>Now, we post-train the QGen and the Guesser model with
reinforcement learning. We keep the Oracle model fixed. In
each game episode, when the models find the correct object,
r = 1, otherwise, r = 0.</p>
        <p>Next, we can assign credits for each action of the QGen
and the Guesser models. In the case of the QGen model, we
spread the reward uniformly over the sequence of actions in
the episode. The baseline function, b, used here is the running
average of the game success rate. Consider that the Guesser
model has only one action in each episode, i.e., taking the
guess. If the Guesser finds the correct object, then it gets an
immediate reward and the Guesser’s parameters are updated
using the REINFORCE rule without baseline. The QGen is
trained using the following four methods.</p>
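        <p>The credit assignment and the baseline can be sketched as follows. The sketch assumes that spreading the reward uniformly means that every word of the episode is updated with the same quantity, reward minus baseline, and that the running average is the cumulative mean of all episode outcomes seen so far; both readings, and all names, are assumptions.</p>
        <preformat><![CDATA[
```python
class RunningAverage:
    """Cumulative mean of the game outcomes, used as the baseline b."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, outcome):
        self.count += 1
        self.value += (outcome - self.value) / self.count
        return self.value


def per_action_advantages(num_actions, reward, baseline):
    """Give every action of the episode the same credit, r - b."""
    return [reward - baseline] * num_actions


baseline = RunningAverage()
for outcome in [1.0, 0.0, 1.0, 1.0]:  # r = 1 for a success, r = 0 otherwise
    baseline.update(outcome)

advantages = per_action_advantages(num_actions=3, reward=1.0,
                                   baseline=baseline.value)
```
]]></preformat>
        <p>After three successes in four games the baseline is 0.75, so each word of a successful episode is reinforced with a weight of 0.25.</p>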
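        <p>REINFORCE: The baseline method used here is
REINFORCE [Williams, 1992]. During training, in the
forward pass the words are sampled with $\tau = 1$,
$y_{m+1,n} \sim f(\mathrm{QGen}(x \mid w))$. In the backward pass, the weights are
updated using REINFORCE, as
<disp-formula><tex-math>w = w + \alpha (r - b) \nabla_w \ln f(y_{m+1,n} \mid \mathrm{QGen}(x \mid w)).</tex-math></disp-formula></p>
        <p>Single-TPG: We use temperature $\tau_{\mathrm{global}} = 1.5$ during
training to encourage exploration, mathematically,
$y^{\tau_{\mathrm{global}}}_{m+1,n} \sim f_{\tau_{\mathrm{global}}}(\mathrm{QGen}(x \mid w))$. In the backward pass, the weights
are updated using
<disp-formula><tex-math>w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau_{\mathrm{global}}}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).</tex-math></disp-formula></p>
        <p>Parallel-TPG: For Parallel-TPG, we use two
temperatures, $\tau_1 = 1.0$ and $\tau_2 = 1.5$, to encourage
exploration. The words are sampled in the forward pass using
$y^{\tau_1}_{m+1,n}, y^{\tau_2}_{m+1,n} \sim f_{\tau_1, \tau_2}(\mathrm{QGen}(x \mid w))$. In the backward
pass, the weights are updated using
<disp-formula><tex-math>w = w + \sum_{k=1}^{2} \alpha (r - b) \nabla_w \ln f(y^{\tau_k}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).</tex-math></disp-formula></p>
        <p>Dynamic-TPG: The last method we evaluated is
Dynamic-TPG. We use a heuristic function to calculate the temperature
for each word at each time step:
<disp-formula><tex-math>\tau^h_{m+1,n} = \tau_{\min} + (\tau_{\max} - \tau_{\min}) \frac{\text{tf-idf}(y^*_{m+1,n}) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}},</tex-math></disp-formula>
where we set $\tau_{\min} = 0.5$, $\tau_{\max} = 1.5$, and set $\text{tf-idf}_{\min} = 0$,
$\text{tf-idf}_{\max} = 8$. After the calculation of $\tau^h_{m+1,n}$, we substitute
the value into the formula at each time step and sample the
next word using
<disp-formula><tex-math>y^{\tau^h_{m+1,n}}_{m+1,n} \sim f_{\tau^h_{m+1,n}}(\mathrm{QGen}(x \mid w)).</tex-math></disp-formula>
In the backward pass, the weights are updated using
<disp-formula><tex-math>w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau^h_{m+1,n}}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).</tex-math></disp-formula></p>
        <p>For all four methods, we use greedy search in evaluation.</p>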
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiment</title>
      <p>We first train all the networks in a supervised fashion,
and then further optimize the QGen and the Guesser
model using reinforcement learning. The source code
is available at https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient,
which uses Torch7 [Collobert et al., 2011].</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>#</th><th>Method</th><th>Accuracy</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>[Strub et al., 2017]</td><td>52.30%</td></tr>
            <tr><td>2</td><td>[Strub and de Vries, 2017]</td><td>60.30%</td></tr>
            <tr><td>3</td><td>REINFORCE</td><td>69.66%</td></tr>
            <tr><td>4</td><td>Single-TPG</td><td>69.76%</td></tr>
            <tr><td>5</td><td>Parallel-TPG</td><td>73.86%</td></tr>
            <tr><td>6</td><td>Dynamic-TPG</td><td>74.31%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We train all three models using 0.5 dropout [Srivastava et al.,
2014] and the ADAM optimizer [Kingma
and Ba, 2014]. We use a learning rate of 0.0001 for the
Oracle model and the Guesser model, and a learning rate of
0.001 for QGen. All the models are trained for at most 30
epochs, with early stopping after five epochs without
improvement on the validation set. We report the performance on the
test set, which consists of images not used in training. We
report the game success rate as the performance metric for
all three models, which equals the number of succeeded
games divided by the total number of games. Compared
to previous works [de Vries et al., 2017; Strub et al., 2017;
Strub and de Vries, 2017], after supervised training, our
models obtain a game success rate of 48.77%, which is about 4
percentage points higher than the state-of-the-art method
[Strub and de Vries, 2017], which has 44.6% accuracy.</p>
      <p>We first initialize all models with pre-trained parameters from
supervised learning and then post-train the QGen using either
REINFORCE or TPG for 80 epochs. We update the
parameters using stochastic gradient descent (SGD) with a learning
rate of 0.001 and a batch size of 64. In each epoch, we
sample each image in the training set once and randomly pick
one of the objects as a target. We track the running average of
the game success rate and use it directly as the baseline, $b$, in
REINFORCE. We limit the maximum number of questions
to 8 and the maximum number of words to 12.
Simultaneously, we train the Guesser model using REINFORCE
without baseline and using SGD with a learning rate of 0.0001.
The performance comparison of our baseline (# 3) with
methods from literature (# 1 &amp; # 2) is shown in Tab. 1.</p>
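      <p>REINFORCE Baseline: From Tab. 1, we see that our
models trained with REINFORCE (# 3) are about 9% better
than the state-of-the-art methods (# 1 &amp; # 2). The
improvements are due to using advanced mechanisms and techniques
such as the Seq2Seq structure in the QGen, the memory and
attention mechanisms in the Guesser, and the training of the
Guesser model with reinforcement learning. One important
difference is that our QGen model is trained at the question level.
This means that the model first learns to query meaningfully,
step by step. Eventually, it learns to conduct a meaningful
dialog. Compared to directly learning to manage a
strategic conversation, this bottom-up training procedure helps the
model absorb knowledge, because it breaks large tasks down
into smaller, more manageable pieces. This makes the
learning for QGen much easier. In the remainder of the section,
we use our models, boosted with memory network, attention,
and Seq2Seq, trained with REINFORCE as a strong baseline
and analyse the performance improvements achieved by the
TPGs.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Task</th><th>REINFORCE question</th><th>Answer</th><th>TPG question</th><th>Answer</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Is it in left?</td><td>No</td><td>Is it a person?</td><td>No</td></tr>
            <tr><td>1</td><td>Is it in front?</td><td>No</td><td>Is it a vehicle?</td><td>Yes</td></tr>
            <tr><td>1</td><td>Is it in right?</td><td>Yes</td><td>Is it a truck?</td><td>Yes</td></tr>
            <tr><td>1</td><td>Is it in middle?</td><td>Yes</td><td>Is it in front of photo?</td><td>No</td></tr>
            <tr><td>1</td><td>Is it person?</td><td>No</td><td>In the left half?</td><td>No</td></tr>
            <tr><td>1</td><td>Is it ball?</td><td>No</td><td>In the middle of photo?</td><td>Yes</td></tr>
            <tr><td>1</td><td>Is it bat?</td><td>No</td><td>Is it to the right photo?</td><td>Yes</td></tr>
            <tr><td>1</td><td>Is it car?</td><td>Yes</td><td>Is it in the middle of photo?</td><td>Yes</td></tr>
            <tr><td>1</td><td>Status:</td><td>Failure</td><td>Status:</td><td>Success</td></tr>
            <tr><td>2</td><td>Is it in left?</td><td>No</td><td>Is it a giraffe?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Is it in front?</td><td>Yes</td><td>In front of photo?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Is it in right?</td><td>No</td><td>In the left half?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Is it in middle?</td><td>Yes</td><td>Is it in the middle of photo?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Is it person?</td><td>No</td><td>Is it to the left of photo?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Is it giraffe?</td><td>Yes</td><td>Is it to the right photo?</td><td>No</td></tr>
            <tr><td>2</td><td>Is in middle?</td><td>Yes</td><td>In the left in photo?</td><td>No</td></tr>
            <tr><td>2</td><td>Is in middle?</td><td>Yes</td><td>In the middle of photo?</td><td>Yes</td></tr>
            <tr><td>2</td><td>Status:</td><td>Failure</td><td>Status:</td><td>Success</td></tr>
          </tbody>
        </table>
      </table-wrap>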
      <p>From Tab. 1, we see that compared to REINFORCE (# 3),
Single-TPG (# 4) with $\tau_{\mathrm{global}} = 1.5$ achieves a comparable
performance. With two different temperatures, $\tau_1 = 1.0$ and
$\tau_2 = 1.5$, Parallel-TPG (# 5) achieves an improvement of
approximately 4%. Parallel-TPG requires more computational
resources. Compared to Parallel-TPG, Dynamic-TPG only
uses the same computational power as REINFORCE does and
still gives a larger improvement by using a dynamic
temperature, $\tau^h_t \in [0.5, 1.5]$. After comparison, we can see that the
best model is Dynamic-TPG (# 6), which gives a 4.65%
improvement upon our strong baseline. Here, we have shown
that our proposed methods contribute orthogonally, in the
sense that they further improve the models already boosted
with advanced modules such as memory network, attention,
and Seq2Seq.</p>
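      <p>The improvements quoted above are absolute differences in game success rate, which a short calculation makes explicit. The accuracies are the values reported in Tab. 1; the dictionary and the helper name gain are illustrative.</p>
      <preformat><![CDATA[
```python
# Game success rates in percent, as reported in Tab. 1.
accuracy = {
    "Strub et al., 2017": 52.30,
    "Strub and de Vries, 2017": 60.30,
    "REINFORCE": 69.66,
    "Single-TPG": 69.76,
    "Parallel-TPG": 73.86,
    "Dynamic-TPG": 74.31,
}


def gain(method, reference):
    """Absolute improvement of method over reference, in percentage points."""
    return round(accuracy[method] - accuracy[reference], 2)


tpg_gains = {name: gain(name, "REINFORCE")
             for name in ("Single-TPG", "Parallel-TPG", "Dynamic-TPG")}
baseline_gain = gain("REINFORCE", "Strub and de Vries, 2017")
```
]]></preformat>
      <p>The gains of the three TPG variants over the REINFORCE baseline are 0.10, 4.20 and 4.65 percentage points, and the baseline itself is 9.36 points above the best previously reported result.</p>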
      <p>TPG Dialogue Samples: The generated dialogue samples
in Tab. 2 can give some interesting insights in explaining why
TPG methods give a better result. First of all, the sentences
generated from TPG-trained models are on average longer
and use slightly more complex structures, such as “Is it in
the middle of photo?” instead of a simple form “Is it in
middle?”. Secondly, TPGs enable the models to explore better
and comprehend more words. For example, in the first task
(upper half of Tab. 2), both models ask about the category.
The REINFORCE-trained model can only ask with the
single word “car” to query about the vehicle category. In
contrast, the TPG-trained model can first ask a more general
category with the word “vehicle” and follows up querying with
a more specific category “trucks”. These two words
“vehicle” and “trucks” give much more information than the single
word “car”, and help the Guesser model identify the truck
among many cars. Lastly, similar to the category case, the
models trained with TPG can first ask about a larger spatial range
of the object and then follow up with a smaller range. In the
second task (lower half of Tab. 2), the TPG-trained
model first asks “In the left half?”, which refers to all
three giraffes in the left half, and the answer is “Yes”. Then
it asks “Is it to the left of photo?”, which refers to the
second giraffe from the left, and the answer is “Yes”. Finally, the
QGen asks “In the left in photo?”, which refers to the leftmost
giraffe, and the answer is “No”. Such specific
questions about location are not learned with REINFORCE: the
REINFORCE-trained model can only ask a similar question
with the word “left”. In this task, there are many giraffes in
the left part of the image, so the top-down spatial questions
help the Guesser model find the correct giraffe. To
summarize, the TPG-trained models use longer and more
informative sentences than the REINFORCE-trained models.
</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Our paper makes two contributions. First, by extending
existing models with Seq2Seq and Memory Networks, we
improved the performance of a goal-oriented dialogue
system by 9%. Second, we introduced TPG, a novel class of
temperature-based policy gradient approaches. TPGs boosted
the performance of the goal-oriented dialogue system by
another 4.7%. Among the three TPGs, Dynamic-TPG gave the
best performance; it helped the agent comprehend more
words and produce more meaningful questions. TPG is a
generic strategy to encourage word exploration on top of
policy gradients and can be applied to any text-generating agent.</p>
    </sec>
  </body>
  <back>
  </back>
</article>