Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient

Rui Zhao and Volker Tresp
Siemens AG, Corporate Technology, Munich, Germany
Ludwig-Maximilians-Universität München, Munich, Germany
{ruizhao, volker.tresp}@siemens.com

Abstract

Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, training text-generating agents efficiently is still a considerable challenge. Commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, which are referred to as Tempered Policy Gradients (TPGs). These methods encourage exploration with different temperature control strategies. We derive three variations of the TPGs and show their superior performance on a recently published AI testbed, the GuessWhat?! game. On this testbed, we achieve significant improvements with two innovations. The first is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures that leads to an improvement of 9%. The second is the application of our newly developed TPG methods, which improves the performance by an additional 5% and, even more importantly, helps produce more convincing utterances. TPG can easily be applied to any goal-oriented dialogue system.

1 Introduction

In recent years, deep learning has shown convincing performance in various areas such as image recognition, speech recognition, and natural language processing (NLP). Deep neural nets are capable of learning complex dependencies from huge amounts of data and its human-generated annotations in a supervised way. In contrast, reinforcement learning agents [Sutton and Barto, 1998] can learn directly from their interactions with the environment without any supervision and surpass human performance in several domains, for instance in the game of Go [Silver et al., 2016], as well as in many computer games [Mnih et al., 2015]. In this paper we are concerned with the application of both approaches to goal-oriented dialogue systems [Bordes and Weston, 2017; de Vries et al., 2017; Das et al., 2017; Strub et al., 2017; Lewis et al., 2017; Dhingra et al., 2016], a problem that has recently caught the attention of machine learning researchers. De Vries et al. [2017] have proposed as AI testbed a visually grounded object-guessing game called GuessWhat?!. Das et al. [2017] formulated a visual dialogue system in which two chatbots ask and answer questions to identify a specific image within a group of images. More practically, dialogue agents have been applied to negotiate a deal [Lewis et al., 2017] and to access information from knowledge bases [Dhingra et al., 2016]. The essential idea in these systems is to train different dialogue agents to accomplish the tasks. In those papers, the agents have been trained with policy gradients, i.e., REINFORCE [Williams, 1992].

In order to improve the exploration quality of policy gradients, we present three instances of temperature-based methods. The first is a single-temperature approach which is very easy to apply. The second is a parallel approach with multiple temperature policies running concurrently. This second approach is more demanding on computational resources, but results in more stable solutions. The third is a temperature policy approach that dynamically adjusts the temperature for each action at each time step, based on action frequencies. This dynamic method is more sophisticated and proves more efficient in the experiments. In the experiments, all these methods demonstrate better exploration strategies in comparison to the plain policy gradient.
We demonstrate our approaches using a real-world dataset called GuessWhat?!. The GuessWhat?! game [de Vries et al., 2017] is a visual object discovery game between two players, the Oracle and the Questioner. The Questioner tries to identify an object by asking the Oracle questions. The original works [de Vries et al., 2017; Strub et al., 2017] first proposed supervised learning to simulate and optimize the game. Strub et al. [2017] showed that the performance could be improved by applying plain policy gradient reinforcement learning, which maximizes the game success rate, as a second processing step. Building on these previous works, we propose two network architecture extensions. We utilize a Seq2Seq model [Sutskever et al., 2014] to process the image along with the historical dialogues for question generation. For the guessing task, we develop a Memory Network [Sukhbaatar et al., 2015] with an Attention Mechanism [Bahdanau et al., 2014] to process the generated question-answer pairs. We first train these two models using the plain policy gradient and use them as our baselines. Subsequently, we train the models with our new TPG methods and compare the performances with the baselines. We show that the TPG method is compatible with state-of-the-art architectures such as Seq2Seq and Memory Networks and contributes orthogonally to these advanced neural architectures. To the best of our knowledge, the presented work is the first to propose temperature-based policy gradient methods to leverage exploration and exploitation in the field of goal-oriented dialogue systems. We demonstrate the superior performance of our TPG methods by applying them to the GuessWhat?! game. Our contributions are:

• We introduce Tempered Policy Gradients in the context of goal-oriented dialogue systems, a novel class of approaches to temperature control to better leverage exploration and exploitation during training.

• We extend the state-of-the-art solutions for the GuessWhat?! game by integrating Seq2Seq and Memory Networks. We show that TPGs are compatible with these advanced models and further improve the performance.

2 Preliminaries

In our notation, we use x to denote the input to a policy network π, and x_i to denote the i-th element of the input vector. Similarly, w denotes the weight vector of π, and w_i denotes the i-th element of the weight vector of π. The output y is a multinoulli random variable with N states that follows a probability mass function, f(y = n | π(x | w)), where Σ_{n=1}^{N} f(y = n | π(x | w)) = 1 and f(·) ≥ 0. In a nutshell, a policy network parametrizes a probabilistic unit that produces the sampled output, mathematically, y ∼ f(π(x | w)).

At this point, we have defined the policy neural net and now discuss performance measures commonly used for optimization. Typically, the expected value of the accumulated reward, i.e., the return, conditioned on the policy network parameters, E(r | w), is used. Here, E denotes the expectation operator, r the accumulated reward signal, and w the network weight vector. The objective of reinforcement learning is to update the weights in a way that maximizes the expected return at each trial. In particular, the REINFORCE updating rule is

∆w_i = α_i (r − b_i) e_i,

where ∆w_i denotes the adjustment of weight w_i, α_i is a non-negative learning rate factor, and b_i is a reinforcement baseline. The term e_i is the characteristic eligibility of w_i, defined as e_i = (∂f/∂w_i)/f = ∂ln f/∂w_i. Williams [1992] has proved that the updating quantity, (r − b_i) ∂ln f/∂w_i, represents an unbiased estimate of ∂E(r | w)/∂w_i.
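For concreteness, the following is a minimal sketch of this REINFORCE update in PyTorch (not the paper's Torch7 code base); the names `policy`, `optimizer`, `reward`, and `baseline` are illustrative placeholders.

```python
# Minimal REINFORCE step: the gradient is (r - b) * d ln f(y | pi(x | w)) / dw.
import torch
from torch.distributions import Categorical

def reinforce_step(policy, optimizer, x, reward, baseline):
    logits = policy(x)                    # unnormalized scores over the N possible outputs (words)
    dist = Categorical(logits=logits)     # f(y = n | pi(x | w))
    y = dist.sample()                     # y ~ f(pi(x | w))
    loss = -(reward - baseline) * dist.log_prob(y)   # gradient ascent on (r - b) * ln f(y | .)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return y
```

Minimizing the negative of (r − b) ln f(y | ·) with a gradient-based optimizer is equivalent to the per-weight update rule stated above.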
3 Tempered Policy Gradient

In order to improve the exploration quality of REINFORCE in the task of optimizing policy-based dialogue agents, we attempt to find the optimal compromise between exploration and exploitation. In TPGs we introduce a parameter τ, the sampling temperature of the probabilistic output unit, which allows us to explicitly control the strength of the exploration.

3.1 Exploration and Exploitation

The trade-off between exploration and exploitation is one of the great challenges in reinforcement learning [Sutton and Barto, 1998]. To obtain a high reward, an agent must exploit the actions that have already proved effective in getting more rewards. However, to discover such actions, the agent must try actions which appear suboptimal, in order to explore the action space. In a stochastic task like text generation, each action, i.e., a word, must be tried many times to find out whether it is a reliable choice or not. The exploration-exploitation dilemma has been intensively studied over many decades [Carmel and Markovitch, 1999; Nachum et al., 2016; Liu et al., 2017]. Finding the balance between exploration and exploitation is considered crucial for the success of reinforcement learning [Thrun, 1992].

3.2 Temperature Sampling

In text generation, it is well known that the simple trick of temperature adjustment is sufficient to shift the language model to be more conservative or more diversified [Karpathy and Fei-Fei, 2015]. In order to control the trade-off between exploration and exploitation, we use a temperature parameter τ ≥ 0 to control the sampling. The output probability of each word is transformed by a temperature function as:

f^τ(y = n | π(x | w)) = f(y = n | π(x | w))^{1/τ} / Σ_{m=1}^{N} f(y = m | π(x | w))^{1/τ}.

We use the notation f^τ to denote a probability mass function f that is transformed by a temperature function with temperature τ. When the temperature is high, τ > 1, the distribution becomes more uniform; when the temperature is low, τ < 1, the distribution becomes more peaked. TPG is defined as an extension of the Monte Carlo Policy Gradient approach. We use a higher temperature, τ > 1, to encourage the model to explore the action space, and conversely, a lower temperature, τ < 1, to encourage exploitation. In the extreme case, τ = 0, we obtain greedy search.
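As a minimal sketch of this transform (assuming the policy already outputs a probability vector; the helper names are ours, not from the paper's code):

```python
import torch

def tempered_probs(probs, tau):
    """f^tau(y = n) = f(y = n)^(1/tau) / sum_m f(y = m)^(1/tau).
    probs: 1-D tensor of word probabilities f(y = n | pi(x | w)); tau > 0.
    tau > 1 flattens the distribution (more exploration), tau < 1 sharpens it,
    and the limit tau -> 0 recovers greedy search (handled separately in practice)."""
    scaled = probs.pow(1.0 / tau)
    return scaled / scaled.sum()

def sample_word(probs, tau=1.5):
    # y^tau ~ f^tau(pi(x | w))
    return torch.multinomial(tempered_probs(probs, tau), num_samples=1).item()
```

If the policy outputs logits and f is their softmax, the same transform is obtained by dividing the logits by τ before applying the softmax.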
3.3 Tempered Policy Gradient Methods

Here, we introduce three instances of TPGs in the domain of goal-oriented dialogues: single, parallel, and dynamic tempered policy gradient methods.

Single-TPG: The Single-TPG method simply uses a global temperature τ_global during the whole training process, i.e., we use τ_global > 1 to encourage exploration. The forward pass is represented mathematically as y^{τ_global} ∼ f^{τ_global}(π(x | w)), where π(x | w) represents a policy neural network that parametrizes a distribution f^{τ_global} over the vocabulary, and y^{τ_global} is the word sampled from this tempered word distribution. After sampling, the weights of the neural net are updated using

∆w_i = α_i (r − b_i) ∂ln f(y^{τ_global} | π(x | w)) / ∂w_i.

Noteworthy is that the actual gradient, ∂ln f(y^{τ_global} | π(x | w)) / ∂w_i, depends on the sampled word, y^{τ_global}, but does not depend directly on the temperature, τ. We prefer to find the words that lead to a reward, so that the model can learn quickly from these actions; otherwise, the neural network only learns to avoid the current failure actions. With Single-TPG and τ > 1, the entire vocabulary of a dialogue agent is explored more efficiently than by REINFORCE, because non-preferred words have a higher probability of being explored. This Single-TPG method is very easy to use and could yield a performance improvement after training, because the goal-oriented dialogue optimization could benefit from increased exploration. The temperature is initialized with τ = 1, then fine-tuned based on the learning curve on the validation set, and subsequently left fixed.

Parallel-TPG: A more advanced version of Single-TPG is the Parallel-TPG, which deploys several Single-TPGs concurrently with different temperatures, τ_1, ..., τ_n, and updates the weights based on all generated samples. During the forward pass, multiple copies of the neural net parametrize multiple word distributions. The words are sampled in parallel at various temperatures, mathematically, y^{τ_1}, ..., y^{τ_n} ∼ f^{τ_1,...,τ_n}(π(x | w)). After sampling, in the backward pass the weights are updated with the sum of gradients. The formula is given by

∆w_i = Σ_k α_i (r − b_i) ∂ln f(y^{τ_k} | π(x | w)) / ∂w_i,

where k ∈ {1, ..., n}. The combined use of higher and lower temperatures ensures both exploration and exploitation at the same time. The sum over the weight updates of the parallel policies gives a more accurate Monte Carlo estimate of ∂E(r | w)/∂w_i, due to the nature of Monte Carlo methods [Robert, 2004]. Thus, compared to Single-TPG, we would argue that Parallel-TPG is more robust and stable, although it needs more computational power. However, these computations can easily be distributed in a parallel fashion using state-of-the-art graphics processing units.

Dynamic-TPG: As a third variant, we introduce the Dynamic-TPG, which is the most sophisticated approach in the current TPG family. The essential idea is that we use a heuristic function h to assign the temperature τ to the word distribution at each time step t. The temperature is bounded in a predefined range [τ_min, τ_max]. The heuristic function we use here is based upon the term frequency inverse document frequency, tf-idf [Leskovec et al., 2014]. In the context of goal-oriented dialogues, we use the count of each word as the term frequency tf and the total number of generated dialogues during training as the document frequency df. We use the word that has the highest probability of being sampled at the current time step, y_t^*, as the input to the heuristic function h. Here, y_t^* is the maximizer of the probability mass function f, defined as y_t^* = argmax f(π(x | w)). We propose that tf-idf(y_t^*) approximates the concentration level of the distribution, in the sense that if the same word is always sampled from a distribution, then the distribution is very concentrated. Too much concentration prevents the model from exploring, so a higher temperature is needed. In order to achieve this effect, the heuristic function is defined as

τ_t^h = h(tf-idf(y_t^*)) = τ_min + (τ_max − τ_min) · (tf-idf(y_t^*) − tf-idf_min) / (tf-idf_max − tf-idf_min).

With this heuristic, words that occur very often are suppressed by applying a higher temperature to them, making them less likely to be selected in the near future. In the forward pass, a word is sampled using y^{τ_t^h} ∼ f^{τ_t^h}(π(x | w)). In the backward pass, the weights are updated correspondingly, using

∆w_i = α_i (r − b_i) ∂ln f(y^{τ_t^h} | π(x | w)) / ∂w_i,

where τ_t^h is the temperature calculated from the heuristic function. Compared to Parallel-TPG, the advantage of Dynamic-TPG is that it assigns the temperature more appropriately, without increasing the computational load.
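A minimal sketch of the heuristic h, assuming the caller maintains the tf-idf statistic over generated dialogues as described above; the clamping of out-of-range values and the function name are our additions:

```python
def dynamic_temperature(tfidf_value, tau_min=0.5, tau_max=1.5,
                        tfidf_min=0.0, tfidf_max=8.0):
    """Map the tf-idf of the currently most likely word y_t^* linearly onto [tau_min, tau_max].
    The default bounds are the values used in Section 5.2."""
    tfidf_value = max(tfidf_min, min(tfidf_max, tfidf_value))   # clamp (our assumption)
    scale = (tfidf_value - tfidf_min) / (tfidf_max - tfidf_min)
    return tau_min + (tau_max - tau_min) * scale
```

Frequently generated words thus receive a temperature close to τ_max and are sampled less greedily, while rarely generated words are sampled nearly greedily.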
4 GuessWhat?! Game

We evaluate our concepts using a recent testbed for AI, called the GuessWhat?! game [de Vries et al., 2017], available at https://guesswhat.ai. The dataset consists of 155k dialogues, including 822k question-answer pairs with a vocabulary of around 5k words, about 67k images [Lin et al., 2014], and 134k objects. The game is about visual object discovery through multi-round question answering between different players.

Formally, a GuessWhat?! game is represented by a tuple (I, D, O, o*), where I ∈ R^{H×W} denotes an image of height H and width W; D represents a dialogue composed of M rounds of question-answer pairs (QAs), D = (q_m, a_m)_{m=1}^{M}; O stands for a list of K objects, O = (o_k)_{k=1}^{K}; and o* is the target object. Each question is a sequence of words, q_m = {y_{m,1}, ..., y_{m,N_m}}, with length N_m. The words are taken from a defined vocabulary V, which consists of the words, a start token, and an end token. Each answer is either yes, no, or not applicable, i.e., a_m ∈ {yes, no, n.a.}. For each object o_k, there is a corresponding object category c_k ∈ {1, ..., C} and a pixel-wise segmentation mask S_k ∈ {0, 1}^{H×W}. Finally, we use colon notation (:) to select a subset of a sequence; for instance, (q, a)_{1:m} refers to the first m rounds of QAs in a dialogue.

4.1 Models and Pretraining

Following [Strub et al., 2017], we first train all three models in a supervised fashion.

Oracle: The task of the Oracle is to answer questions regarding the target object. We outline here the simple neural network architecture that achieved the best performance in the study of [de Vries et al., 2017], and which we also used in our experiments. The input information used here is of three modalities, namely the question q, the spatial information x*_spatial, and the category c* of the target object. For encoding the question, de Vries et al. first use a lookup table to learn the embedding of words, then use a one-layer long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] to encode the whole question. For the spatial information, de Vries et al. extract an 8-dimensional vector of the location of the bounding box, [x_min, y_min, x_max, y_max, x_center, y_center, w_box, h_box], where x, y denote the coordinates and w_box, h_box denote the width and height of the bounding box, respectively. De Vries et al. normalize the image width and height so that the coordinates range from -1 to 1. The origin is at the image center. The category embedding of the object is also learned with a lookup table during training. In the last step, de Vries et al. concatenate all three embeddings into one feature vector and feed it into a one-hidden-layer multilayer perceptron (MLP). The softmax output layer predicts the distribution, Oracle := p(a | q, c*, x*_spatial), over the three classes no, yes, and not applicable. The model is trained using the negative log-likelihood criterion. The Oracle structure is shown in Fig. 1.

[Figure 1: Oracle model — an LSTM encoding of the question and MLP embeddings of the spatial and category inputs feed a softmax over {yes, no, n.a.}.]
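The following PyTorch sketch mirrors the Oracle description above; the layer sizes and the exact fusion of the three embeddings are assumptions for illustration, not the hyper-parameters of de Vries et al. [2017].

```python
import torch
import torch.nn as nn

class Oracle(nn.Module):
    """Answers yes / no / n.a. given the question, the target category, and its spatial vector."""
    def __init__(self, vocab_size, num_categories, emb_dim=64, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)          # lookup table for question words
        self.question_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.category_emb = nn.Embedding(num_categories, emb_dim)  # lookup table for object categories
        self.mlp = nn.Sequential(                                  # one hidden layer, 3-way output
            nn.Linear(hidden + emb_dim + 8, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, question_tokens, category, spatial):
        # question_tokens: (B, T) word ids; category: (B,) ids; spatial: (B, 8) bounding-box vector
        _, (h, _) = self.question_lstm(self.word_emb(question_tokens))
        features = torch.cat([h[-1], self.category_emb(category), spatial], dim=-1)
        return self.mlp(features)    # logits of p(a | q, c*, x*_spatial)
```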
[Figure 2: Question-Generator model — a VGG image encoding and an LSTM encoding of the dialogue history initialize an LSTM decoder that generates the next question word by word.]

Question-Generator: The goal of the Question-Generator (QGen) is to ask the Oracle meaningful questions, q_{m+1}, given the whole image, I, and the historical question-answer pairs, (q, a)_{1:m}. In previous work [Strub et al., 2017], the state transition function was modelled as an LSTM, which was trained using whole dialogues so that the model memorizes the historical QAs. We refer to this as dialogue-level training. We develop a novel QGen architecture using a modified version of the Seq2Seq model [Sutskever et al., 2014]. The modified Seq2Seq model enables question-level training, which means that the model is fed with the historical QAs and then learns to produce a new question. Following [Strub et al., 2017], we first encode the whole image into a fixed-size feature vector using the VGG-net [Simonyan and Zisserman, 2014]. The features come from the fc-8 layer of the VGG-net. For processing the historical QAs, we use a lookup table to learn the word embeddings, then again use an LSTM encoder to encode the history information into a fixed-size latent representation, and concatenate it with the image representation:

s^enc_{m,N_m} = encoder(LSTM((q, a)_{1:m}), VGG(I)).

The encoder and decoder are coupled by initializing the decoder state with the last encoder state, mathematically, s^dec_{m+1,0} = s^enc_{m,N_m}. The LSTM decoder generates each word based on the concatenated representation and the previously generated word (note that the first word is a start token):

y_{m+1,n} = decoder(LSTM(y_{m+1,n−1}, s^dec_{m+1,n−1})).

The decoder shares the same lookup table weights as the encoder. The Seq2Seq model, consisting of the encoder and the decoder, is trained end-to-end to minimize the negative log-likelihood cost. During testing, the decoder gets a start token and the representation from the encoder, and then generates each word at each time step until it encounters a question mark token, QGen := p(y_{m+1,n} | (q, a)_{1:m}, I). The output is a complete question. After several question-answer rounds, the QGen outputs an end-of-dialogue token and stops asking questions. The overall structure of the QGen model is illustrated in Fig. 2.
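A rough PyTorch sketch of this question-level Seq2Seq QGen follows. The dimensions are illustrative, and fusing the VGG feature by projecting and adding it to the encoder state is a simplifying assumption; the paper concatenates the image and history representations.

```python
import torch
import torch.nn as nn

class QGenSeq2Seq(nn.Module):
    """Question-level Seq2Seq question generator: encode history + image, decode the next question."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, img_dim=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # lookup table shared by encoder and decoder
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)        # projects the VGG fc-8 feature
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history_tokens, vgg_features, question_in):
        # history_tokens: (B, T) ids of (q, a)_{1:m}; vgg_features: (B, img_dim); question_in: (B, T') ids
        _, (h_enc, c_enc) = self.encoder(self.emb(history_tokens))
        h0 = h_enc + self.img_proj(vgg_features).unsqueeze(0)      # fuse image and history (sketch choice)
        out, _ = self.decoder(self.emb(question_in), (h0, c_enc))  # decoder starts from the encoder state
        return self.out(out)                                       # per-step word logits
```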
[Figure 3: Guesser model — QA pairs are encoded into memory facts by an LSTM; object spatial and category embeddings form a key that attends over the facts in two hops before a softmax over the candidate objects.]

Guesser: The goal of the Guesser model is to find out which object the Oracle model is referring to, given the complete history of the dialogue and a list of objects in the image, p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The Guesser model has access to the spatial, x^O_spatial, and category information, c^O, of the objects in the list. The task of the Guesser model is challenging because it needs to understand the dialogue, focus on the important content, and then guess the object. To accomplish this task, we decided to integrate the Memory [Sukhbaatar et al., 2015] and Attention [Bahdanau et al., 2014] modules into the Guesser architecture used in the previous work [Strub et al., 2017]. First, we use an LSTM header to process the varying-length question-answer pairs in parallel into multiple fixed-size vectors. Each QA pair is encoded into a fact, Fact_m = LSTM((q, a)_m), and stored in a memory base. Next, we use the sum of the spatial and category embeddings of all objects as a key, Key_1 = MLP(x^O_spatial, c^O), to query the memory and calculate an attention mask, Attention_1(Fact_m) = Fact_m · Key_1, over each fact. Then, we use the sum of the attended facts and the first key to calculate a second key, and use the second key to query the memory base again to obtain a more accurate attention. These are the so-called "two hops" of attention in the literature [Sukhbaatar et al., 2015]. Finally, we compare the attended facts with each object embedding in the list using a dot product. The object most similar to these facts is the prediction, Guesser := p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The intention of using the attention module here is to find the most relevant descriptions or facts concerning the candidate objects. We train the whole Guesser network end-to-end using the negative log-likelihood criterion. A more graphical description of the Guesser model is shown in Fig. 3.

4.2 Reinforcement Learning

Now, we post-train the QGen and the Guesser model with reinforcement learning. We keep the Oracle model fixed. In each game episode, when the models find the correct object, r = 1; otherwise, r = 0. Next, we can assign credit for each action of the QGen and the Guesser models. In the case of the QGen model, we spread the reward uniformly over the sequence of actions in the episode. The baseline function, b, used here is the running average of the game success rate. Note that the Guesser model has only one action in each episode, i.e., making the guess. If the Guesser finds the correct object, then it gets an immediate reward and the Guesser's parameters are updated using the REINFORCE rule without baseline. The QGen is trained using the following four methods.

REINFORCE: The baseline method used here is REINFORCE [Williams, 1992]. During training, in the forward pass the words are sampled with τ = 1, y_{m+1,n} ∼ f(QGen(x | w)). In the backward pass, the weights are updated using REINFORCE, as

w = w + α (r − b) ∇_w ln f(y_{m+1,n} | QGen(x | w)).

Single-TPG: We use the temperature τ_global = 1.5 during training to encourage exploration, mathematically, y^{τ_global}_{m+1,n} ∼ f^{τ_global}(QGen(x | w)). In the backward pass, the weights are updated using

w = w + α (r − b) ∇_w ln f(y^{τ_global}_{m+1,n} | QGen(x | w)).

Parallel-TPG: For Parallel-TPG, we use two temperatures, τ_1 = 1.0 and τ_2 = 1.5, to encourage exploration. The words are sampled in the forward pass using y^{τ_1}_{m+1,n}, y^{τ_2}_{m+1,n} ∼ f^{τ_1,τ_2}(QGen(x | w)). In the backward pass, the weights are updated using

w = w + Σ_{k=1}^{2} α (r − b) ∇_w ln f(y^{τ_k}_{m+1,n} | QGen(x | w)).

Dynamic-TPG: The last method we evaluated is Dynamic-TPG. We use the heuristic function to calculate the temperature for each word at each time step:

τ^h_{m+1,n} = τ_min + (τ_max − τ_min) · (tf-idf(y*_{m+1,n}) − tf-idf_min) / (tf-idf_max − tf-idf_min),

where we set τ_min = 0.5, τ_max = 1.5, tf-idf_min = 0, and tf-idf_max = 8. After the calculation of τ^h_{m+1,n}, we substitute the value into the formula at each time step and sample the next word using

y^{τ^h_{m+1,n}}_{m+1,n} ∼ f^{τ^h_{m+1,n}}(QGen(x | w)).

In the backward pass, the weights are updated using

w = w + α (r − b) ∇_w ln f(y^{τ^h_{m+1,n}}_{m+1,n} | QGen(x | w)).

For all four methods, we use greedy search in evaluation.
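As a minimal sketch of one tempered word step for the QGen (our own PyTorch illustration, not the paper's Torch7 code): the word is sampled from the tempered distribution, while the gradient is taken on the untempered log-probability, as in the update rules above. Accumulating such per-word losses over an episode and scaling by (r − b) recovers the updates listed for the four methods.

```python
import torch

def tpg_word_step(logits, tau, reward, baseline):
    """One generated word: sample y^tau ~ f^tau, return a surrogate loss whose gradient is
    -(r - b) * grad ln f(y^tau | QGen(x | w)).  logits: 1-D tensor over the vocabulary."""
    with torch.no_grad():
        tempered = torch.softmax(logits / tau, dim=-1)   # f^tau (equivalent to the power form of Sec. 3.2)
        y = torch.multinomial(tempered, num_samples=1)   # y^tau ~ f^tau(QGen(x | w))
    log_f = torch.log_softmax(logits, dim=-1)            # ln f(. | QGen(x | w)) at tau = 1
    loss = -(reward - baseline) * log_f[y]
    return y.item(), loss
```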
5 Experiment

We first train all the networks in a supervised fashion and then further optimize the QGen and the Guesser model using reinforcement learning. The source code is available at https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient, which uses Torch7 [Collobert et al., 2011].

5.1 Pretraining

We train all three models with a dropout rate of 0.5 [Srivastava et al., 2014], using the ADAM optimizer [Kingma and Ba, 2014]. We use a learning rate of 0.0001 for the Oracle model and the Guesser model, and a learning rate of 0.001 for the QGen. All the models are trained for at most 30 epochs and early-stopped if there is no improvement on the validation set within five epochs. We report the performance on the test set, which consists of images not used in training. We report the game success rate as the performance metric for all three models, which equals the number of successful games divided by the total number of games. Compared to previous works [de Vries et al., 2017; Strub et al., 2017; Strub and de Vries, 2017], after supervised training, our models obtain a game success rate of 48.77%, which is about 4% higher than the state-of-the-art method [Strub and de Vries, 2017] with 44.6% accuracy.

5.2 Reinforcement Learning

We first initialize all models with the pretrained parameters from supervised learning and then post-train the QGen using either REINFORCE or TPG for 80 epochs. We update the parameters using stochastic gradient descent (SGD) with a learning rate of 0.001 and a batch size of 64. In each epoch, we sample each image in the training set once and randomly pick one of the objects as a target. We track the running average of the game success rate and use it directly as the baseline, b, in REINFORCE. We limit the maximum number of questions to 8 and the maximum number of words to 12. Simultaneously, we train the Guesser model using REINFORCE without baseline and SGD with a learning rate of 0.0001. The performance comparison between our baseline (#3) and the methods from the literature (#1 and #2) is shown in Tab. 1.

#  Method                        Accuracy
1  [Strub et al., 2017]          52.30%
2  [Strub and de Vries, 2017]    60.30%
3  REINFORCE                     69.66%
4  Single-TPG                    69.76%
5  Parallel-TPG                  73.86%
6  Dynamic-TPG                   74.31%

Table 1: Performance comparison of our methods with methods reported in the literature after reinforcement learning.
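A minimal sketch of the running-average success-rate baseline follows; the exponential-moving-average form and the momentum value are our assumptions, since the paper only states that the running average of the game success rate is used directly as b.

```python
class RunningBaseline:
    """Tracks a running average of the game success rate to use as the REINFORCE baseline b."""
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.value = 0.0

    def update(self, game_success):
        # game_success: 1.0 if the Guesser picked the correct object, else 0.0
        self.value = self.momentum * self.value + (1.0 - self.momentum) * game_success
        return self.value
```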
REINFORCE Baseline: From Tab. 1, we see that our models trained with REINFORCE (#3) are about 9% better than the state-of-the-art methods (#1 and #2). The improvements are due to advanced mechanisms and techniques such as the Seq2Seq structure in the QGen, the memory and attention mechanisms in the Guesser, and the training of the Guesser model with reinforcement learning. One important difference is that our QGen model is trained at the question level. This means that the model first learns to query meaningfully, step by step. Eventually, it learns to conduct a meaningful dialogue. Compared to directly learning to manage a strategic conversation, this bottom-up training procedure helps the model absorb knowledge, because it breaks large tasks down into smaller, more manageable pieces. This makes the learning for the QGen much easier. In the remainder of the section, we use our models, boosted with memory network, attention, and Seq2Seq and trained with REINFORCE, as a strong baseline and analyse the performance improvements achieved by the TPGs.

From Tab. 1, we see that, compared to REINFORCE (#3), Single-TPG (#4) with τ_global = 1.5 achieves a comparable performance. With two different temperatures, τ_1 = 1.0 and τ_2 = 1.5, Parallel-TPG (#5) achieves an improvement of approximately 4%, although it requires more computational resources. Compared to Parallel-TPG, Dynamic-TPG uses only the same computational power as REINFORCE and still gives a larger improvement by using a dynamic temperature, τ_t^h ∈ [0.5, 1.5]. Overall, the best model is Dynamic-TPG (#6), which gives a 4.65% improvement over our strong baseline. Here, we have shown that our proposed methods contribute orthogonally, in the sense that they further improve models already boosted with advanced modules such as memory network, attention, and Seq2Seq.

Table 2: Samples generated by our improved models using REINFORCE (left column: "Policy Gradient") and Dynamic-TPG (right column: "Tempered Policy Gradient"). The green bounding boxes highlight the target objects; the red boxes highlight the wrong guesses.

Game 1
Policy Gradient: Is it in left? No / Is it in front? No / Is it in right? Yes / Is it in middle? Yes / Is it person? No / Is it ball? No / Is it bat? No / Is it car? Yes — Status: Failure
Tempered Policy Gradient: Is it a person? No / Is it a vehicle? Yes / Is it a truck? Yes / Is it in front of photo? No / In the left half? No / In the middle of photo? Yes / Is it to the right photo? Yes / Is it in the middle of photo? Yes — Status: Success

Game 2
Policy Gradient: Is it in left? No / Is it in front? Yes / Is it in right? No / Is it in middle? Yes / Is it person? No / Is it giraffe? Yes / Is in middle? Yes / Is in middle? Yes — Status: Failure
Tempered Policy Gradient: Is it a giraffe? Yes / In front of photo? Yes / In the left half? Yes / Is it in the middle of photo? Yes / Is it to the left of photo? Yes / Is it to the right photo? No / In the left in photo? No / In the middle of photo? Yes — Status: Success

TPG Dialogue Samples: The generated dialogue samples in Tab. 2 give some interesting insights into why TPG methods give better results. First of all, the sentences generated by TPG-trained models are on average longer and use slightly more complex structures, such as "Is it in the middle of photo?" instead of the simple form "Is it in middle?". Secondly, TPGs enable the models to explore better and comprehend more words. For example, in the first game (upper half of Tab. 2), both models ask about the category. The REINFORCE-trained model can only ask with the single word "car" to query about the vehicle category. In contrast, the TPG-trained model first asks about a more general category with the word "vehicle" and follows up with a more specific category, "truck". These two words, "vehicle" and "truck", give much more information than the single word "car", and help the Guesser model identify the truck among many cars. Lastly, similar to the category case, the models trained with TPG can first ask about a larger spatial region of the object and then follow up with a smaller one. In the second game (lower half of Tab. 2), we see that the TPG-trained model first asks "In the left half?", which refers to all three giraffes in the left half, and the answer is "Yes". Then it asks "Is it to the left of photo?", which refers to the second giraffe from the left, and the answer is "Yes". Eventually the QGen asks "In the left in photo?", which refers to the leftmost giraffe, and the answer is "No". These specific questions about locations are not learned using REINFORCE. The REINFORCE-trained model can only ask a similar question with the word "left". In this task, there are many giraffes in the left part of the image. The top-down spatial questions help the Guesser model find the correct giraffe.
To summarize, the TPG-trained models use longer and more informative sentences than the REINFORCE-trained models.

6 Conclusion

Our paper makes two contributions. Firstly, by extending existing models with Seq2Seq and Memory Networks we could improve the performance of a goal-oriented dialogue system by 9%. Secondly, we introduced TPG, a novel class of temperature-based policy gradient approaches. TPGs boosted the performance of the goal-oriented dialogue systems by another 4.7%. Among the three TPGs, Dynamic-TPG gave the best performance, which helped the agent comprehend more words and produce more meaningful questions. TPG is a generic strategy to encourage word exploration on top of policy gradients and can be applied to any text-generating agent.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[Bordes and Weston, 2017] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations (ICLR), 2017.

[Carmel and Markovitch, 1999] David Carmel and Shaul Markovitch. Exploration strategies for model-based learning in multi-agent systems: Exploration strategies. Autonomous Agents and Multi-Agent Systems, 2(2):141–172, 1999.

[Collobert et al., 2011] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV), 2017.

[de Vries et al., 2017] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[Dhingra et al., 2016] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Leskovec et al., 2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[Lewis et al., 2017] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Liu et al., 2017] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[Nachum et al., 2016] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring under-appreciated rewards. arXiv preprint arXiv:1611.09321, 2016.

[Robert, 2004] Christian P. Robert. Monte Carlo Methods. Wiley Online Library, 2004.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Strub and de Vries, 2017] Florian Strub and Harm de Vries. GuessWhat?! models. https://github.com/GuessWhatGame/guesswhat/, 2017.

[Strub et al., 2017] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[Sukhbaatar et al., 2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Thrun, 1992] Sebastian B. Thrun. Efficient exploration in reinforcement learning. 1992.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.