Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient

Rui Zhao and Volker Tresp
Siemens AG, Corporate Technology, Munich, Germany
Ludwig-Maximilians-Universität München, Munich, Germany
{ruizhao, volker.tresp}@siemens.com

Abstract

Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, training text-generating agents efficiently is still a considerable challenge. Commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, which are referred to as Tempered Policy Gradients (TPGs). These methods encourage exploration with different temperature control strategies. We derive three variations of the TPGs and show their superior performance on a recently published AI testbed, the GuessWhat?! game. On this testbed, we achieve significant improvements with two innovations. The first is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures that leads to an improvement of 9%. The second is the application of our newly developed TPG methods, which improves the performance by an additional 5% and, even more importantly, helps produce more convincing utterances. TPG can easily be applied to any goal-oriented dialogue system.

1 Introduction

In recent years, deep learning has shown convincing performance in various areas such as image recognition, speech recognition, and natural language processing (NLP). Deep neural nets are capable of learning complex dependencies from huge amounts of data and its human-generated annotations in a supervised way. In contrast, reinforcement learning agents [Sutton and Barto, 1998] can learn directly from their interactions with the environment without any supervision and surpass human performance in several domains, for instance in the game of Go [Silver et al., 2016], as well as in many computer games [Mnih et al., 2015]. In this paper we are concerned with the application of both approaches to goal-oriented dialogue systems [Bordes and Weston, 2017; de Vries et al., 2017; Das et al., 2017; Strub et al., 2017; Lewis et al., 2017; Dhingra et al., 2016], a problem that has recently caught the attention of machine learning researchers. De Vries et al. [2017] have proposed as AI testbed a visually grounded object-guessing game called GuessWhat?!. Das et al. [2017] formulated a visual dialogue system in which two chatbots ask and answer questions to identify a specific image within a group of images. More practically, dialogue agents have been applied to negotiate a deal [Lewis et al., 2017] and to access information from knowledge bases [Dhingra et al., 2016]. The essential idea in these systems is to train different dialogue agents to accomplish the tasks. In those papers, the agents have been trained with policy gradients, i.e., REINFORCE [Williams, 1992].

In order to improve the exploration quality of policy gradients, we present three instances of temperature-based methods. The first is a single-temperature approach which is very easy to apply. The second is a parallel approach with multiple temperature policies running concurrently. This second approach is more demanding on computational resources, but results in more stable solutions. The third is a temperature policy approach that dynamically adjusts the temperature for each action at each time step, based on action frequencies. This dynamic method is more sophisticated and proves more efficient in the experiments. In the experiments, all these methods demonstrate better exploration strategies in comparison to the plain policy gradient.
We demonstrate our approaches using a real-world dataset called GuessWhat?!. The GuessWhat?! game [de Vries et al., 2017] is a visual object discovery game between two players, the Oracle and the Questioner. The Questioner tries to identify an object by asking the Oracle questions. The original works [de Vries et al., 2017; Strub et al., 2017] first proposed supervised learning to simulate and optimize the game. Strub et al. [2017] showed that the performance could be improved by applying plain policy gradient reinforcement learning, which maximizes the game success rate, as a second processing step. Building on these previous works, we propose two network architecture extensions. We utilize a Seq2Seq model [Sutskever et al., 2014] to process the image along with the historical dialogues for question generation. For the guessing task, we develop a Memory Network [Sukhbaatar et al., 2015] with an Attention Mechanism [Bahdanau et al., 2014] to process the generated question-answer pairs. We first train these two models using the plain policy gradient and use them as our baselines. Subsequently, we train the models with our new TPG methods and compare the performances with the baselines. We show that the TPG method is compatible with state-of-the-art architectures such as Seq2Seq and Memory Networks and contributes orthogonally to these advanced neural architectures. To the best of our knowledge, the presented work is the first to propose temperature-based policy gradient methods to leverage exploration and exploitation in the field of goal-oriented dialogue systems. We demonstrate the superior performance of our TPG methods by applying them to the GuessWhat?! game. Our contributions are:

• We introduce Tempered Policy Gradients in the context of goal-oriented dialogue systems, a novel class of approaches to temperature control to better leverage exploration and exploitation during training.

• We extend the state-of-the-art solutions for the GuessWhat?! game by integrating Seq2Seq and Memory Networks. We show that TPGs are compatible with these advanced models and further improve the performance.

2 Preliminaries

In our notation, we use x to denote the input to a policy network π, and x_i to denote the i-th element of the input vector. Similarly, w denotes the weight vector of π, and w_i denotes the i-th element of the weight vector of π. The output y is a multinoulli random variable with N states that follows a probability mass function, f(y = n | π(x | w)), where Σ_{n=1}^{N} f(y = n | π(x | w)) = 1 and f(·) ≥ 0. In a nutshell, a policy network parametrizes a probabilistic unit that produces the sampled output, mathematically, y ∼ f(π(x | w)).

At this point, we have defined the policy neural net and now discuss performance measures commonly used for optimization. Typically, the expected value of the accumulated reward, i.e., the return, conditioned on the policy network parameters, E(r | w), is used. Here, E denotes the expectation operator, r the accumulated reward signal, and w the network weight vector. The objective of reinforcement learning is to update the weights in a way that maximizes the expected return at each trial. In particular, the REINFORCE updating rule is

∆w_i = α_i (r − b_i) e_i,

where ∆w_i denotes the adjustment of weight w_i, α_i is a non-negative learning rate factor, and b_i is a reinforcement baseline. The term e_i is the characteristic eligibility of w_i, defined as e_i = (∂f/∂w_i)/f = ∂ln f/∂w_i. Williams [1992] has proved that the updating quantity, (r − b_i) ∂ln f/∂w_i, represents an unbiased estimate of ∂E(r | w)/∂w_i.
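For concreteness, the following is a minimal sketch of this REINFORCE update in PyTorch (not the paper's Torch7 code base); the names `policy`, `optimizer`, `reward`, and `baseline` are illustrative placeholders.

```python
# Minimal REINFORCE step: the gradient is (r - b) * d ln f(y | pi(x | w)) / dw.
import torch
from torch.distributions import Categorical

def reinforce_step(policy, optimizer, x, reward, baseline):
    logits = policy(x)                    # unnormalized scores over the N possible outputs (words)
    dist = Categorical(logits=logits)     # f(y = n | pi(x | w))
    y = dist.sample()                     # y ~ f(pi(x | w))
    loss = -(reward - baseline) * dist.log_prob(y)   # gradient ascent on (r - b) * ln f(y | .)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return y
```

Minimizing the negative of (r − b) ln f(y | ·) with a gradient-based optimizer is equivalent to the per-weight update rule stated above.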
3 Tempered Policy Gradient

In order to improve the exploration quality of REINFORCE in the task of optimizing policy-based dialogue agents, we attempt to find the optimal compromise between exploration and exploitation. In TPGs we introduce a parameter τ, the sampling temperature of the probabilistic output unit, which allows us to explicitly control the strength of the exploration.

3.1 Exploration and Exploitation

The trade-off between exploration and exploitation is one of the great challenges in reinforcement learning [Sutton and Barto, 1998]. To obtain a high reward, an agent must exploit the actions that have already proved effective in getting more rewards. However, to discover such actions, the agent must try actions which appear suboptimal, in order to explore the action space. In a stochastic task like text generation, each action, i.e., a word, must be tried many times to find out whether it is a reliable choice or not. The exploration-exploitation dilemma has been intensively studied over many decades [Carmel and Markovitch, 1999; Nachum et al., 2016; Liu et al., 2017]. Finding the balance between exploration and exploitation is considered crucial for the success of reinforcement learning [Thrun, 1992].

3.2 Temperature Sampling

In text generation, it is well known that the simple trick of temperature adjustment is sufficient to shift the language model to be more conservative or more diversified [Karpathy and Fei-Fei, 2015]. In order to control the trade-off between exploration and exploitation, we use a temperature parameter τ ≥ 0 to control the sampling. The output probability of each word is transformed by a temperature function as:

f^τ(y = n | π(x | w)) = f(y = n | π(x | w))^{1/τ} / Σ_{m=1}^{N} f(y = m | π(x | w))^{1/τ}.

We use the notation f^τ to denote a probability mass function f that is transformed by a temperature function with temperature τ. When the temperature is high, τ > 1, the distribution becomes more uniform; when the temperature is low, τ < 1, the distribution becomes more peaked. TPG is defined as an extension of the Monte Carlo Policy Gradient approach. We use a higher temperature, τ > 1, to encourage the model to explore the action space, and conversely, a lower temperature, τ < 1, to encourage exploitation. In the extreme case, τ = 0, we obtain greedy search.
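As a minimal sketch of this transform (assuming the policy already outputs a probability vector; the helper names are ours, not from the paper's code):

```python
import torch

def tempered_probs(probs, tau):
    """f^tau(y = n) = f(y = n)^(1/tau) / sum_m f(y = m)^(1/tau).
    probs: 1-D tensor of word probabilities f(y = n | pi(x | w)); tau > 0.
    tau > 1 flattens the distribution (more exploration), tau < 1 sharpens it,
    and the limit tau -> 0 recovers greedy search (handled separately in practice)."""
    scaled = probs.pow(1.0 / tau)
    return scaled / scaled.sum()

def sample_word(probs, tau=1.5):
    # y^tau ~ f^tau(pi(x | w))
    return torch.multinomial(tempered_probs(probs, tau), num_samples=1).item()
```

If the policy outputs logits and f is their softmax, the same transform is obtained by dividing the logits by τ before applying the softmax.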
3.3 Tempered Policy Gradient Methods

Here, we introduce three instances of TPGs in the domain of goal-oriented dialogues: single, parallel, and dynamic tempered policy gradient methods.

Single-TPG: The Single-TPG method simply uses a global temperature τ_global during the whole training process, i.e., we use τ_global > 1 to encourage exploration. The forward pass is represented mathematically as y^{τ_global} ∼ f^{τ_global}(π(x | w)), where π(x | w) represents a policy neural network that parametrizes a distribution f^{τ_global} over the vocabulary, and y^{τ_global} is the word sampled from this tempered word distribution. After sampling, the weights of the neural net are updated using

∆w_i = α_i (r − b_i) ∂ln f(y^{τ_global} | π(x | w)) / ∂w_i.

Noteworthy is that the actual gradient, ∂ln f(y^{τ_global} | π(x | w)) / ∂w_i, depends on the sampled word, y^{τ_global}, but does not depend directly on the temperature, τ. We prefer to find the words that lead to a reward, so that the model can learn quickly from these actions; otherwise, the neural network only learns to avoid the current failure actions. With Single-TPG and τ > 1, the entire vocabulary of a dialogue agent is explored more efficiently than by REINFORCE, because non-preferred words have a higher probability of being explored. This Single-TPG method is very easy to use and could yield a performance improvement after training, because the goal-oriented dialogue optimization could benefit from increased exploration. The temperature is initialized with τ = 1, then fine-tuned based on the learning curve on the validation set, and subsequently left fixed.

Parallel-TPG: A more advanced version of Single-TPG is the Parallel-TPG, which deploys several Single-TPGs concurrently with different temperatures, τ_1, ..., τ_n, and updates the weights based on all generated samples. During the forward pass, multiple copies of the neural net parametrize multiple word distributions. The words are sampled in parallel at various temperatures, mathematically, y^{τ_1}, ..., y^{τ_n} ∼ f^{τ_1,...,τ_n}(π(x | w)). After sampling, in the backward pass the weights are updated with the sum of gradients. The formula is given by

∆w_i = Σ_k α_i (r − b_i) ∂ln f(y^{τ_k} | π(x | w)) / ∂w_i,

where k ∈ {1, ..., n}. The combined use of higher and lower temperatures ensures both exploration and exploitation at the same time. The sum over the weight updates of the parallel policies gives a more accurate Monte Carlo estimate of ∂E(r | w)/∂w_i, due to the nature of Monte Carlo methods [Robert, 2004]. Thus, compared to Single-TPG, we would argue that Parallel-TPG is more robust and stable, although it needs more computational power. However, these computations can easily be distributed in a parallel fashion using state-of-the-art graphics processing units.

Dynamic-TPG: As a third variant, we introduce the Dynamic-TPG, which is the most sophisticated approach in the current TPG family. The essential idea is that we use a heuristic function h to assign the temperature τ to the word distribution at each time step t. The temperature is bounded in a predefined range [τ_min, τ_max]. The heuristic function we use here is based upon the term frequency inverse document frequency, tf-idf [Leskovec et al., 2014]. In the context of goal-oriented dialogues, we use the count of each word as the term frequency tf and the total number of generated dialogues during training as the document frequency df. We use the word that has the highest probability of being sampled at the current time step, y_t^*, as the input to the heuristic function h. Here, y_t^* is the maximizer of the probability mass function f, defined as y_t^* = argmax f(π(x | w)). We propose that tf-idf(y_t^*) approximates the concentration level of the distribution, in the sense that if the same word is always sampled from a distribution, then the distribution is very concentrated. Too much concentration prevents the model from exploring, so a higher temperature is needed. In order to achieve this effect, the heuristic function is defined as

τ_t^h = h(tf-idf(y_t^*)) = τ_min + (τ_max − τ_min) · (tf-idf(y_t^*) − tf-idf_min) / (tf-idf_max − tf-idf_min).

With this heuristic, words that occur very often are suppressed by applying a higher temperature to them, making them less likely to be selected in the near future. In the forward pass, a word is sampled using y^{τ_t^h} ∼ f^{τ_t^h}(π(x | w)). In the backward pass, the weights are updated correspondingly, using

∆w_i = α_i (r − b_i) ∂ln f(y^{τ_t^h} | π(x | w)) / ∂w_i,

where τ_t^h is the temperature calculated from the heuristic function. Compared to Parallel-TPG, the advantage of Dynamic-TPG is that it assigns the temperature more appropriately, without increasing the computational load.
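A minimal sketch of the heuristic h, assuming the caller maintains the tf-idf statistic over generated dialogues as described above; the clamping of out-of-range values and the function name are our additions:

```python
def dynamic_temperature(tfidf_value, tau_min=0.5, tau_max=1.5,
                        tfidf_min=0.0, tfidf_max=8.0):
    """Map the tf-idf of the currently most likely word y_t^* linearly onto [tau_min, tau_max].
    The default bounds are the values used in Section 5.2."""
    tfidf_value = max(tfidf_min, min(tfidf_max, tfidf_value))   # clamp (our assumption)
    scale = (tfidf_value - tfidf_min) / (tfidf_max - tfidf_min)
    return tau_min + (tau_max - tau_min) * scale
```

Frequently generated words thus receive a temperature close to τ_max and are sampled less greedily, while rarely generated words are sampled nearly greedily.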
4 GuessWhat?! Game

We evaluate our concepts using a recent testbed for AI, called the GuessWhat?! game [de Vries et al., 2017], available at https://guesswhat.ai. The dataset consists of 155k dialogues, including 822k question-answer pairs with a vocabulary of around 5k words, about 67k images [Lin et al., 2014], and 134k objects. The game is about visual object discovery through multi-round question answering between different players.

Formally, a GuessWhat?! game is represented by a tuple (I, D, O, o*), where I ∈ R^{H×W} denotes an image of height H and width W; D represents a dialogue composed of M rounds of question-answer pairs (QAs), D = (q_m, a_m)_{m=1}^{M}; O stands for a list of K objects, O = (o_k)_{k=1}^{K}; and o* is the target object. Each question is a sequence of words, q_m = {y_{m,1}, ..., y_{m,N_m}}, with length N_m. The words are taken from a defined vocabulary V, which consists of the words, a start token, and an end token. Each answer is either yes, no, or not applicable, i.e., a_m ∈ {yes, no, n.a.}. For each object o_k, there is a corresponding object category c_k ∈ {1, ..., C} and a pixel-wise segmentation mask S_k ∈ {0, 1}^{H×W}. Finally, we use colon notation (:) to select a subset of a sequence; for instance, (q, a)_{1:m} refers to the first m rounds of QAs in a dialogue.

4.1 Models and Pretraining

Following [Strub et al., 2017], we first train all three models in a supervised fashion.

Oracle: The task of the Oracle is to answer questions regarding the target object. We outline here the simple neural network architecture that achieved the best performance in the study of [de Vries et al., 2017], and which we also used in our experiments. The input information used here is of three modalities, namely the question q, the spatial information x*_spatial, and the category c* of the target object. For encoding the question, de Vries et al. first use a lookup table to learn the embedding of words, then use a one-layer long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] to encode the whole question. For the spatial information, de Vries et al. extract an 8-dimensional vector of the location of the bounding box, [x_min, y_min, x_max, y_max, x_center, y_center, w_box, h_box], where x, y denote the coordinates and w_box, h_box denote the width and height of the bounding box, respectively. De Vries et al. normalize the image width and height so that the coordinates range from -1 to 1. The origin is at the image center. The category embedding of the object is also learned with a lookup table during training. In the last step, de Vries et al. concatenate all three embeddings into one feature vector and feed it into a one-hidden-layer multilayer perceptron (MLP). The softmax output layer predicts the distribution, Oracle := p(a | q, c*, x*_spatial), over the three classes no, yes, and not applicable. The model is trained using the negative log-likelihood criterion. The Oracle structure is shown in Fig. 1.

[Figure 1: Oracle model — an LSTM encoding of the question and MLP embeddings of the spatial and category inputs feed a softmax over {yes, no, n.a.}.]
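The following PyTorch sketch mirrors the Oracle description above; the layer sizes and the exact fusion of the three embeddings are assumptions for illustration, not the hyper-parameters of de Vries et al. [2017].

```python
import torch
import torch.nn as nn

class Oracle(nn.Module):
    """Answers yes / no / n.a. given the question, the target category, and its spatial vector."""
    def __init__(self, vocab_size, num_categories, emb_dim=64, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)          # lookup table for question words
        self.question_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.category_emb = nn.Embedding(num_categories, emb_dim)  # lookup table for object categories
        self.mlp = nn.Sequential(                                  # one hidden layer, 3-way output
            nn.Linear(hidden + emb_dim + 8, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, question_tokens, category, spatial):
        # question_tokens: (B, T) word ids; category: (B,) ids; spatial: (B, 8) bounding-box vector
        _, (h, _) = self.question_lstm(self.word_emb(question_tokens))
        features = torch.cat([h[-1], self.category_emb(category), spatial], dim=-1)
        return self.mlp(features)    # logits of p(a | q, c*, x*_spatial)
```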
[Figure 2: Question-Generator model — a VGG image encoding and an LSTM encoding of the dialogue history initialize an LSTM decoder that generates the next question word by word.]

Question-Generator: The goal of the Question-Generator (QGen) is to ask the Oracle meaningful questions, q_{m+1}, given the whole image, I, and the historical question-answer pairs, (q, a)_{1:m}. In previous work [Strub et al., 2017], the state transition function was modelled as an LSTM, which was trained using whole dialogues so that the model memorizes the historical QAs. We refer to this as dialogue-level training. We develop a novel QGen architecture using a modified version of the Seq2Seq model [Sutskever et al., 2014]. The modified Seq2Seq model enables question-level training, which means that the model is fed with the historical QAs and then learns to produce a new question. Following [Strub et al., 2017], we first encode the whole image into a fixed-size feature vector using the VGG-net [Simonyan and Zisserman, 2014]. The features come from the fc-8 layer of the VGG-net. For processing the historical QAs, we use a lookup table to learn the word embeddings, then again use an LSTM encoder to encode the history information into a fixed-size latent representation, and concatenate it with the image representation:

s^enc_{m,N_m} = encoder(LSTM((q, a)_{1:m}), VGG(I)).

The encoder and decoder are coupled by initializing the decoder state with the last encoder state, mathematically, s^dec_{m+1,0} = s^enc_{m,N_m}. The LSTM decoder generates each word based on the concatenated representation and the previously generated word (note that the first word is a start token):

y_{m+1,n} = decoder(LSTM(y_{m+1,n−1}, s^dec_{m+1,n−1})).

The decoder shares the same lookup table weights as the encoder. The Seq2Seq model, consisting of the encoder and the decoder, is trained end-to-end to minimize the negative log-likelihood cost. During testing, the decoder gets a start token and the representation from the encoder, and then generates each word at each time step until it encounters a question mark token, QGen := p(y_{m+1,n} | (q, a)_{1:m}, I). The output is a complete question. After several question-answer rounds, the QGen outputs an end-of-dialogue token and stops asking questions. The overall structure of the QGen model is illustrated in Fig. 2.
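A rough PyTorch sketch of this question-level Seq2Seq QGen follows. The dimensions are illustrative, and fusing the VGG feature by projecting and adding it to the encoder state is a simplifying assumption; the paper concatenates the image and history representations.

```python
import torch
import torch.nn as nn

class QGenSeq2Seq(nn.Module):
    """Question-level Seq2Seq question generator: encode history + image, decode the next question."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, img_dim=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # lookup table shared by encoder and decoder
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)        # projects the VGG fc-8 feature
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history_tokens, vgg_features, question_in):
        # history_tokens: (B, T) ids of (q, a)_{1:m}; vgg_features: (B, img_dim); question_in: (B, T') ids
        _, (h_enc, c_enc) = self.encoder(self.emb(history_tokens))
        h0 = h_enc + self.img_proj(vgg_features).unsqueeze(0)      # fuse image and history (sketch choice)
        out, _ = self.decoder(self.emb(question_in), (h0, c_enc))  # decoder starts from the encoder state
        return self.out(out)                                       # per-step word logits
```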
[Figure 3: Guesser model — QA pairs are encoded into memory facts by an LSTM; object spatial and category embeddings form a key that attends over the facts in two hops before a softmax over the candidate objects.]

Guesser: The goal of the Guesser model is to find out which object the Oracle model is referring to, given the complete history of the dialogue and a list of objects in the image, p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The Guesser model has access to the spatial, x^O_spatial, and category information, c^O, of the objects in the list. The task of the Guesser model is challenging because it needs to understand the dialogue, focus on the important content, and then guess the object. To accomplish this task, we decided to integrate the Memory [Sukhbaatar et al., 2015] and Attention [Bahdanau et al., 2014] modules into the Guesser architecture used in the previous work [Strub et al., 2017]. First, we use an LSTM header to process the varying-length question-answer pairs in parallel into multiple fixed-size vectors. Each QA pair is encoded into a fact, Fact_m = LSTM((q, a)_m), and stored in a memory base. Next, we use the sum of the spatial and category embeddings of all objects as a key, Key_1 = MLP(x^O_spatial, c^O), to query the memory and calculate an attention mask, Attention_1(Fact_m) = Fact_m · Key_1, over each fact. Then, we use the sum of the attended facts and the first key to calculate a second key, and use the second key to query the memory base again to obtain a more accurate attention. These are the so-called "two hops" of attention in the literature [Sukhbaatar et al., 2015]. Finally, we compare the attended facts with each object embedding in the list using a dot product. The object most similar to these facts is the prediction, Guesser := p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The intention of using the attention module here is to find the most relevant descriptions or facts concerning the candidate objects. We train the whole Guesser network end-to-end using the negative log-likelihood criterion. A more graphical description of the Guesser model is shown in Fig. 3.

4.2 Reinforcement Learning

Now, we post-train the QGen and the Guesser model with reinforcement learning. We keep the Oracle model fixed. In each game episode, when the models find the correct object, r = 1; otherwise, r = 0. Next, we can assign credit for each action of the QGen and the Guesser models. In the case of the QGen model, we spread the reward uniformly over the sequence of actions in the episode. The baseline function, b, used here is the running average of the game success rate. Note that the Guesser model has only one action in each episode, i.e., making the guess. If the Guesser finds the correct object, then it gets an immediate reward and the Guesser's parameters are updated using the REINFORCE rule without baseline. The QGen is trained using the following four methods.

REINFORCE: The baseline method used here is REINFORCE [Williams, 1992]. During training, in the forward pass the words are sampled with τ = 1, y_{m+1,n} ∼ f(QGen(x | w)). In the backward pass, the weights are updated using REINFORCE, as

w = w + α (r − b) ∇_w ln f(y_{m+1,n} | QGen(x | w)).

Single-TPG: We use the temperature τ_global = 1.5 during training to encourage exploration, mathematically, y^{τ_global}_{m+1,n} ∼ f^{τ_global}(QGen(x | w)). In the backward pass, the weights are updated using

w = w + α (r − b) ∇_w ln f(y^{τ_global}_{m+1,n} | QGen(x | w)).

Parallel-TPG: For Parallel-TPG, we use two temperatures, τ_1 = 1.0 and τ_2 = 1.5, to encourage exploration. The words are sampled in the forward pass using y^{τ_1}_{m+1,n}, y^{τ_2}_{m+1,n} ∼ f^{τ_1,τ_2}(QGen(x | w)). In the backward pass, the weights are updated using

w = w + Σ_{k=1}^{2} α (r − b) ∇_w ln f(y^{τ_k}_{m+1,n} | QGen(x | w)).

Dynamic-TPG: The last method we evaluated is Dynamic-TPG. We use the heuristic function to calculate the temperature for each word at each time step:

τ^h_{m+1,n} = τ_min + (τ_max − τ_min) · (tf-idf(y*_{m+1,n}) − tf-idf_min) / (tf-idf_max − tf-idf_min),

where we set τ_min = 0.5, τ_max = 1.5, tf-idf_min = 0, and tf-idf_max = 8. After the calculation of τ^h_{m+1,n}, we substitute the value into the formula at each time step and sample the next word using

y^{τ^h_{m+1,n}}_{m+1,n} ∼ f^{τ^h_{m+1,n}}(QGen(x | w)).

In the backward pass, the weights are updated using

w = w + α (r − b) ∇_w ln f(y^{τ^h_{m+1,n}}_{m+1,n} | QGen(x | w)).

For all four methods, we use greedy search in evaluation.
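As a minimal sketch of one tempered word step for the QGen (our own PyTorch illustration, not the paper's Torch7 code): the word is sampled from the tempered distribution, while the gradient is taken on the untempered log-probability, as in the update rules above. Accumulating such per-word losses over an episode and scaling by (r − b) recovers the updates listed for the four methods.

```python
import torch

def tpg_word_step(logits, tau, reward, baseline):
    """One generated word: sample y^tau ~ f^tau, return a surrogate loss whose gradient is
    -(r - b) * grad ln f(y^tau | QGen(x | w)).  logits: 1-D tensor over the vocabulary."""
    with torch.no_grad():
        tempered = torch.softmax(logits / tau, dim=-1)   # f^tau (equivalent to the power form of Sec. 3.2)
        y = torch.multinomial(tempered, num_samples=1)   # y^tau ~ f^tau(QGen(x | w))
    log_f = torch.log_softmax(logits, dim=-1)            # ln f(. | QGen(x | w)) at tau = 1
    loss = -(reward - baseline) * log_f[y]
    return y.item(), loss
```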
5 Experiment

We first train all the networks in a supervised fashion and then further optimize the QGen and the Guesser model using reinforcement learning. The source code is available at https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient, which uses Torch7 [Collobert et al., 2011].

5.1 Pretraining

We train all three models with a dropout rate of 0.5 [Srivastava et al., 2014], using the ADAM optimizer [Kingma and Ba, 2014]. We use a learning rate of 0.0001 for the Oracle model and the Guesser model, and a learning rate of 0.001 for the QGen. All the models are trained for at most 30 epochs and early-stopped if there is no improvement on the validation set within five epochs. We report the performance on the test set, which consists of images not used in training. We report the game success rate as the performance metric for all three models, which equals the number of successful games divided by the total number of games. Compared to previous works [de Vries et al., 2017; Strub et al., 2017; Strub and de Vries, 2017], after supervised training, our models obtain a game success rate of 48.77%, which is about 4% higher than the state-of-the-art method [Strub and de Vries, 2017] with 44.6% accuracy.

5.2 Reinforcement Learning

We first initialize all models with the pretrained parameters from supervised learning and then post-train the QGen using either REINFORCE or TPG for 80 epochs. We update the parameters using stochastic gradient descent (SGD) with a learning rate of 0.001 and a batch size of 64. In each epoch, we sample each image in the training set once and randomly pick one of the objects as a target. We track the running average of the game success rate and use it directly as the baseline, b, in REINFORCE. We limit the maximum number of questions to 8 and the maximum number of words to 12. Simultaneously, we train the Guesser model using REINFORCE without baseline and SGD with a learning rate of 0.0001. The performance comparison between our baseline (#3) and the methods from the literature (#1 and #2) is shown in Tab. 1.

#  Method                        Accuracy
1  [Strub et al., 2017]          52.30%
2  [Strub and de Vries, 2017]    60.30%
3  REINFORCE                     69.66%
4  Single-TPG                    69.76%
5  Parallel-TPG                  73.86%
6  Dynamic-TPG                   74.31%

Table 1: Performance comparison of our methods with methods reported in the literature after reinforcement learning.
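A minimal sketch of the running-average success-rate baseline follows; the exponential-moving-average form and the momentum value are our assumptions, since the paper only states that the running average of the game success rate is used directly as b.

```python
class RunningBaseline:
    """Tracks a running average of the game success rate to use as the REINFORCE baseline b."""
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.value = 0.0

    def update(self, game_success):
        # game_success: 1.0 if the Guesser picked the correct object, else 0.0
        self.value = self.momentum * self.value + (1.0 - self.momentum) * game_success
        return self.value
```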
REINFORCE Baseline: From Tab. 1, we see that our models trained with REINFORCE (#3) are about 9% better than the state-of-the-art methods (#1 and #2). The improvements are due to advanced mechanisms and techniques such as the Seq2Seq structure in the QGen, the memory and attention mechanisms in the Guesser, and the training of the Guesser model with reinforcement learning. One important difference is that our QGen model is trained at the question level. This means that the model first learns to query meaningfully, step by step. Eventually, it learns to conduct a meaningful dialogue. Compared to directly learning to manage a strategic conversation, this bottom-up training procedure helps the model absorb knowledge, because it breaks large tasks down into smaller, more manageable pieces. This makes the learning for the QGen much easier. In the remainder of the section, we use our models, boosted with memory network, attention, and Seq2Seq and trained with REINFORCE, as a strong baseline and analyse the performance improvements achieved by the TPGs.

From Tab. 1, we see that, compared to REINFORCE (#3), Single-TPG (#4) with τ_global = 1.5 achieves a comparable performance. With two different temperatures, τ_1 = 1.0 and τ_2 = 1.5, Parallel-TPG (#5) achieves an improvement of approximately 4%, although it requires more computational resources. Compared to Parallel-TPG, Dynamic-TPG uses only the same computational power as REINFORCE and still gives a larger improvement by using a dynamic temperature, τ_t^h ∈ [0.5, 1.5]. Overall, the best model is Dynamic-TPG (#6), which gives a 4.65% improvement over our strong baseline. Here, we have shown that our proposed methods contribute orthogonally, in the sense that they further improve models already boosted with advanced modules such as memory network, attention, and Seq2Seq.

Table 2: Samples generated by our improved models using REINFORCE (left column: "Policy Gradient") and Dynamic-TPG (right column: "Tempered Policy Gradient"). The green bounding boxes highlight the target objects; the red boxes highlight the wrong guesses.

Game 1
Policy Gradient: Is it in left? No / Is it in front? No / Is it in right? Yes / Is it in middle? Yes / Is it person? No / Is it ball? No / Is it bat? No / Is it car? Yes — Status: Failure
Tempered Policy Gradient: Is it a person? No / Is it a vehicle? Yes / Is it a truck? Yes / Is it in front of photo? No / In the left half? No / In the middle of photo? Yes / Is it to the right photo? Yes / Is it in the middle of photo? Yes — Status: Success

Game 2
Policy Gradient: Is it in left? No / Is it in front? Yes / Is it in right? No / Is it in middle? Yes / Is it person? No / Is it giraffe? Yes / Is in middle? Yes / Is in middle? Yes — Status: Failure
Tempered Policy Gradient: Is it a giraffe? Yes / In front of photo? Yes / In the left half? Yes / Is it in the middle of photo? Yes / Is it to the left of photo? Yes / Is it to the right photo? No / In the left in photo? No / In the middle of photo? Yes — Status: Success

TPG Dialogue Samples: The generated dialogue samples in Tab. 2 give some interesting insights into why TPG methods give better results. First of all, the sentences generated by TPG-trained models are on average longer and use slightly more complex structures, such as "Is it in the middle of photo?" instead of the simple form "Is it in middle?". Secondly, TPGs enable the models to explore better and comprehend more words. For example, in the first game (upper half of Tab. 2), both models ask about the category. The REINFORCE-trained model can only ask with the single word "car" to query about the vehicle category. In contrast, the TPG-trained model first asks about a more general category with the word "vehicle" and follows up with a more specific category, "truck". These two words, "vehicle" and "truck", give much more information than the single word "car", and help the Guesser model identify the truck among many cars. Lastly, similar to the category case, the models trained with TPG can first ask about a larger spatial region of the object and then follow up with a smaller one. In the second game (lower half of Tab. 2), we see that the TPG-trained model first asks "In the left half?", which refers to all three giraffes in the left half, and the answer is "Yes". Then it asks "Is it to the left of photo?", which refers to the second giraffe from the left, and the answer is "Yes". Eventually the QGen asks "In the left in photo?", which refers to the leftmost giraffe, and the answer is "No". These specific questions about locations are not learned using REINFORCE. The REINFORCE-trained model can only ask a similar question with the word "left". In this task, there are many giraffes in the left part of the image. The top-down spatial questions help the Guesser model find the correct giraffe.
To summarize, the TPG-trained models use longer and more informative sentences than the REINFORCE-trained models.

6 Conclusion

Our paper makes two contributions. Firstly, by extending existing models with Seq2Seq and Memory Networks we could improve the performance of a goal-oriented dialogue system by 9%. Secondly, we introduced TPG, a novel class of temperature-based policy gradient approaches. TPGs boosted the performance of the goal-oriented dialogue systems by another 4.7%. Among the three TPGs, Dynamic-TPG gave the best performance, which helped the agent comprehend more words and produce more meaningful questions. TPG is a generic strategy to encourage word exploration on top of policy gradients and can be applied to any text-generating agent.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[Bordes and Weston, 2017] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations (ICLR), 2017.

[Carmel and Markovitch, 1999] David Carmel and Shaul Markovitch. Exploration strategies for model-based learning in multi-agent systems: Exploration strategies. Autonomous Agents and Multi-Agent Systems, 2(2):141–172, 1999.

[Collobert et al., 2011] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV), 2017.

[de Vries et al., 2017] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[Dhingra et al., 2016] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Leskovec et al., 2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[Lewis et al., 2017] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Liu et al., 2017] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[Nachum et al., 2016] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring under-appreciated rewards. arXiv preprint arXiv:1611.09321, 2016.

[Robert, 2004] Christian P. Robert. Monte Carlo Methods. Wiley Online Library, 2004.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Strub and de Vries, 2017] Florian Strub and Harm de Vries. GuessWhat?! models. https://github.com/GuessWhatGame/guesswhat/, 2017.

[Strub et al., 2017] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[Sukhbaatar et al., 2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Thrun, 1992] Sebastian B. Thrun. Efficient exploration in reinforcement learning. 1992.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.