    Jointly Learning to See, Ask, Decide when to Stop, and then GuessWhat

       Ravi Shekhar† , Alberto Testoni† , Raquel Fernández∗ and Raffaella Bernardi†
                       † University of Trento, ∗ University of Amsterdam

                ravi.shekhar@unitn.it alberto.testoni@unitn.it
              raquel.fernandez@uva.nl raffaella.bernardi@unitn.it



                        Abstract

We augment a task-oriented visual dialogue model with a decision-making module that decides which action needs to be performed next given the current dialogue state, i.e. whether to ask a follow-up question or stop the dialogue. We show that, on the GuessWhat?! game, the new module enables the agent to succeed at the game with shorter and hence less error-prone dialogues, despite a slight decrease in task accuracy. We argue that both dialogue quality and task accuracy are essential features to evaluate dialogue systems.1

1 Introduction

The development of conversational agents that ground language in visual information is a challenging problem that requires the integration of dialogue management skills with multimodal understanding. A common test-bed for making progress in this area is provided by guessing tasks where two dialogue participants interact with the goal of letting one of them guess a visual target (Das et al., 2017a; de Vries et al., 2017; Das et al., 2017b). We focus on the GuessWhat?! game, which consists in guessing a target object within an image that is visible to both participants. One participant (the Questioner) is tasked with identifying the target object by asking yes-no questions to the other participant (the Oracle), who is the only one who knows the target. Participants are free to go on with the task for as many turns as required.

Most models of the Questioner agent in the GuessWhat?! game consist of two disconnected modules, a Question Generator and a Guesser, which are trained independently with Supervised Learning or Reinforcement Learning (de Vries et al., 2017; Strub et al., 2017). In contrast, Shekhar et al. (2019) model these two modules jointly. They show that, thanks to its joint architecture, their Questioner model leads to dialogues with higher linguistic quality in terms of richness of the vocabulary and variability of the questions, while reaching a performance similar to the state of the art with Reinforcement Learning. They argue that achieving high task success is not the only criterion by which a visually-grounded conversational agent should be judged. Crucially, the dialogue should be coherent, with no unnatural repetitions or irrelevant questions. We claim that, to achieve this, a conversational agent needs to learn a strategy to decide how to respond at each dialogue turn, based on the dialogue history and the current context. In particular, the Questioner model has to learn when it has gathered enough information and is therefore ready to guess the target.

In this work, we extend the joint Questioner architecture proposed by Shekhar et al. (2019) with a decision-making component that decides whether to ask a follow-up question to identify the target referent, or to stop the conversation to make a guess. Shekhar et al. (2018) had added a similar module to the baseline architecture by de Vries et al. (2017). Here we show that the novel joint architecture by Shekhar et al. (2019) can also be augmented with a decision-making component and that this addition leads to further improvements in the quality of the dialogues. Our extended Questioner agent reaches a task success comparable to Shekhar et al. (2019), but it asks fewer questions, thus significantly reducing the number of games with repetitions.

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Task and Models

2.1 Task

The GuessWhat?! dataset2 was collected via Amazon Mechanical Turk by de Vries et al. (2017). The task involves two human participants who see a real-world image, taken from the MS-COCO dataset (Lin et al., 2014). One of the participants (the Oracle) is assigned a target object in the image, and the other participant (the Questioner) has to guess it by asking Yes/No questions to the Oracle. There are no time constraints to play the game. Once the Questioner is ready to make a guess, the list of candidate objects is provided and the game is considered successful if the Questioner picks the target object. The dataset consists of around 155k English dialogues about approximately 66k different images. Dialogues contain on average 5.2 question-answer pairs.

We use the same train (70%), validation (15%), and test (15%) splits as de Vries et al. (2017). The test set contains new images not seen during training. Following Shekhar et al. (2019), we use two experimental setups for the number of questions to be asked by the Questioner, motivated by prior work: 5 questions (5Q) as in de Vries et al. (2017), and 8 questions (8Q) as in Strub et al. (2017).

2 Available at https://guesswhat.ai/download.

[Figure 1: Proposed visually-grounded dialogue state encoder with a decision-making component. A ResNet-152 image encoding and the dialogue so far (e.g., "Is it a car? No / Is it a person? Yes / The man with the hat? No", with the Oracle providing answers) feed the Encoder, which produces a visually-grounded dialogue state; on this state the Decision Maker chooses between guessing (Guesser) and asking a further question Qt+1 via QGen (e.g., "Is it the batter?").]

2.2 Models

We focus on developing a Questioner agent able to decide when it has gathered enough information to identify the target object. We first describe the baseline model proposed by de Vries et al. (2017). Then we describe the model proposed by Shekhar et al. (2019) and extend it with a decision-making module.

Baseline  de Vries et al. (2017) model the Questioner agent of the GuessWhat?! game as two disjoint models, a Question Generator (QGen) and a Guesser, trained independently. After a fixed number of questions by QGen, the Guesser selects a candidate object.

QGen is implemented as a Recurrent Neural Network (RNN) with a transition function handled with Long Short-Term Memory (LSTM), on which a probabilistic sequence model is built with a Softmax classifier. Given the overall image (encoded by extracting its VGG features) and the current dialogue history (i.e., the previous sequence of questions and answers), QGen produces a representation of the visually grounded dialogue (the RNN's hidden state QH_{t-1} at time t-1 in the dialogue) that encodes information useful to generate the next question q_t. The best performing model of the Guesser by de Vries et al. (2017) represents candidate objects by their object category and spatial coordinates. These features are passed through a Multi-Layer Perceptron (MLP) to get an embedding for each object. The Guesser also takes as input the dialogue history processed by an LSTM, whose hidden state GH_{t-1} is of the same size as the MLP output. A dot product between both returns a score for each candidate object in the image.

Shekhar et al. (2018) extend the baseline architecture of de Vries et al. (2017) with a third model, a decision-making component that determines, after each question/answer pair, whether the QGen model should ask another question or whether the Guesser model should guess the target object.

Grounded Dialogue State Encoder (GDSE)  Shekhar et al. (2019) address one of the fundamental weaknesses of the Questioner model by de Vries et al. (2017), i.e., having two disconnected QGen and Guesser modules. They tackle this issue with a multi-task approach, where a common visually-grounded dialogue state encoder (GDSE) is used both to generate questions and to guess the target object. Two learning paradigms are explored: supervised learning (SL) and co-operative learning (CL). In SL, the Questioner model is trained using human data, while in CL it is trained on both generated and human data: first, the Guesser is trained on the generated questions and answers, and then the QGen is "readapted" using the human data. Their results show that training these two modules jointly improves the performance of the Questioner model, reaching a task success comparable to RL-based approaches (Strub et al., 2017).
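To make this design concrete, below is a minimal PyTorch sketch of a shared-encoder Questioner with the dot-product Guesser scoring described above. It is a sketch under stated assumptions: the vocabulary size, hidden sizes, word-embedding dimension, number of object categories, and the 2048-dimensional ResNet image features are illustrative choices, not the exact values used by de Vries et al. (2017) or Shekhar et al. (2019).

```python
import torch
import torch.nn as nn

class GDSE(nn.Module):
    """Sketch of a shared visually-grounded dialogue state encoder with
    its two task heads. All sizes are illustrative assumptions."""

    def __init__(self, vocab_size=5000, word_dim=300, hidden=512,
                 n_categories=90, cat_dim=256, spatial_dim=8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.dialogue_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        # Fuse dialogue and image features into the dialogue state h_t
        # (2048 assumes ResNet-152 pooled features, as in Figure 1).
        self.fuse = nn.Linear(hidden + 2048, hidden)
        # Guesser head: embed each candidate object from its category
        # and spatial coordinates, then score it against h_t.
        self.cat_emb = nn.Embedding(n_categories, cat_dim)
        self.obj_mlp = nn.Sequential(
            nn.Linear(cat_dim + spatial_dim, hidden), nn.ReLU())
        # QGen head: a decoder LSTM over the vocabulary (details omitted).
        self.decoder = nn.LSTM(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def encode(self, dialogue_tokens, image_feats):
        _, (h, _) = self.dialogue_lstm(self.word_emb(dialogue_tokens))
        return torch.tanh(self.fuse(torch.cat([h[-1], image_feats], -1)))

    def guess(self, h_t, categories, spatials):
        # categories: (batch, n_obj); spatials: (batch, n_obj, 8)
        objs = self.obj_mlp(torch.cat([self.cat_emb(categories),
                                       spatials], dim=-1))
        # Dot product returns one score per candidate object.
        return torch.bmm(objs, h_t.unsqueeze(-1)).squeeze(-1)
```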
Adding a Decision Making module (GDSE-DM)  We extend the GDSE model of Shekhar et al. (2019) with a decision-making component (DM). The DM determines whether QGen should ask a follow-up question or the Guesser should guess the target object, based on the image and the dialogue history. As shown in Figure 1, the DM component is modelled as a binary classifier that uses the visually-grounded dialogue state h_t to decide whether to ask or guess. It is implemented by a Multi-Layer Perceptron (MLP_d) trained together with the encoder with a negative log-likelihood loss:

    $L_D = -\log p(dec_{label})$    (1)

where dec_{label} is the decision label, i.e., 'ask' or 'guess'. The MLP_d consists of three hidden layers whose dimensions are 256, 64, and 16, respectively; after each hidden layer a ReLU non-linearity is applied.
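A minimal PyTorch sketch of this classifier follows. The three hidden layers and ReLUs are as described above; the input size of the dialogue state (512) and the two-logit output head are assumptions, since only the hidden layers are specified here.

```python
import torch
import torch.nn as nn

class DecisionMaker(nn.Module):
    """Sketch of MLP_d: hidden layers of 256, 64 and 16 units with a
    ReLU after each. The encoder state size (512) and the 2-logit
    output layer are illustrative assumptions."""

    def __init__(self, state_size: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 2),   # logits for the {ask, guess} decision
        )

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, state_size) visually-grounded dialogue state
        return self.mlp(h_t)
```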
To train the DM, we need decision labels. For the SL setting, we follow the label generation procedure introduced by Shekhar et al. (2018): decision labels are generated by annotating the last question-answer pair of each game with guess and all other question-answer pairs with ask. For the CL setting, we label the question/answer pairs based on whether the Guesser module is able to correctly predict the target object given the current dialogue. If the Guesser module is able to make a correct prediction after a given question/answer pair, we label that dialogue state with guess, and otherwise with ask. This process results in an unbalanced dataset for the DM, where the guess label makes up only 20% of states. We address this class imbalance by adding a weighting factor, α, to the loss. The balanced loss is given by

    $L_D = \alpha_{label} \cdot (-\log p(dec_{label}))$    (2)

where $\alpha_{guess} = 0.8$ and $\alpha_{ask} = 0.2$.
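Since the MLP_d outputs logits, Eq. (2) can be realized with a class-weighted cross-entropy (log-softmax followed by weighted NLL). A sketch; the label index convention (0 = ask, 1 = guess) is an assumption:

```python
import torch
import torch.nn as nn

# Eq. (2) as class-weighted cross-entropy; weights follow
# alpha_ask = 0.2 and alpha_guess = 0.8 (index order assumed).
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.2, 0.8]))

logits = torch.randn(32, 2)              # DM logits for a batch of 32 states
dec_labels = torch.randint(0, 2, (32,))  # gold decisions: 0 = ask, 1 = guess
loss = criterion(logits, dec_labels)
```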
The DM, for both SL and CL, is trained in a supervised manner with this cross-entropy loss, using the decision labels collected after each question/answer pair. During inference, the model continues to ask questions unless the DM chooses to end the conversation or the maximum number of questions has been reached.
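The resulting stopping behaviour can be summarised with a short inference loop. This is only a sketch: qgen, oracle, encoder, decision_maker, and guesser are hypothetical stand-ins for the modules described in Section 2.

```python
ASK, GUESS = 0, 1    # assumed label convention
MAX_QUESTIONS = 8    # 8Q setting (5 in the 5Q setting)

# Hypothetical modules: encoder, qgen, oracle, decision_maker, guesser.
h_t = encoder.initial_state(image)
for _ in range(MAX_QUESTIONS):
    question = qgen.generate(h_t)
    answer = oracle.answer(question)
    h_t = encoder.update(h_t, question, answer)
    if decision_maker(h_t).argmax(dim=-1).item() == GUESS:
        break        # the DM ends the conversation early
prediction = guesser.select(h_t, candidate_objects)
```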
The GDSE-DM model trained with SL and CL will be referred to as SL-DM and CL-DM, respectively. It has to be highlighted that the tasks of generating a question and guessing the target object are not equally challenging: while the Guesser has to learn the probability distribution over the set of possible objects in the image, QGen needs to fit the distribution of natural language words, which is a much harder task. As in Shekhar et al. (2019), we address this issue by making the learning schedule task-dependent using a modulo-n training setup. In the SL setting, n indicates after how many epochs of QGen training the Guesser is updated together with QGen; for CL, QGen is updated at every nth epoch, while the Guesser is updated at all other epochs. We found the optimal value of n to be 5 for both the SL and the CL setting. The models are trained for 100 epochs with the Adam optimizer and a learning rate of 0.0001, and we select the Questioner module with the best performance on the validation set.
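The modulo-n schedule itself reduces to a plain epoch loop. A sketch, where train_qgen_epoch, train_guesser_epoch, and the setting flag are hypothetical helpers standing in for one epoch of task-specific updates on the shared encoder:

```python
N = 5           # modulo value found optimal for both SL and CL
EPOCHS = 100    # trained with Adam, learning rate 0.0001
setting = "SL"  # or "CL"; hypothetical configuration flag

for epoch in range(1, EPOCHS + 1):
    if setting == "SL":
        train_qgen_epoch()            # QGen trained every epoch
        if epoch % N == 0:
            train_guesser_epoch()     # Guesser joins every n-th epoch
    else:                             # CL
        if epoch % N == 0:
            train_qgen_epoch()        # QGen only every n-th epoch
        else:
            train_guesser_epoch()     # Guesser on all other epochs
```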
3 Results

In this section, we report the task success accuracy of our GDSE-DM model, which extends the joint GDSE architecture with a decision-making component. Following Shekhar et al. (2019), to neutralize the effect of random sampling in CL training, we use 3 runs and report mean and standard deviation.

Table 1 gives an overview of the accuracy results obtained by the models. Our main goal is to show the effect of adding a DM module to the joint GDSE architecture. We therefore do not compare to other approaches that use RL.3 As we can see, adding a DM to the GDSE model decreases its accuracy by 0.5-1% in the supervised learning setting and by 4-5% in the cooperative learning setting.

Model         5Q              8Q
Baseline      41.2            40.7
GDSE-SL       47.8            49.7
GDSE-CL       53.7 (±.83)     58.4 (±.12)
GDSE-SL-DM    46.78           49.12
GDSE-CL-DM    49.77 (±1.16)   53.89 (±.24)

Table 1: Test set accuracy for each model (for setups with 5 and 8 questions).

3 For completeness, the RL model by Strub et al. (2017) has accuracy 56.2 (±24) and 56.3 (±.05) for the 5Q and 8Q settings, respectively.

We believe that the larger drop in accuracy of the CL-DM model can be attributed to the decision labels used by this model. In the SL-DM setting, the model is trained on human data, which leads to more reliable decision labels. In contrast, in the CL-DM setting, the model is trained on automatically generated data, which includes possible errors by both the QGen and the Oracle. Overall, this results in noisier dialogues. We think that, due to the accumulation of these errors, the decision labels of the generated dialogues deviate significantly from the human data and thus the DM fails to capture them.

Despite the drop in task success, the DM agent seems to be more efficient. Table 2 shows that the average number of questions asked by the DM-based models is lower: the GDSE model without a DM always asks the maximum number of questions allowed (either 5 or 8 questions), while, on average, the GDSE-DM agent asks around 3.8 to 5.5 questions, even when it is allowed to ask up to 8. As we shall see in the next section, this leads to dialogues that are more natural and less repetitive.

Model         5Q              8Q
GDSE-SL-DM    3.83            5.49
GDSE-CL-DM    4.02 (±0.10)    5.46 (±0.10)

Table 2: Average number of questions asked by the GDSE-DM models when the maximum number of questions is set to 5 or 8.
4 Analysis

In this section, we look into the advantage brought about by the DM in terms of the quality of the dialogues produced by the model.

Following Shekhar et al. (2019), we report statistics about the dialogues produced by the models with respect to lexical diversity (measured as type/token ratio over all games), question diversity (measured as the percentage of unique questions over all games), and the percentage of games with at least one repeated question (see Table 3).

              Lexical         Question        % Games with
Model         diversity       diversity       repeated Q's
Baseline      0.030            1.60           93.50
GDSE-SL       0.101           13.61           55.80
GDSE-CL       0.115 (±.02)    14.15 (±3.0)    52.19 (±4.7)
GDSE-SL-DM    0.047            1.62           42.47
GDSE-CL-DM    0.135 (±.02)    10.25 (±2.46)   32.51 (±6.45)
Humans        0.731           47.89           —

Table 3: Statistics of the linguistic output of all models with the 8Q setting compared to human dialogues in all test games.
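The three measures in Table 3 reduce to simple counts. A sketch over games represented as lists of question strings; the whitespace tokenization and the absence of normalization are assumptions, as the exact preprocessing is not specified here:

```python
def lexical_diversity(games):
    """Type/token ratio over the questions of all games."""
    tokens = [tok for game in games for q in game for tok in q.split()]
    return len(set(tokens)) / len(tokens)

def question_diversity(games):
    """Percentage of unique questions over all games."""
    questions = [q for game in games for q in game]
    return 100 * len(set(questions)) / len(questions)

def pct_games_with_repeated_questions(games):
    """Percentage of games with at least one repeated question."""
    repeated = sum(len(set(game)) < len(game) for game in games)
    return 100 * repeated / len(games)
```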
The main drawback of the models asking a fixed number of questions is that they repeat questions within the same dialogue. While the introduction of the joint GDSE architecture by Shekhar et al. (2019) substantially reduced the percentage of games with repeated questions with respect to the baseline model (from 93.5% to 52.19%), more than 50% of the dialogues still included repetitions, which make them unnatural. We can see how adding a DM component to GDSE addresses this important problem: with the CL-DM setting, the percentage of games with repeated questions goes down to 32.51% (a drop of 19.68 points, from 52.19 to 32.51). The reduction is also substantial for the SL-DM model (13.33 points, from 55.80 to 42.47), albeit less impressive.

Given that the number of questions asked by the DM-based models is lower (as shown in Table 2), it is to be expected that the lexical and question diversity of the dialogues produced by these models will also be somewhat lower. Indeed, we observe a rather significant drop in diversity for the SL-DM setting. The CL-DM model, on the other hand, is rather robust to diversity loss: in fact, lexical diversity increases slightly with respect to GDSE-CL (0.135 vs. 0.115 on average), while question diversity decreases by a couple of points only, remaining much higher than that of the baseline model.

Following Shekhar et al. (2019), we also looked into the distribution of the types of questions asked by the models. Questions are divided into two broad categories: entity questions (about the target object category, e.g., 'is it a animal?') and attribute questions (about the target object's properties, e.g., 'is it green one?'). Entity questions are further sub-divided into 'object category' and 'super-category'; attribute questions are divided into 'color', 'shape', 'size', 'texture', 'location', and 'action' questions. Table 4 provides the distribution of questions asked by the different models. Compared with their counterparts, the DM-based models ask more object questions. The SL-DM also significantly lowers the number of location questions (from 37.09 to 21.70), which are the type of question most commonly repeated by the various models, as shown by Shekhar et al. (2019). We also computed the Kullback-Leibler (KL) divergence to measure how the output of each model differs from the human distribution of fine-grained question classes.
[Figure 2: Game examples where most models succeed at guessing the target object (framed in the original figure). The marker [DM stops asking to guess] indicates the point in the dialogue where the DM component decides to stop asking questions and guess the target. Many of the questions asked after the decision point by the GDSE model without a DM are repeated, and thus do not add any extra information.

Example game 1:
Baseline [failure]: 1. is it a person? no | 2. is it a tree? no | 3. is it a tree? no | 4. is it a tree? no | 5. is it a tree? no | 6. is it a tree? no | 7. is it a tree? no | 8. is it a tree? no
GDSE-SL [success]: 1. is it a person? no | 2. is it a ski? yes | 3. is it on the person in the front? yes | [DM stops asking to guess] | 4. is it the left most person? yes | 5. both skis? yes | 6. both skis? yes | 7. the whole skis? yes | 8. the whole skis? yes
GDSE-CL [success]: 1. is it a person? no | 2. is it a tree? no | 3. is it a ski? yes | 4. is it both skis of the person in front? yes | [DM stops asking to guess] | 5. is it on the left? yes | 6. is it on the left? yes | 7. it is second to the right? no | 8. is it the one on the left? yes

Example game 2:
Baseline [success]: 1. is it the baby? no | 2. is it the toothbrush? yes | 3. is it the toothbrush? yes | 4. is it the toothbrush? yes | 5. is it the toothbrush? yes | 6. is it the toothbrush? yes | 7. is it the toothbrush? yes | 8. is it the entire toothbrush? yes
GDSE-SL [success]: 1. is it a person? no | 2. is it a toothbrush? yes | 3. is it the whole thing? yes | [DM stops asking to guess] | 4. the one in the hand? yes | 5. the one in the front? yes | 6. the whole thing? yes | 7. the whole of it? yes | 8. all of it? yes
GDSE-CL [success]: 1. is it a person? no | 2. is it a toothbrush? yes | [DM stops asking to guess] | 3. is it the one kid holding? yes | 4. is it the one left of the baby? no | 5. is it next to the one with the red handle? yes | 6. is it next to the baby? no | 7. is it one next to the baby? no | 8. next to the whole object? yes]


We can see that the GDSE-DM models have a comparatively higher degree of divergence from the human distribution than GDSE, in particular the SL-DM model, which asks a substantially larger proportion of entity questions.
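The KL figures in Table 4 can be computed directly from the question-type frequencies. A sketch; the direction of the divergence, KL(model || human), is an assumption, since the table only says "KL wrt Human":

```python
import math

def kl_from_human(model_dist, human_dist, eps=1e-12):
    """KL(model || human) over fine-grained question-type frequencies.
    Inputs are dicts mapping question type -> proportion; the
    divergence direction is an assumed reading of 'KL wrt Human'."""
    return sum(p * math.log(p / max(human_dist.get(t, 0.0), eps))
               for t, p in model_dist.items() if p > 0)
```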
Question type    BL      SL      CL      SL-DM   CL-DM   H
entity           49.00   48.07   46.51   71.03   51.36   38.11
 super-cat       19.6    12.38   12.58   15.35   15.40   14.51
 object          29.4    35.70   33.92   55.68   35.97   23.61
attribute        49.88   46.64   47.60   27.27   45.21   53.29
 color            2.75   13.00   12.51   10.57    8.41   15.50
 shape            0.00    0.01    0.02    0.0     0.07    0.30
 size             0.02    0.33    0.39    0.01    0.67    1.38
 texture          0.00    0.13    0.15    0.01    0.25    0.89
 location        47.25   37.09   38.54   21.70   39.92   40.00
 action           1.34    7.97    7.60    3.96    8.01    7.59
Not classified    1.12    5.28    5.90    1.70    3.43    8.60
KL wrt Human      0.953   0.042   0.038   1.48    0.055   —

Table 4: Percentage of questions per question type in all the test set games played by humans (H) and the models with the 8Q setting, and KL divergence from the human distribution of fine-grained question types.

The sample dialogues in Figure 2 provide a qualitative illustration of the output of our models, showing how the DM-based Questioner stops asking questions when it has enough information to guess the target object.

5 Conclusion

We have enriched the Questioner agent in the goal-oriented dialogue game GuessWhat?! with a Decision Making (DM) component. Based on the visually grounded dialogue state, our Questioner model learns whether to ask a follow-up question or to stop the conversation to guess the target object. We show that the dialogues produced by our model contain fewer repetitions and fewer unnecessary questions, thus potentially leading to more efficient and less unnatural interactions; unnatural repetition is a well-known limitation of current visual dialogue systems. As in Shekhar et al. (2018), where a simple baseline model was extended with a DM component, task accuracy slightly decreases while the quality of the dialogues increases.

A first attempt to partially tackle this issue within the GuessWhat?! game was made by Strub et al. (2017), who added a <stop> token to the vocabulary of the question generator module to learn when to stop asking questions using Reinforcement Learning. This is a problematic approach, as it requires the QGen to generate probabilities over a non-linguistic token; further, the decision to ask more questions or guess is a binary decision, and it is thus not desirable to incorporate it within the large softmax output of the QGen.
                                                                                image-guessing game introduced by Chattopad-
hyay et al. (2017) using the VisDial dataset (Das       Language (ViGIL).      ArXiv:1802.03881. Last
et al., 2017a). The first RL layer is a module that     version Feb. 2018.
learns to decide when to stop asking questions.       T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Per-
We believe that a decision making component for          ona, D. Ramanan, Dollar, P., and C. L. Zitnick.
the GuessWhich game is an ill-posed problem. In          2014. Microsoft COCO: Common objects in
this game, the Questioner does not see the pool of       context. In Proceedings of ECCV (European
candidate images while carrying out the dialogue.        Conference on Computer Vision).
Hence, it will never know when it has gathered
                                                      Ravi Shekhar, Tim Baumgärtner, Aashish
enough information to distinguish the target im-
                                                        Venkatesh, Elia Bruni, Raffaella Bernardi,
age from the distractors. In any case, our work
                                                        and Raquel Fernández. 2018. Ask no more:
shows that a simple approach can be used to aug-
                                                        Deciding when to guess in referential vi-
ment visually-grounded dialogue systems with a
                                                        sual dialogue. In Proceedings of the 27th
DM without having to use the high complexity of
                                                        International Conference on Computational
RL paradigms.
                                                        Linguistics (COLING), pages 1218–1233.
   Task accuracy and dialogue quality are equally
important aspects of visually-grounded dialogue       Ravi Shekhar, Aashish Venkatesh, Tim Baumgärt-
systems. It remains to be seen how such sys-            ner, Elia Bruni, Barbara Plank, Raffaella
tems can reach higher task accuracy while profit-       Bernardi, and Raquel Fernández. 2019. Beyond
ing from the better quality that DM-based models        task success: A closer look at jointly learning to
produce.                                                see, ask, and guesswhat. In NAACL.
                                                      Florian Strub, Harm de Vries, Jeremie Mary, Bi-
References

Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating visual conversational agents via cooperative human-AI games. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP).

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the International Conference on Computer Vision (ICCV).

Zhang Jiaping, Zhao Tiancheng, and Yu Zhou. 2018. Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. In Proceedings of the SIGdial Conference, pages 140-150. Association for Computational Linguistics.

Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang. 2017. Answerer in questioner's mind for goal-oriented visual dialogue. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL). ArXiv:1802.03881, last version Feb. 2018.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV).

Ravi Shekhar, Tim Baumgärtner, Aashish Venkatesh, Elia Bruni, Raffaella Bernardi, and Raquel Fernández. 2018. Ask no more: Deciding when to guess in referential visual dialogue. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1218-1233.

Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, and Raquel Fernández. 2019. Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat. In Proceedings of NAACL.

Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, Jianfeng Lu, and Anton van den Hengel. 2018. Goal-oriented visual question generation via intermediate rewards. In Proceedings of the European Conference on Computer Vision (ECCV).

Yan Zhu, Shaoting Zhang, and Dimitris Metaxas. 2017. Interactive reinforcement learning for object grounding via self-talking. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL).