=Paper=
{{Paper
|id=Vol-2481/paper66
|storemode=property
|title=Jointly Learning to See, Ask, Decide when to Stop, and then GuessWhat
|pdfUrl=https://ceur-ws.org/Vol-2481/paper66.pdf
|volume=Vol-2481
|authors=Ravi Shekhar,Alberto Testoni,Raquel Fernández,Raffaella Bernardi
|dblpUrl=https://dblp.org/rec/conf/clic-it/ShekharTFB19
}}
==Jointly Learning to See, Ask, Decide when to Stop, and then GuessWhat==
Jointly Learning to See, Ask, Decide when to Stop, and then GuessWhat
Ravi Shekhar† , Alberto Testoni† , Raquel Fernández∗ and Raffaella Bernardi†
† University of Trento, ∗ University of Amsterdam
ravi.shekhar@unitn.it alberto.testoni@unitn.it
raquel.fernandez@uva.nl raffaella.bernardi@unitn.it
Abstract

We augment a task-oriented visual dialogue model with a decision-making module that decides which action needs to be performed next given the current dialogue state, i.e. whether to ask a follow-up question or stop the dialogue. We show that, on the GuessWhat?! game, the new module enables the agent to succeed at the game with shorter and hence less error-prone dialogues, despite a slight decrease in task accuracy. We argue that both dialogue quality and task accuracy are essential features to evaluate dialogue systems.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The development of conversational agents that ground language in visual information is a challenging problem that requires the integration of dialogue management skills with multimodal understanding. A common test-bed to make progress in this area are guessing tasks where two dialogue participants interact with the goal of letting one of them guess a visual target (Das et al., 2017a; de Vries et al., 2017; Das et al., 2017b). We focus on the GuessWhat?! game, which consists in guessing a target object within an image that is visible to both participants. One participant (the Questioner) is tasked with identifying the target object by asking yes-no questions to the other participant (the Oracle), who is the only one who knows the target. Participants are free to go on with the task for as many turns as required.

Most models of the Questioner agent in the GuessWhat?! game consist of two disconnected modules, a Question Generator and a Guesser, which are trained independently with Supervised Learning or Reinforcement Learning (de Vries et al., 2017; Strub et al., 2017). In contrast, Shekhar et al. (2019) model these two modules jointly. They show that, thanks to its joint architecture, their Questioner model leads to dialogues with higher linguistic quality in terms of richness of the vocabulary and variability of the questions, while reaching a performance similar to the state of the art with Reinforcement Learning. They argue that achieving high task success is not the only criterion by which a visually-grounded conversational agent should be judged. Crucially, the dialogue should be coherent, with no unnatural repetitions nor irrelevant questions. We claim that, to achieve this, a conversational agent needs to learn a strategy to decide how to respond at each dialogue turn, based on the dialogue history and the current context. In particular, the Questioner model has to learn when it has gathered enough information and is therefore ready to guess the target.

In this work, we extend the joint Questioner architecture proposed by Shekhar et al. (2019) with a decision-making component that decides whether to ask a follow-up question to identify the target referent, or to stop the conversation to make a guess. Shekhar et al. (2018) had added a similar module to the baseline architecture by de Vries et al. (2017). Here we show that the novel joint architecture by Shekhar et al. (2019) can also be augmented with a decision-making component and that this addition leads to further improvements in the quality of the dialogues. Our extended Questioner agent reaches a task success comparable to Shekhar et al. (2019), but it asks fewer questions, thus significantly reducing the number of games with repetitions.
2 Task and Models

2.1 Task

The GuessWhat?! dataset² was collected via Amazon Mechanical Turk by de Vries et al. (2017). The task involves two human participants who see a real-world image, taken from the MS-COCO dataset (Lin et al., 2014). One of the participants (the Oracle) is assigned a target object in the image and the other participant (the Questioner) has to guess it by asking Yes/No questions to the Oracle. There are no time constraints to play the game. Once the Questioner is ready to make a guess, the list of candidate objects is provided and the game is considered successful if the Questioner picks the target object. The dataset consists of around 155k English dialogues about approximately 66k different images. Dialogues contain on average 5.2 question-answer pairs.

² Available at https://guesswhat.ai/download.

We use the same train (70%), validation (15%), and test (15%) splits as de Vries et al. (2017). The test set contains new images not seen during training. Following Shekhar et al. (2019), we use two experimental setups for the number of questions to be asked by the Questioner, motivated by prior work: 5 questions (5Q) as in de Vries et al. (2017), and 8 questions (8Q) as in Strub et al. (2017).

2.2 Models

We focus on developing a Questioner agent able to decide when it has gathered enough information to identify the target object. We first describe the baseline model proposed by de Vries et al. (2017). Then we describe the model proposed by Shekhar et al. (2019) and extend it with a decision-making module.

Figure 1: Proposed visually-grounded dialogue state encoder with a decision-making component. [Schematic: ResNet-152 image features and the dialogue history feed the encoder, whose visually-grounded dialogue state h_t goes to the Decision Maker; on 'ask' QGen generates the next question q_t+1, on 'guess' the Guesser picks the target among the candidate objects, with the Oracle providing the answers.]

Baseline. de Vries et al. (2017) model the Questioner agent of the GuessWhat?! game as two disjoint models, a Question Generator (QGen) and a Guesser, trained independently. After a fixed number of questions by QGen, the Guesser selects a candidate object.

QGen is implemented as a Recurrent Neural Network (RNN) with a transition function handled by a Long Short-Term Memory (LSTM), on which a probabilistic sequence model is built with a Softmax classifier. Given the overall image (encoded by extracting its VGG features) and the current dialogue history (i.e., the previous sequence of questions and answers), QGen produces a representation of the visually grounded dialogue (the RNN's hidden state QH_{t-1} at time t-1 in the dialogue) that encodes information useful to generate the next question q_t. The best performing model of the Guesser by de Vries et al. (2017) represents candidate objects by their object category and spatial coordinates. These features are passed through a Multi-Layer Perceptron (MLP) to get an embedding for each object. The Guesser also takes as input the dialogue history processed by an LSTM, whose hidden state GH_{t-1} is of the same size as the MLP output. A dot product between both returns a score for each candidate object in the image.
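To make the Guesser's scoring mechanism concrete, the following is a minimal PyTorch-style sketch of the dot-product scoring just described. The class name and all dimensions (e.g., the sizes of the category embedding, spatial features, and hidden state) are illustrative assumptions, not the exact configuration of de Vries et al. (2017).

```python
import torch
import torch.nn as nn

class GuesserSketch(nn.Module):
    """Schematic Guesser: score candidate objects against the dialogue state.

    Each candidate is a category embedding concatenated with spatial features;
    an MLP maps it to the size of the dialogue LSTM's hidden state, and a dot
    product yields one score per candidate (all sizes are illustrative)."""

    def __init__(self, num_categories=90, cat_emb_dim=256, spatial_dim=8,
                 hidden_dim=512, vocab_size=5000, word_emb_dim=300):
        super().__init__()
        self.cat_emb = nn.Embedding(num_categories, cat_emb_dim)
        self.obj_mlp = nn.Sequential(
            nn.Linear(cat_emb_dim + spatial_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.word_emb = nn.Embedding(vocab_size, word_emb_dim)
        self.dialogue_lstm = nn.LSTM(word_emb_dim, hidden_dim, batch_first=True)

    def forward(self, dialogue_tokens, obj_categories, obj_spatial):
        # dialogue_tokens: (batch, seq_len); obj_*: (batch, num_objects, ...)
        _, (h_n, _) = self.dialogue_lstm(self.word_emb(dialogue_tokens))
        dialogue_state = h_n[-1]                        # (batch, hidden_dim)
        obj_feats = torch.cat([self.cat_emb(obj_categories), obj_spatial], dim=-1)
        obj_embs = self.obj_mlp(obj_feats)              # (batch, num_objects, hidden_dim)
        # Dot product between each object embedding and the dialogue state.
        scores = torch.bmm(obj_embs, dialogue_state.unsqueeze(-1)).squeeze(-1)
        return scores                                   # (batch, num_objects)
```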
Shekhar et al. (2018) extend the baseline architecture of de Vries et al. (2017) with a third model, a decision-making component that determines, after each question/answer pair, whether the QGen model should ask another question or whether the Guesser model should guess the target object.

Grounded Dialogue State Encoder (GDSE). Shekhar et al. (2019) address one of the fundamental weaknesses of the Questioner model by de Vries et al. (2017), i.e., having two disconnected QGen and Guesser modules. They tackle this issue with a multi-task approach, where a common visually-grounded dialogue state encoder (GDSE) is used both to generate questions and to guess the target object. Two learning paradigms are explored: supervised learning (SL) and co-operative learning (CL). In SL, the Questioner model is trained using human data, while in CL it is trained on both generated and human data: first, the Guesser is trained on the generated questions and answers, and then the QGen is "readapted" using the human data. Their results show that training these two modules jointly improves the performance of the Questioner model, reaching a task success comparable to RL-based approaches (Strub et al., 2017).
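As a rough illustration of the multi-task idea, the sketch below shows a single visually-grounded encoder whose state is shared by a question-generation head and the guessing component. It is a simplified schematic under our own assumptions (layer sizes, a linear fusion of image and language features), not the actual GDSE implementation of Shekhar et al. (2019).

```python
import torch
import torch.nn as nn

class SharedEncoderSketch(nn.Module):
    """Schematic shared encoder: one visually-grounded dialogue state h_t
    feeds both a QGen head and a Guesser head (multi-task setup)."""

    def __init__(self, vocab_size=5000, word_dim=300, img_dim=2048, state_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.dialogue_lstm = nn.LSTM(word_dim, state_dim, batch_first=True)
        # Fuse image features (e.g., from a pretrained CNN) with the language
        # state into a single visually-grounded dialogue state h_t.
        self.fusion = nn.Linear(state_dim + img_dim, state_dim)
        self.qgen_head = nn.Linear(state_dim, vocab_size)   # next-word logits
        # A Guesser head would score candidate objects against h_t
        # (cf. the Guesser sketch above).

    def encode(self, dialogue_tokens, image_feats):
        _, (h_n, _) = self.dialogue_lstm(self.word_emb(dialogue_tokens))
        h_t = torch.tanh(self.fusion(torch.cat([h_n[-1], image_feats], dim=-1)))
        return h_t   # shared state used by QGen, the Guesser, and later the DM
```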
Adding a Decision Making module (GDSE-DM). We extend the GDSE model of Shekhar et al. (2019) with a decision-making component (DM). The DM determines whether QGen should ask a follow-up question or the Guesser should guess the target object, based on the image and dialogue history. As shown in Figure 1, the DM component is modelled as a binary classifier that uses the visually-grounded dialogue state h_t to decide whether to ask or guess. It is implemented by a Multi-Layer Perceptron (MLP_d) trained together with the encoder using a negative log-likelihood loss:

    L_D = -log p(dec_label)    (1)

where dec_label is the decision label, i.e., 'ask' or 'guess'. The MLP_d consists of three hidden layers whose dimensions are 256, 64, and 16, respectively; after each hidden layer a ReLU non-linearity is applied.
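The decision head itself is small. Below is a minimal sketch of a binary ask/guess classifier with the hidden sizes reported above (256, 64, 16, each followed by a ReLU); the input size of the dialogue state and the use of log-softmax outputs are illustrative choices on our part.

```python
import torch.nn as nn

def build_decision_maker(state_dim=512):
    """Binary ask/guess classifier over the visually-grounded dialogue state h_t.

    Hidden sizes (256, 64, 16) follow the text; `state_dim` and the
    log-softmax output (paired with an NLL-style loss, cf. Eq. 1) are
    illustrative assumptions."""
    return nn.Sequential(
        nn.Linear(state_dim, 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, 16), nn.ReLU(),
        nn.Linear(16, 2),            # logits for 'ask' (0) and 'guess' (1)
        nn.LogSoftmax(dim=-1),
    )
```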
To train the DM, we need decision labels. For the SL setting, we follow the label generation procedure introduced by Shekhar et al. (2018): decision labels are generated by annotating the last question-answer pair of each game with guess and all other question-answer pairs with ask. For the CL setting, we label the question/answer pairs based on whether the Guesser module is able to correctly predict the target object given the current dialogue: if the Guesser module makes a correct prediction after a given question/answer pair, we label that dialogue state with guess, and otherwise with ask. This process results in an unbalanced dataset for the DM, where the guess label makes up only 20% of states. We address this class imbalance by adding a weighting factor, α, to the loss. The balanced loss is given by

    L_D = α_label · (-log p(dec_label))    (2)

where α_guess = 0.8 and α_ask = 0.2.

The DM, for both SL and CL, is trained with a cross-entropy loss in a supervised manner, using the decision labels after each question/answer pair. During inference, the model continues to ask questions unless the DM chooses to end the conversation or the maximum number of questions has been reached. The GDSE-DM model trained with SL and CL will be referred to as SL-DM and CL-DM, respectively.
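A minimal sketch of the CL label-generation procedure and of the class-weighted loss of Eq. (2) is given below. The Guesser interface and the game data structures are hypothetical placeholders; only the α weights (0.8 for guess, 0.2 for ask) come from the text.

```python
import torch
import torch.nn.functional as F

ASK, GUESS = 0, 1
ALPHA = torch.tensor([0.2, 0.8])   # alpha_ask = 0.2, alpha_guess = 0.8

def cl_decision_labels(game, guesser):
    """CL setting: label a turn 'guess' if the Guesser already picks the
    target given the dialogue up to that turn, and 'ask' otherwise.
    `game.turns`, `game.target_id` and `guesser.predict` are hypothetical."""
    labels = []
    for t in range(1, len(game.turns) + 1):
        predicted_id = guesser.predict(game.image, game.turns[:t])
        labels.append(GUESS if predicted_id == game.target_id else ASK)
    return labels

def dm_loss(log_probs, labels):
    """Class-weighted NLL over the DM's log-softmax outputs (Eq. 2)."""
    return F.nll_loss(log_probs, labels, weight=ALPHA)
```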
It has to be highlighted that the tasks of generating a question and guessing the target object are not equally challenging: while the Guesser has to learn the probability distribution over the set of possible objects in the image, QGen needs to fit the distribution of natural language words, which is a much harder task. As in Shekhar et al. (2019), we address this issue by making the learning schedule task-dependent using a modulo-n training setup. In the SL setting, n indicates after how many epochs of QGen training the Guesser is updated together with QGen; for CL, QGen is updated at every n-th epoch, while the Guesser is updated at all other epochs. We found the optimal value of n to be 5 for both the SL and the CL setting. The models are trained for 100 epochs with the Adam optimizer and a learning rate of 0.0001, and we select the Questioner module with the best performance on the validation set.
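The sketch below illustrates one way to implement the modulo-n schedule just described (here with n = 5 and 100 epochs, as in our setup). The per-epoch training functions are passed in as placeholders; the optimizer (Adam, learning rate 0.0001) is assumed to live inside them.

```python
def train_modulo_n(step_qgen, step_guesser, setting="SL", n=5, epochs=100):
    """Task-dependent schedule: in SL, QGen is trained every epoch and the
    Guesser joins in every n-th epoch; in CL, QGen is updated only every
    n-th epoch and the Guesser in all other epochs.

    `step_qgen` / `step_guesser` are callables running one training epoch
    for the corresponding task (placeholders in this sketch)."""
    for epoch in range(1, epochs + 1):
        if setting == "SL":
            step_qgen(epoch)
            if epoch % n == 0:
                step_guesser(epoch)
        else:  # CL
            if epoch % n == 0:
                step_qgen(epoch)
            else:
                step_guesser(epoch)
```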
3 Results

In this section, we report the task success accuracy of our GDSE-DM model, which extends the joint GDSE architecture with a decision-making component. Following Shekhar et al. (2019), to neutralize the effect of random sampling in CL training, we use 3 runs and report mean and standard deviation.

Model         5Q             8Q
Baseline      41.2           40.7
GDSE-SL       47.8           49.7
GDSE-CL       53.7 (±.83)    58.4 (±.12)
GDSE-SL-DM    46.78          49.12
GDSE-CL-DM    49.77 (±1.16)  53.89 (±.24)

Table 1: Test set accuracy for each model (for setups with 5 and 8 questions).

Table 1 gives an overview of the accuracy results obtained by the models. Our main goal is to show the effect of adding a DM module to the joint GDSE architecture. We therefore do not compare to other approaches that use RL.³ As we can see, adding a DM to the GDSE model decreases its accuracy by 0.5-1% in the supervised learning setting and by 4-5% in the cooperative learning setting.

³ For completeness, the RL model by Strub et al. (2017) has accuracy 56.2 (±24) and 56.3 (±.05) for the 5Q and 8Q settings, respectively.
We believe that the higher drop in accuracy of the CL-DM model can be attributed to the decision labels used by this model. In the SL-DM setting, the model is trained on human data, which leads to more reliable decision labels. In contrast, in the CL-DM setting, the model is trained on automatically generated data, which includes possible errors by both the QGen and the Oracle. Overall, this results in noisier dialogues. We think that, due to the accumulation of these errors, the decision labels of the generated dialogues deviate significantly from the human data and thus the DM fails to capture them.

Despite the drop in task success, the DM agent seems to be more efficient. Table 2 shows that the average number of questions asked by the DM-based models is lower: the GDSE model without a DM always asks the maximum number of questions allowed (either 5 or 8 questions), while, on average, the GDSE-DM agent asks around 3.8 to 5.5 questions, even when it is allowed to ask up to 8. As we shall see in the next section, this leads to dialogues that are more natural and less repetitive.

Model         5Q            8Q
GDSE-SL-DM    3.83          5.49
GDSE-CL-DM    4.02 (±0.10)  5.46 (±0.10)

Table 2: Average number of questions asked by the GDSE-DM models when the maximum number of questions is set to 5 or 8.

4 Analysis

In this section, we look into the advantage brought about by the DM in terms of the quality of the dialogues produced by the model.

Following Shekhar et al. (2019), we report statistics about the dialogues produced by the models with respect to lexical diversity (measured as type/token ratio over all games), question diversity (measured as the percentage of unique questions over all games), and the percentage of games with at least one repeated question (see Table 3).

Model        Lexical diversity  Question diversity  % Games with repeated Q's
Baseline     0.030              1.60                93.50
GDSE-SL      0.101              13.61               55.80
GDSE-CL      0.115 (±.02)       14.15 (±3.0)        52.19 (±4.7)
GDSE-SL-DM   0.047              1.62                42.47
GDSE-CL-DM   0.135 (±.02)       10.25 (±2.46)       32.51 (±6.45)
Humans       0.731              47.89               —

Table 3: Statistics of the linguistic output of all models with the 8Q setting compared to human dialogues in all test games.
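The three dialogue-quality statistics in Table 3 are straightforward to compute; a minimal sketch, with dialogues represented simply as lists of question strings, is given below. Normalisation details such as lowercasing and whitespace tokenisation are our own assumptions.

```python
def dialogue_quality_stats(games):
    """games: list of dialogues, each a list of question strings.
    Returns (lexical diversity, question diversity, % games with repeated Q's)."""
    all_tokens, all_questions = [], []
    games_with_repetition = 0
    for questions in games:
        qs = [q.lower().strip() for q in questions]     # naive normalisation
        all_questions.extend(qs)
        for q in qs:
            all_tokens.extend(q.split())
        if len(set(qs)) < len(qs):                      # at least one repeated question
            games_with_repetition += 1
    lexical_diversity = len(set(all_tokens)) / len(all_tokens)               # type/token ratio
    question_diversity = 100 * len(set(all_questions)) / len(all_questions)  # % unique questions
    pct_repeated = 100 * games_with_repetition / len(games)
    return lexical_diversity, question_diversity, pct_repeated
```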
The main drawback of models asking a fixed number of questions is that they repeat questions within the same dialogue. While the introduction of the joint GDSE architecture by Shekhar et al. (2019) substantially reduced the percentage of games with repeated questions with respect to the baseline model (from 93.5% to 52.19%), more than 50% of dialogues still included repetitions, which make them unnatural. We can see how adding a DM component to GDSE addresses this important problem: with the CL-DM setting, the percentage of games with repeated questions goes down to 32.51% (a reduction of 19.68 points, from 52.19 to 32.51). The reduction is also substantial for the SL-DM model (13.33 points, from 55.80 to 42.47), albeit less impressive.

Given that the number of questions asked by the DM-based models is lower (as shown in Table 2), it is to be expected that the lexical and question diversity of the dialogues produced by these models will also be somewhat lower. Indeed, we observe a rather significant drop in diversity for the SL-DM setting. The CL-DM model, on the other hand, is rather robust to diversity loss: in fact, lexical diversity increases slightly with respect to GDSE-CL (0.135 vs. 0.115 on average), while question diversity decreases by a couple of points only, remaining much higher than that of the baseline model.

Following Shekhar et al. (2019), we also looked into the distribution of the types of questions asked by the models. Questions are divided into two broad categories: entity questions (about the target object category, e.g., 'is it an animal?') and attribute questions (about a property of the target object, e.g., 'is it the green one?'). Entity questions are further sub-divided into 'object category' and 'super-category' questions; attribute questions are divided into 'color', 'shape', 'size', 'texture', 'location', and 'action' questions. Table 4 provides the distribution of questions asked by the different models. Compared with their counterparts, the DM-based models ask more object questions. The SL-DM model also significantly lowers the number of location questions (from 37.09 to 21.70), which are the type of question most commonly repeated by the various models, as shown by Shekhar et al. (2019).
Example game 1

Baseline [failure]:
1. is it a person? no
2. is it a tree? no
3. is it a tree? no
4. is it a tree? no
5. is it a tree? no
6. is it a tree? no
7. is it a tree? no
8. is it a tree? no

GDSE-SL [success]:
1. is it a person? no
2. is it a ski? yes
3. is it on the person in the front? yes
[DM stops asking to guess]
4. is it the left most person? yes
5. both skis? yes
6. both skis? yes
7. the whole skis? yes
8. the whole skis? yes

GDSE-CL [success]:
1. is it a person? no
2. is it a tree? no
3. is it a ski? yes
4. is it both skis of the person in front? yes
[DM stops asking to guess]
5. is it on the left? yes
6. is it on the left? yes
7. it is second to the right? no
8. is it the one on the left? yes

Example game 2

Baseline [success]:
1. is it the baby? no
2. is it the toothbrush? yes
3. is it the toothbrush? yes
4. is it the toothbrush? yes
5. is it the toothbrush? yes
6. is it the toothbrush? yes
7. is it the toothbrush? yes
8. is it the entire toothbrush? yes

GDSE-SL [success]:
1. is it a person? no
2. is it a toothbrush? yes
3. is it the whole thing? yes
[DM stops asking to guess]
4. the one in the hand? yes
5. the one in the front? yes
6. the whole thing? yes
7. the whole of it? yes
8. all of it? yes

GDSE-CL [success]:
1. is it a person? no
2. is it a toothbrush? yes
[DM stops asking to guess]
3. is it the one kid holding? yes
4. is it the one left of the baby? no
5. is it next to the one with the red handle? yes
6. is it next to the baby? no
7. is it one next to the baby? no
8. next to the whole object? yes

Figure 2: Game examples where most models succeed at guessing the target object. The line "[DM stops asking to guess]" marks the point in the dialogue where the DM component decides to stop asking questions and guess the target. Many of the questions asked after the decision point by the GDSE model without a DM are repeated, and thus do not add any extra information.
We also computed the Kullback-Leibler (KL) divergence to measure how the output of each model differs from the human distribution of fine-grained question classes. We can see that the GDSE-DM models have a comparatively higher degree of divergence than GDSE, in particular the SL-DM model, which asks a substantially larger proportion of entity questions.

The sample dialogues in Figure 2 provide a qualitative illustration of the output of our models, showing how the DM-based Questioner stops asking questions when it has enough information to guess the target object.

Question type    BL     SL     CL     SL-DM  CL-DM  H
entity           49.00  48.07  46.51  71.03  51.36  38.11
  super-cat      19.6   12.38  12.58  15.35  15.40  14.51
  object         29.4   35.70  33.92  55.68  35.97  23.61
attribute        49.88  46.64  47.60  27.27  45.21  53.29
  color          2.75   13.00  12.51  10.57  8.41   15.50
  shape          0.00   0.01   0.02   0.0    0.07   0.30
  size           0.02   0.33   0.39   0.01   0.67   1.38
  texture        0.00   0.13   0.15   0.01   0.25   0.89
  location       47.25  37.09  38.54  21.70  39.92  40.00
  action         1.34   7.97   7.60   3.96   8.01   7.59
Not classified   1.12   5.28   5.90   1.70   3.43   8.60
KL wrt Human     0.953  0.042  0.038  1.48   0.055  —

Table 4: Percentage of questions per question type in all the test set games played by humans (H) and by the models with the 8Q setting, and KL divergence from the human distribution of fine-grained question types.
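For reference, a minimal sketch of the KL computation over question-type distributions is shown below. The direction of the divergence (model with respect to human) and the smoothing constant are our own assumptions, as the exact choices are not spelled out here.

```python
import math

def kl_wrt_human(model_dist, human_dist, eps=1e-8):
    """KL divergence between a model's question-type distribution and the
    human one. Both inputs are dicts mapping question type -> proportion;
    the direction KL(model || human) and the eps smoothing are assumptions."""
    types = set(model_dist) | set(human_dist)
    p = {t: model_dist.get(t, 0.0) + eps for t in types}
    q = {t: human_dist.get(t, 0.0) + eps for t in types}
    zp, zq = sum(p.values()), sum(q.values())
    return sum((p[t] / zp) * math.log((p[t] / zp) / (q[t] / zq)) for t in types)
```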
5 Conclusion

We have enriched the Questioner agent in the goal-oriented dialogue game GuessWhat?! with a Decision Making (DM) component. Based on the visually grounded dialogue state, our Questioner model learns whether to ask a follow-up question or to stop the conversation to guess the target object. We show that the dialogues produced by our model contain fewer repetitions and fewer unnecessary questions, thus potentially leading to more efficient and less unnatural interactions – a well-known limitation of current visual dialogue systems. As in Shekhar et al. (2018), where a simple baseline model was extended with a DM component, task accuracy slightly decreases while the quality of the dialogues increases.

A first attempt to partially tackle this issue within the GuessWhat?! game was made by Strub et al. (2017), who added a special stop token to the vocabulary of the question generator module to learn, via Reinforcement Learning, when to stop asking questions. This is a problematic approach, as it requires the QGen to generate probabilities over a non-linguistic token; furthermore, the decision to ask more questions or to guess is a binary decision, and it is thus not desirable to incorporate it within the large softmax output of the QGen.
Jiaping et al. (2018) propose a hierarchical RL-based Questioner model for the GuessWhich image-guessing game introduced by Chattopadhyay et al. (2017), which uses the VisDial dataset (Das et al., 2017a). The first RL layer is a module that learns to decide when to stop asking questions. We believe that a decision-making component for the GuessWhich game is an ill-posed problem: in this game, the Questioner does not see the pool of candidate images while carrying out the dialogue; hence, it can never know when it has gathered enough information to distinguish the target image from the distractors. In any case, our work shows that a simple approach can be used to augment visually-grounded dialogue systems with a DM without having to resort to the high complexity of RL paradigms.

Task accuracy and dialogue quality are equally important aspects of visually-grounded dialogue systems. It remains to be seen how such systems can reach higher task accuracy while profiting from the better quality that DM-based models produce.

References

Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating visual conversational agents via cooperative human-AI games. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP).

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV).

Zhang Jiaping, Zhao Tiancheng, and Yu Zhou. 2018. Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. In Proceedings of the SIGDIAL Conference, pages 140-150. Association for Computational Linguistics.

Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang. 2017. Answerer in questioner's mind for goal-oriented visual dialogue. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL). ArXiv:1802.03881, last version Feb. 2018.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV).

Ravi Shekhar, Tim Baumgärtner, Aashish Venkatesh, Elia Bruni, Raffaella Bernardi, and Raquel Fernández. 2018. Ask no more: Deciding when to guess in referential visual dialogue. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1218-1233.

Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, and Raquel Fernández. 2019. Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat. In NAACL.

Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision and Pattern Recognition (CVPR).

Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, Jianfeng Lu, and Anton van den Hengel. 2018. Goal-oriented visual question generation via intermediate rewards. In Proceedings of the European Conference on Computer Vision (ECCV).

Yan Zhu, Shaoting Zhang, and Dimitris Metaxas. 2017. Interactive reinforcement learning for object grounding via self-talking. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL).