Which Turn do Neural Models Exploit the Most
to Solve GuessWhat? Diving into the Dialogue
History Encoding in Transformers and LSTMs

          Claudio Greco1*, Alberto Testoni2*, and Raffaella Bernardi1,2
              1 CIMeC - Center for Mind/Brain Sciences
              2 DISI - Dept. of Information Engineering and Computer Science
                                 University of Trento
                               name.surname@unitn.it

        Abstract. We focus on visually grounded dialogue history encoding.
        We show that GuessWhat?! can be used as a “diagnostic” dataset to un-
        derstand whether State-of-the-Art encoders manage to capture salient
        information in the dialogue history. We compare models across several
        dimensions: the architecture (Recurrent Neural Networks vs. Transform-
        ers), the input modalities (only language vs. language and vision), and
        the model background knowledge (trained from scratch vs. pre-trained
        and then fine-tuned on the downstream task). We show that pre-trained
        Transformers are able to identify the most salient information indepen-
        dently of the order in which the dialogue history is processed whereas
        LSTM based models do not.

        Keywords: Visual Dialogue · Language and Vision · History Encoding.


1     Introduction
Visual Dialogue tasks have a long tradition (e.g. [1]). Recently, several dialogue
tasks have been proposed as referential guessing games in which an agent asks
questions about an image to another agent and the referent they have been
speaking about has to be guessed at the end of the game [33, 4, 8, 7, 10, 31].
Among these games, GuessWhat?! and GuessWhich [33, 4] are asymmetrical –
the roles are fixed: one player asks questions (the Questioner) and the other (the
Oracle) answers. The game is considered successful if the Guesser, which can be
the Questioner itself or a third player, selects the correct target.
   Most Visual Dialogue systems proposed in the literature share the encoder-
decoder architecture [29] and are evaluated using the task-success of the Guesser.
By using this metric, multiple components are evaluated at once: the ability of
the Questioner to ask informative questions, of the Oracle to answer them, of the
Encoder to produce a visually grounded representation of the dialogue history
and of the Guesser to select the most probable target object given the image
and the dialogue history.
*
    Equal contribution. The first two authors are reported in alphabetical order.
    Copyright (c) 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).


                                          Questioner                 Oracle
                                          1. Is it on a wooden surface? Yes
                                          2. Is it red?                 No
                                          3. Is it white?               No
                                          4. Is it a scissor?           Yes
                                          5. Is it the scissor on
                                          the left of the picture?      Yes

  Fig. 1: GuessWhat?! human dialogues are short and with a clear division of
 roles between players; most of the last questions are answered positively, are
          long, and contain details suitable to guess the target object.




    In this paper, we disentangle the compressed task-success evaluation and
focus on the ability of the Encoder to produce a dialogue hidden state represen-
tation that encodes the information necessary for the Guesser to select the target
object. Therefore, we use the dialogue history generated by humans playing the
referential game so as to be sure of the quality of the questions and of the answers.
    We run our analysis on GuessWhat?! since, as illustrated in Figure 1, its
dialogues are quite simple: a sequence of rather short questions answered by Yes
or No containing on average 30.1 (SD ± 17.6) tokens per dialogue. The simplicity
of the dialogue structure makes the dataset suitable to be used as a diagnostic
dataset.
    In [23], the authors have shown that neural models are not sensitive to the
order of turns in dialogues and conclude they do not use the history effectively.
In GuessWhat?! dialogues the order in which questions have been asked is not
crucial: we would be able to guess the target object even if the question-answer
pairs in Figure 1 were provided in the reversed order. Indeed, we are able to
use salient information independently of the turns where it occurs. We wonder
whether the same holds for neural models trained to solve the GuessWhat?! task.
As the example in the figure shows, the last question humans ask is usually quite
rich in detail about the target object and is answered positively. We exploit these
features of the dataset to run our in-depth analysis.
    We compare encoders with respect to the architecture (Recurrent Neural
Networks vs. Transformers), the input modalities (only language vs. language
and vision), and the model background knowledge (trained from scratch vs. pre-
trained and then fine-tuned on the downstream task). Our analysis shows that:

 – the GuessWhat?! dataset can be used as a diagnostic dataset to scrutinize
   models’ performance: dialogue length mirrors the level of difficulty of the
   game; most questions in the last turns are answered positively and are longer
   than earlier ones;
 – Transformers are less sensitive than Recurrent Neural Network based models
   to the order in which QA pairs are provided;

 – pre-trained Transformers detect salient information, within the dialogue his-
   tory, independently of the position in which it is provided.

2   Related Work
Scrutinizing Visual Dialogues Encoding Interesting exploratory analysis has
been carried out to understand Visual Question Answering (VQA) systems and
highlight their weaknesses and strengths, e.g. [11, 25, 28, 12]. Less is known about
how well grounded conversational models encode the dialogue history.
    In [23], the authors study how neural dialogue models encode the dialogue
history when generating the next utterance. They show that neither recurrent
nor transformer based architectures are sensitive to perturbations in the dialogue
history and that Transformers are less sensitive than recurrent models to pertur-
bations that scramble the conversational structure; furthermore, their findings
suggest that models enhanced with attention mechanisms use more information
from the dialogue history than their vanilla counterpart. We take inspiration
from this study to understand how State-of-the-Art (SoA) models encode the
visually grounded dialogues generated by humans while playing the GuessWhat?!
game.
    In [13], the authors show that in many reading comprehension datasets, which
presumably require combining questions and passages to predict the correct
answer, models can achieve quite good accuracy by using only part
of the information provided. We investigate the role of each turn in GuessWhat?!
human dialogues and to what extent models encode the strategy seen during
training.

SoA LSTM Based Models on GuessWhat?! After the introduction of the su-
pervised baseline model [33], several models have been proposed. They exploit
either some form of reinforcement learning [22, 36, 37, 35, 6, 34, 21] or coopera-
tive learning [26, 21]; in both cases, the model is first trained with the supervised
learning regime and then the new paradigm is applied. This two-step process has
been shown to reach higher task success than the supervised approach when the
Questioner and Oracle models are put to play together. Since our focus is on the
Guesser and we are evaluating it on human dialogues, we will compare models
that have undergone only the supervised training step. We compare these recur-
rent models (based on LSTMs [24]) against models based on Transformers [32].

Transformer Based Models Recent years have seen a growing popularity of
transformer based models trained on several tasks to reach task-agnostic mul-
timodal representations [14, 17, 30, 2, 27, 20]. ViLBERT [17] has been recently
extended by means of multi-task training involving 12 datasets which include
GuessWhat?! [18] and has been fine-tuned to play the Answerer of VisDial [19].
Among these universal multimodal models, we choose LXMERT [30]. [3] pro-
pose methods for directly analyzing the attention heads aiming to understand
whether they specialize in some specific foundational aspect (like syntactic re-
lations) functional to the overall success of the model. We take inspiration from

their work to shed light on how Transformers, which we adapt to play the Guess-
What?! game, encode the dialogues.


3     Dataset
The GuessWhat?! dataset was collected via Amazon Mechanical Turk by [33].
It is an asymmetric game involving two human participants who see a real-
world image taken from the MS-COCO dataset [15]. One of the participants
(the Oracle) is assigned a target object in the image and the other participant
(the Questioner) has to guess it by asking Yes/No questions to the Oracle. There
are no time constraints to play the game.
    The dataset contains 155K English dialogues about approximately 66K dif-
ferent images. The answers are respectively 52.2% No, 45.6% Yes, and 2.2% N/A
(not applicable); the training set contains 108K datapoints and the validation
and test sets 23K each. Dialogues contain on average 5.2 question-answer (QA)
pairs and the vocabulary consists of around 4900 words; each game has at least 3
and at most 20 candidates. We evaluate models using human dialogues, selecting
only the games in which humans succeeded in finding the target and which contain
at most 10 turns (total number of dialogues used: 90K in training and around 18K
both in validation and testing).3
    We run a careful analysis of the dataset aiming to find features useful to
better understand the performance of models. Although the overall number of
Yes/No answers is balanced, the shorter the dialogues, the higher the percentage
of Yes answers: it goes from 75% in dialogues with 2 turns, to 50% in the
5-turn cluster, to 35% in the 10-turn cluster. Interestingly, most of the
questions in the last turns obtain a positive answer and these questions are on
average longer than earlier ones (see Figure 1 for an example). A model that
encodes these questions well has almost all the information to guess the target
object without actually using the full dialogue history. Not all games are equally
difficult: in shorter dialogues the target object occupies a larger area than in
longer dialogues, and it is quite often a “person” – the most common target in
the dataset; moreover, the number of
distractors in longer dialogues is much higher. Hence, the length of a dialogue is
a good proxy of the level of difficulty of the game. Figure 2 reports the statistics
of the training set; similar ones characterize the validation and the test sets.
    Figure 3 further illustrates this point: longer dialogues contain more
distractors and, in particular, more distractors of the same category as the
target object, which are
supposed to be especially challenging for the models, since each candidate object
is represented simply by its category and coordinates. Moreover, the area occu-
pied by target objects is smaller in longer dialogues and the most representative
category among target objects (“person”) is less frequent.
    We will exploit these features of the dataset to scrutinize the behaviour of
models.
3
    The dataset of human dialogues is available at https://guesswhat.ai/download.


        [Figure 2: three plots over dialogue length (number of turns):
        (top) percentage of Yes answers; (middle) percentage of Yes answers in
        the first vs. last turn; (bottom) average question length per turn for
        3-, 5-, and 8-turn dialogues.]

   Fig. 2: Statistics of the training set (the test set has similar distributions).
Dialogue length refers to the number of turns. Up: The distribution of Yes/No
  questions is very unbalanced across the clusters of games (the percentage of
Yes answers is much higher in shorter dialogues); Middle: In the large majority
of games, the last question is answered positively; Bottom: The last questions
 are always longer (length of questions per turn for the clusters with dialogues
                              having 3, 5, and 8 turns).



       [Figure 3: two plots over human dialogue length: (top) average number
       of distractors, average number of distractors of the same category as
       the target, and percentage of rare words; (bottom) frequency of the
       “person” target and relative target area.]



   Fig. 3: Up: longer human dialogues contain more distractors, more
 distractors of the same category as the target object, and more rare words;
Down: The distribution of target objects is unbalanced, since “person” is the
                            most frequent target.



4   Models


All the evaluated models share the Guesser module proposed in [33]. Candidate
objects are represented by the embeddings obtained via a Multi-Layer Percep-
tron (MLP) starting from the category and spatial coordinates of each candidate
object. The representations so obtained are used to compute dot products with
the hidden dialogue state produced by an encoder. The scores of each candidate
object are given to a softmax classifier to choose the object with the highest
probability. The Guesser is trained in a supervised learning paradigm, receiving
the complete human dialogue history at once. The models we compare differ in
how the hidden dialogue state is computed. Figure 4 shows the shared skeleton.
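
For illustration, the following is a minimal PyTorch sketch of the Guesser head
described above; the layer sizes, the 8 spatial coordinates per candidate, and
the class and variable names are assumptions made for the sketch rather than
the exact values of the original implementation.

    import torch
    import torch.nn as nn

    class Guesser(nn.Module):
        """Scores candidate objects against the hidden dialogue state (sketch)."""

        def __init__(self, n_categories: int, cat_dim: int = 256, hidden_dim: int = 512):
            super().__init__()
            self.cat_emb = nn.Embedding(n_categories, cat_dim)
            # MLP over [category embedding ; spatial coordinates] -> hidden_dim
            self.mlp = nn.Sequential(
                nn.Linear(cat_dim + 8, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, categories, spatial, dialogue_state):
            # categories:     (batch, n_candidates)      category id of each candidate
            # spatial:        (batch, n_candidates, 8)   normalized box coordinates
            # dialogue_state: (batch, hidden_dim)        produced by the encoder
            obj = self.mlp(torch.cat([self.cat_emb(categories), spatial], dim=-1))
            scores = torch.bmm(obj, dialogue_state.unsqueeze(-1)).squeeze(-1)
            return scores.log_softmax(dim=-1)  # distribution over candidate objects

The encoders described in Sections 4.1 and 4.2 only differ in how they produce
the dialogue_state vector; everything downstream of it is shared.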


    [Figure 4: the Encoder receives the dialogue history (e.g. “Is it the cat?
    No – Is it the bottle? No – Is it the pc? Yes”) and, for multimodal models,
    the image, and produces a hidden dialogue state; the Guesser takes the dot
    product between this state and each candidate embedding (category and
    position) and applies a softmax over the resulting scores.]


      Fig. 4: Shared skeleton. Blind models do not receive the image as input.


4.1     Language Encoders

LSTM As in [33], the representations of the candidates are fused with the last
hidden state obtained by an LSTM which processes only the dialogue history.

RoBERTa In the architecture of the model described above, we replace the
LSTM with the robustly-optimized version of BERT [5], RoBERTa, a SoA uni-
versal transformer based encoder introduced in [16].4 We use RoBERTaBASE,
which has been pre-trained on 16GB of English text for 500K steps of masked
language modeling. It has 12 self-attention layers with 12 heads
each. It uses three special tokens, namely CLS, which is taken to be the represen-
tation of the given sequence, SEP, which separates sequences, and EOS, which
denotes the end of the input. We give the output corresponding to the CLS to-
ken to a linear layer and a tanh activation function to obtain the hidden state
which is given to the Guesser. To study the impact of the pre-training phase, we
have compared the publicly available pre-trained model, which we fine-tuned on
GuessWhat?! (RoBERTa), against its counterpart trained from scratch only
on the game (RoBERTa-S).
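
A possible way to plug a RoBERTa encoder into this pipeline with the Hugging
Face transformers library is sketched below; the roberta-base checkpoint and the
512-dimensional projection reflect the setup reported in Section 5.5, while the
class and variable names are illustrative. RoBERTa-S would use the same
architecture with randomly initialized weights.

    import torch.nn as nn
    from transformers import RobertaModel, RobertaTokenizer

    class RobertaDialogueEncoder(nn.Module):
        """Maps the dialogue history to the hidden state given to the Guesser (sketch)."""

        def __init__(self, out_dim: int = 512):
            super().__init__()
            self.roberta = RobertaModel.from_pretrained("roberta-base")
            self.proj = nn.Linear(self.roberta.config.hidden_size, out_dim)  # 768 -> 512

        def forward(self, input_ids, attention_mask):
            out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]   # output at the CLS position
            return self.proj(cls).tanh()        # hidden dialogue state

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    batch = tokenizer(["is it on a wooden surface? yes is it red? no"],
                      return_tensors="pt", padding=True)
    encoder = RobertaDialogueEncoder()
    state = encoder(batch["input_ids"], batch["attention_mask"])  # shape (1, 512)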


4.2     Multimodal Encoders

V-LSTM We enhance the LSTM model described above with the visual modality
by concatenating the linguistic and visual representation and scaling its result
with an MLP; the result is passed through a linear layer and a tanh activation
function to obtain the hidden state which is used as input for the Guesser
module. We use a frozen ResNet-152 pre-trained on ImageNet [9] to extract the
visual vectors.
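
A minimal sketch of the V-LSTM fusion step follows, assuming 2048-dimensional
ResNet-152 features and a 512-dimensional hidden state; the exact layer sizes
and names of the original model may differ.

    import torch
    import torch.nn as nn

    class VLSTMEncoder(nn.Module):
        """Fuses the LSTM dialogue encoding with frozen ResNet-152 features (sketch)."""

        def __init__(self, vocab_size: int, emb_dim: int = 300,
                     hidden_dim: int = 512, visual_dim: int = 2048):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            # MLP scaling the concatenation of the two modalities
            self.fusion = nn.Sequential(nn.Linear(hidden_dim + visual_dim, hidden_dim),
                                        nn.ReLU())
            self.out = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, tokens, visual_feats):
            # tokens: (batch, seq_len)    visual_feats: (batch, 2048) from ResNet-152
            _, (h, _) = self.lstm(self.emb(tokens))
            fused = self.fusion(torch.cat([h[-1], visual_feats], dim=-1))
            return self.out(fused).tanh()   # hidden dialogue state for the Guesser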
4
    We have also tried BERT, but we obtained higher accuracy with RoBERTa.


LXMERT To evaluate the performance of a universal multimodal encoder,
we employ LXMERT (Learning Cross-Modality Encoder Representations from
Transformers) [30]. It represents an image by the set of position-aware object em-
beddings for the 36 most salient regions detected by a Faster R-CNN and it pro-
cesses the text input by position-aware randomly-initialized word embeddings.
Both the visual and linguistic representations are processed by a specialized
transformer encoder based on self-attention layers; their outputs are then pro-
cessed by a cross-modality encoder that through a cross-attention mechanism
generates representations of the single modality (language and visual output)
enhanced with the other modality as well as their joint representation (cross-
modality output). Like RoBERTa, LXMERT uses the special tokens CLS and
SEP; differently from RoBERTa, it uses the special token SEP both to
separate sequences and to denote the end of the textual input. LXMERT has
been pre-trained on five tasks.5 It has 19 attention layers: 9 and 5 self-attention
layers in the language and visual encoders, respectively, and 5 cross-attention
layers. We process the output corresponding to the CLS token as in RoBERTa.
Similarly, we consider both the pre-trained version (LXMERT) and the one
trained from scratch (LXMERT-S).
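
As an illustration of how the pooled CLS output can be obtained, the sketch
below uses the Hugging Face implementation of LXMERT with the
unc-nlp/lxmert-base-uncased checkpoint; the region features and boxes would come
from a Faster R-CNN detector and are replaced here by placeholder tensors.

    import torch
    from transformers import LxmertModel, LxmertTokenizer

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("is it the cat? no is it the pc? yes", return_tensors="pt")
    visual_feats = torch.randn(1, 36, 2048)  # Faster R-CNN features of the 36 regions (placeholder)
    visual_pos = torch.rand(1, 36, 4)        # normalized bounding-box coordinates (placeholder)

    out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
    cls = out.pooled_output                  # cross-modality output at the CLS token
    # As for RoBERTa, cls is then passed through a linear layer and a tanh
    # to obtain the hidden dialogue state given to the Guesser.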


5     Experiments

We compare the models described above using human dialogues, aiming to shed
light on how the encoders capture the information that is salient to guess the
target object.


5.1   Task Success

The dialogues produced by human players of the GuessWhat?! game are expected to
contain, together with the image they are about, the information necessary to
detect the target object among the candidates. We refer to them as Ground
Truth (GT) dialogues. As we can see in Table 1, the Guesser based on a blind
encoder (LSTM or RoBERTa from scratch or pre-trained) obtains results higher
than or comparable with V-LSTM.6
    Table 2 reports the accuracy by clusters of games based on the dialogue
length. All models reach a very high and similar accuracy in short games and
differ more in longer ones. Most of the boost obtained by RoBERTa seems to
come in longer dialogues where its from scratch version (RoBERTa-S) performs
on a par with the other models.
5
  Masked cross-modality language modeling, masked object prediction via RoI-feature
  regression, masked object prediction via detected-label classification, cross-modality
  matching, and image question answering.
6
  The model proposed in [18] based on ViLBERT obtains an accuracy on GuessWhat?!
  with human dialogues of 65.04% when trained together with the other 11 tasks and
  62.81% when trained only on it.


                                       GT    Reversed
                  Blind LSTM           64.7  56.0
                        RoBERTa-S      64.2  57.8
                        RoBERTa        67.9  66.5
                  MM    V-LSTM         64.5  51.3
                        LXMERT-S       64.7  58.3
                        LXMERT         64.7  60.3
    Table 1: We compare the accuracy of models on the test set containing
   dialogues in the Ground Truth (GT) order of turns vs. the reversed order
                                 (reversed).


               LSTM RoBERTa-S RoBERTa V-LSTM LXMERT-S LXMERT
         All    64.7   64.2     67.9    64.5   64.7     64.7
          3     72.5   72.7     75.3    71.9    73      73.8
          5     59.3   58.3     60.1    59.3   59.2     58.7
          8     47.3   45.1     51.0    47.2   46.8     43.3
 Table 2: Accuracy with GT dialogues: results for all games, and for those of
                          3/5/8 dialogue length.



    These results show that the human dialogue history alone is quite informative
to accomplish the task. If we go back to the example in Figure 1, we realize it is
possible to succeed in that game if we are given the dialogue only and are asked
to select the target object (the scissor on the left) among candidates for which
we are told the category and the coordinates – as is the case for the Guesser.
    In the following, we run an in-depth analysis to understand whether
models are able to identify salient information independently of the position in
which it occurs.

5.2   Are Models Sensitive to the Strategy Seen during Training?
In Section 3, we have seen that human dialogues tend to share a specific strategy,
i.e. questions asked in the first turns are rather short whereas those in
the last turns provide relevant details about the most probable target object.
We wonder whether the models under analysis become sensitive to the above-
mentioned strategy and learn to focus on some turns more than others rather
than on the actual salient QA pair.
    Following [23], we perturb the dialogue history in the test set by reversing
the order of turns from the last to the first one (reversed). Differently from them,
given the nature of the GuessWhat?! dialogue history, we value positively mod-
els that are robust to this change in the dialogue history order. Our experiment
(Table 1) shows that Transformers are less sensitive than LSTMs to the order
in which QA pairs are provided. Interestingly, the pre-training phase seems to
mitigate the effect of the change of the order even more: while RoBERTa has
a drop of just 1.4 points, the accuracy of its from-scratch counterpart drops by 6.4 points.


In other words, (pre-trained) Transformers seem to be able to identify
salient information independently of the position in which it is pro-
vided within the dialogue history.
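
The perturbation itself is straightforward; a sketch of how the reversed history
can be built from a list of question-answer pairs is given below (the data
format is illustrative, not the exact one of the GuessWhat?! release).

    from typing import List, Tuple

    def reverse_history(qa_pairs: List[Tuple[str, str]]) -> str:
        """Concatenates the question-answer pairs with the turn order reversed."""
        return " ".join(f"{q} {a}" for q, a in reversed(qa_pairs))

    dialogue = [("is it on a wooden surface?", "yes"),
                ("is it red?", "no"),
                ("is it the scissor on the left of the picture?", "yes")]
    print(reverse_history(dialogue))
    # is it the scissor on the left of the picture? yes is it red? no is it on a wooden surface? yes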

5.3     The role of the last question
Table 3 reports the results of the models when receiving the dialogue history
without the last turn. As we can see, all models undergo a similar drop in accu-
racy. This means that all models identify the last turn as the most informative
one equally well. It is worth noting that the superiority of RoBERTa compared to
other models pops up even when removing the last turn, showing that RoBERTa
is indeed able to better encode the full dialogue history and not only parts of
it. This holds for different dialogue lengths, as shown in Table 3. On average,
removing the last turn affects the performance of multimodal models more. For
5-turn dialogues, the accuracy drops by 12.2 points for blind models and by
14.3 for multimodal models. Similarly, for 8-turn dialogues the accuracy drops by
8 points (blind) and 9.3 (multimodal).


                          3-Q                 5-Q                 8-Q
        Model       All     W/o last    All     W/o last    All     W/o last
                    turns   turn        turns   turn        turns   turn
        LSTM        72.5    53.4        59.3    46.8        47.3    38.4
        RoBERTa-S   72.7    55.4        58.3    44.9        45      38.9
        RoBERTa     75.3    58.2        60.1    49.3        51      42
        V-LSTM      71.9    53.8        59.3    43.7        47.2    36.5
        LXMERT-S    73      55.8        59.2    45          46.8    38.8
        LXMERT      73.8    55.3        58.7    45.6        43.3    34.1
   Table 3: Accuracy of the models when receiving all turns of the dialogue
 history and when removing the last turn for dialogues with 3, 5, and 8 turns.




5.4     How attention is distributed across turns
In Section 3 we have seen that the last turn is usually answered positively and it is
quite informative for detecting the target object. We wonder whether this is reflected
in how models distribute their attention across turns within a dialogue. To this
end, we analyze how much each turn contributes to the overall self-attention
within a dialogue by summing the attention of each token within a turn. We run
this analysis for LXMERT and RoBERTa in their various versions: all models
put more attention on the last turn when the GT order of turns is given.
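A sketch of this per-turn aggregation is given below, assuming the attention
maps returned by a Hugging Face model called with output_attentions=True and a
precomputed mapping from token positions to turn indices; the function name and
the averaging choices are ours.

    import torch

    def attention_per_turn(attentions, turn_ids):
        """
        attentions: tuple of per-layer tensors of shape (batch, heads, seq, seq),
                    e.g. from model(..., output_attentions=True) for one dialogue.
        turn_ids:   LongTensor of shape (seq,) mapping every token position to its
                    turn index (0 for special tokens, 1..T for the QA pairs).
        Returns the share of attention received by each turn, averaged over
        layers, heads, and query positions.
        """
        att = torch.stack(attentions).mean(dim=(0, 1, 2, 3))  # attention received per token
        totals = torch.zeros(int(turn_ids.max()) + 1)
        totals.scatter_add_(0, turn_ids, att)                 # sum the tokens of each turn
        return totals / totals.sum()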
    In Table 1, we have seen that Transformers are more robust than the other
models when the dialogue history is presented in the reversed order (the first
QA pair of the GT is presented as the last turn and the last QA pair is presented
as first turn). Our analysis of the attention heads of RoBERTa and LXMERT


      [Figure 5: bar plot of the attention assigned by LXMERT-S to CLS, QA 1
      to QA 5, and SEP, under the GT and the reversed turn order.]




Fig. 5: Attention assigned by LXMERT-S to each turn in a dialogue when the
   dialogue history is given in the GT order (from QA1 to QA5) or in the
                      reversed order (from QA5 to QA1).


shows that these models, both in their from scratch and pre-trained version,
focus more on the question asked last also in the reversed test set where it is
presented in the first position. This shows they are still able to identify the most
salient information. In Figure 5, we report the attention per turn of LXMERT-S
when receiving the GT and the reversed test set.

5.5   Details for reproducibility
We used the GuessWhat?! dataset in our experiments (http://guesswhat.ai/
download). The dataset contains 155000 English dialogues about approximately
66000 different images. The Train split contains 108000 datapoints, the Valida-
tion split 23000 datapoints, and the Test split 23000 datapoints. We considered
only the dialogues corresponding to games in which humans succeeded and having
at most 10 turns.
    For training LSTM based models we adapted the source codes available
at https://github.com/shekharRavi/Beyond-Task-Success-NAACL2019 and
at https://github.com/GuessWhatGame/guesswhat/. For training transformer
based models we adapted the source code available at https://github.com/
huggingface/transformers. The scripts for all the experiments and the mod-
ified models will be made available upon acceptance. For all models, we used
the same hyperparameters of the original works. When adapting Transformers
to the GuessWhat?! task, we scaled the representation of the CLS token from
768 to 512. We used PyTorch 1.0.1 for all models except for LSTM, for which
we have used TensorFlow 1.3. All models are trained with the Adam optimizer. For
transformer based models we used a batch size equal to 16, a weight decay equal
to 0.01, gradient clipping equal to 5, and a learning rate which is warmed up over
the first 10% iterations to a peak value of 0.00001 and then linearly decayed.
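
The schedule described above roughly corresponds to the following sketch; it
uses AdamW (the weight-decay-aware variant of Adam) and the linear-warmup
scheduler shipped with the transformers library, and total_steps stands for the
total number of training iterations.

    import torch
    from transformers import get_linear_schedule_with_warmup

    def build_optimizer(model, total_steps, lr=1e-5, weight_decay=0.01):
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.1 * total_steps),  # warm-up over the first 10% of iterations
            num_training_steps=total_steps,           # then linear decay
        )
        return optimizer, scheduler

    # Training loop (batches of size 16), after loss.backward():
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping
    #   optimizer.step(); scheduler.step(); optimizer.zero_grad()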
    Regarding the infrastructure, we used 1 Titan V GPU. LSTM based models
took about 15 hours for completing 100 training epochs. Transformer based


models took about 4 days for completing 25 training epochs. Evaluating the best
trained model of each experiment took about 10 minutes.
   Details on the best epoch, the validation accuracy, and the number of pa-
rameters of each model are reported in Table 4.


            Model     Best epoch Validation accuracy Parameters
            LSTM      18         65.6                5,030,144
            RoBERTa 6            68.7                125,460,992
            RoBERTa-S 13         64.7                125,460,992
            V-LSTM    8          65.2                10,952,818
            LXMERT-S 17          65.4                208,900,978
            LXMERT 11            65.1                208,900,978
    Table 4: Epoch, validation set accuracy and number of parameters for each
                                    best model.




6     Conclusion

Our detailed analysis of the GuessWhat?! dataset has revealed features of its
games that we have exploited to run a diagnostic analysis of SoA models.
    Our comparative analysis has shown that Transformers are less sensitive than
LSTMs to the order in which QA pairs are provided and that their pre-trained
versions are even stronger in detecting salient information, within the dialogue
history, independently of the position in which it is provided.
    Furthermore, our results show that RoBERTa is the encoder that provides
the Guesser with the most informative representation of the dialogue history.
Its advantage is particularly strong in longer dialogues. The dialogue already
contains all the information necessary to identify the target among the candidates:
for both LSTM and transformer based models, the blind version obtains results
higher than or comparable with its multimodal counterpart. We conjecture that this is due
to the fact that the Guesser has access to the category of the target object.
Important progress has been made on multimodal models since the introduction
of the GuessWhat?! game. It would be interesting to see how SoA models would
perform when they have to rely on visual information rather than on the raw category.


Acknowledgments

We kindly acknowledge the support of NVIDIA Corporation with the donation
of the GPUs used in our research at the University of Trento. We acknowledge
SAP for sponsoring the work.


References
 1. Anderson, A.H., Bader, M., Bard, E.G., Boyle, E., Doherty, G., Garrod, S., Isard,
    S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H., Weinert, R.:
    The HCRC map task corpus. Language and Speech 34, 351–366 (1991)
 2. Chen, Y.C., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.:
    UNITER: Learning universal image-text representations (2019), arXiv:1909.11740
 3. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at?
    An analysis of BERT’s attention. In: Proceedings of the 2019 ACL Workshop
    BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276–286
    (2019)
 4. Das, A., Kottur, S., Moura, J.M., Lee, S., Batra, D.: Learning Cooperative Visual
    Dialog Agents with Deep Reinforcement Learning. In: 2017 IEEE International
    Conference on Computer Vision. pp. 2951–2960 (2017)
 5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
    bidirectional transformers for language understanding. In: Proceedings of the 2019
    Conference of the North American Chapter of the Association for Computa-
    tional Linguistics: Human Language Technologies, Volume 1 (Long and Short
    Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapo-
    lis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://www.
    aclweb.org/anthology/N19-1423
 6. Gan, Z., Cheng, Y., Kholy, A.E., Li, L., Liu, J., Gao, J.: Multi-step reasoning
    via recurrent dual attention for visual dialog. In: Proceedings of the 57th Annual
    Meeting of the Association for Computational Linguistics. pp. 6463–6474 (2019)
 7. Haber, J., Baumgärtner, T., Takmaz, E., Gelderloos, L., Bruni, E., Fernández, R.:
    The PhotoBook dataset: Building common ground through visually-grounded dia-
    logue. In: Proceedings of the 57th Annual Meeting of the Association for Compu-
    tational Linguistics. pp. 1895–1910 (Jul 2019). https://doi.org/10.18653/v1/P19-
    1184, https://www.aclweb.org/anthology/P19-1184
 8. He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative
    dialogue agents with dynamic knowledge graph embeddings. In: Proceedings of
    the 55th Annual Meeting of the Association for Computational Linguistics. pp.
    1766–1776 (2017)
 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
    Proceedings of the IEEE conference on computer vision and pattern recognition.
    pp. 770–778 (2016)
10. Ilinykh, N., Zarrieß, S., Schlangen, D.: Tell Me More: A Dataset of Visual Scene
    Description Sequences. In: Proceedings of the 12th International Conference on
    Natural Language Generation. pp. 152–157 (2019), https://www.aclweb.org/
    anthology/W19-8621
11. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick,
    R.B.: CLEVR: A diagnostic dataset for compositional language and elementary vi-
    sual reasoning. In: IEEE Conference on Computer Vision and Pattern Recognition.
    vol. abs/1612.06890 (2017)
12. Kafle, K., Kanan, C.: An analysis of visual question answering algorithms. In:
    Proceedings of the IEEE International Conference on Computer Vision. pp. 1965–
    1973 (2017)
13. Kaushik, D., Lipton, Z.C.: How much reading does reading comprehension require?
    A critical investigation of popular benchmarks. In: Proceedings of the 2018 Confer-
    ence on Empirical Methods in Natural Language Processing. pp. 5010–5015 (2018)


14. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: A simple
    and performant baseline for vision and language (2019), arXiv:1908.03557
15. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
    Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of
    ECCV (European Conference on Computer Vision). pp. 740–755 (2014)
16. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
    Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining
    approach (2019), arXiv:1907.11692
17. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining task-agnostic visi-
    olinguistic representations for vision-and-language tasks. In: Advances in Neural
    Information Processing Systems. pp. 13–23 (2019)
18. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: Multi-task vision
    and language representation learning. In: Proceedings of CVPR (2020)
19. Murahari, V., Batra, D., Parikh, D., Das, A.: Large-scale pretraining for visual
    dialog: A simple state-of-the-art baseline. arXiv preprint arXiv:1912.02379 (2019)
20. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D., Zhou, M.: Unicoder-VL: A uni-
    versal encoder for vision and language by cross-modal pre-training. In: Proceedings
    of AAAI (2020)
21. Pang, W., Wang, X.: Visual dialogue state tracking for question generation. In:
    Proceedings of 34th AAAI Conference on Artificial Intelligence (2020)
22. Lee, S.W., Gao, T., Yang, S., Yoo, J., Ha, J.W.: Large-scale an-
    swerer in questioner’s mind for visual dialog question generation. In: Proceedings
    of International Conference on Learning Representations, ICLR (2019)
23. Sankar, C., Subramanian, S., Pal, C., Chandar, S., Bengio, Y.: Do neural dialog
    systems use the conversation history effectively? An empirical study. In: Proceed-
    ings of the 57th Annual Meeting of the Association for Computational Linguistics.
    pp. 32–37 (2019), https://www.aclweb.org/anthology/P19-1004
24. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8),
    1735–1780 (1997)
25. Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E.,
    Bernardi, R.: FOIL it! Find one mismatch between image and language caption.
    In: Proceedings of the 55th Annual Meeting of the Association for Computational
    Linguistics (Volume 1: Long Papers). pp. 255–265 (2017)
26. Shekhar, R., Venkatesh, A., Baumgärtner, T., Bruni, E., Plank, B., Bernardi,
    R., Fernández, R.: Beyond task success: A closer look at jointly learning
    to see, ask, and GuessWhat. In: Proceedings of the 2019 Conference of the
    North American Chapter of the Association for Computational Linguistics: Hu-
    man Language Technologies, Volume 1 (Long and Short Papers). pp. 2578–
    2587 (2019). https://doi.org/10.18653/v1/N19-1265, https://www.aclweb.org/
    anthology/N19-1265
27. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: VL-BERT: Pre-training
    of generic visual-linguistic representations. In: ICLR (2020)
28. Suhr, A., Lewis, M., Yeh, J., Artzi, Y.: A corpus of natural language for visual
    reasoning. In: Proceedings of the Annual Meeting of the Association for Com-
    putational Linguistics. pp. 217–223. Association for Computational Linguistics,
    Vancouver, Canada (July 2017), http://aclweb.org/anthology/P17-2034
29. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
    networks. In: Advances in neural information processing systems. pp. 3104–3112
    (2014)


30. Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations
    from transformers. In: Proceedings of the 2019 Conference on Empirical Methods
    in Natural Language Processing and the 9th International Joint Conference on
    Natural Language Processing (EMNLP-IJCNLP). pp. 5103–5114 (2019)
31. Udagawa, T., Aizawa, A.: A natural language corpus of common grounding under
    continuous and partially-observable context. In: Proceedings of the AAAI Confer-
    ence on Artificial Intelligence. vol. 33, pp. 7120–7127 (2019)
32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
    L., Polosukhin, I.: Attention is all you need. In: Advances in neural information
    processing systems. pp. 5998–6008 (2017)
33. de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., Courville, A.C.:
    GuessWhat?! Visual object discovery through multi-modal dialogue. In: 2017 IEEE
    Conference on Computer Vision and Pattern Recognition. pp. 5503–5512 (2017)
34. Yang, T., Zha, Z.J., Zhang, H.: Making history matter: History-advantage sequence
    training for visual dialog. In: Proceedings of the International Conference on Com-
    puter Vision (ICCV) (2019)
35. Zhang, J., Zhao, T., Yu, Z.: Multimodal hierarchical reinforcement learning policy
    for task-oriented visual dialog. In: Proceedings of the 19th Annual SIGdial Meet-
    ing on Discourse and Dialogue. pp. 140–150 (2018), https://www.aclweb.org/
    anthology/W18-5015
36. Zhang, J., Wu, Q., Shen, C., Zhang, J., Lu, J., van den Hengel, A.: Goal-oriented
    visual question generation via intermediate rewards. In: Proceedings of the Euro-
    pean Conference of Computer Vision (ECCV). pp. 186–201 (2018)
37. Zhao, R., Tresp, V.: Improving goal-oriented visual dialog agents via advanced
    recurrent nets with tempered policy gradient. In: Proceedings of IJCAI (2018)



