<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Which Turn do Neural Models Exploit the Most to Solve GuessWhat? Diving into the Dialogue History Encoding in Transformers and LSTMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Greco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Testoni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaella Bernardi</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIMeC - Center for Mind/Brain Sciences</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISI - Dept. of Information Engineering and Computer Science, University of Trento</institution>
        </aff>
      </contrib-group>
      <fpage>29</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>We focus on visually grounded dialogue history encoding. We show that GuessWhat?! can be used as a "diagnostic" dataset to understand whether State-of-the-Art encoders manage to capture salient information in the dialogue history. We compare models across several dimensions: the architecture (Recurrent Neural Networks vs. Transformers), the input modalities (only language vs. language and vision), and the model background knowledge (trained from scratch vs. pre-trained and then fine-tuned on the downstream task). We show that pre-trained Transformers are able to identify the most salient information independently of the order in which the dialogue history is processed, whereas LSTM-based models do not.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Dialogue</kwd>
        <kwd>Language and Vision</kwd>
        <kwd>History Encoding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Visual Dialogue tasks have a long tradition (e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Recently, several dialogue
tasks have been proposed as referential guessing games in which an agent asks
questions about an image to another agent and the referent they have been
speaking about has to be guessed at the end of the game [
        <xref ref-type="bibr" rid="ref10 ref31 ref33 ref4 ref7 ref8">33, 4, 8, 7, 10, 31</xref>
        ].
Among these games, GuessWhat?! and GuessWhich [
        <xref ref-type="bibr" rid="ref33 ref4">33, 4</xref>
        ] are asymmetrical:
the roles are fixed, with one player asking questions (the Questioner) and the other (the
Oracle) answering. The game is considered successful if the Guesser, which can be
the Questioner itself or a third player, selects the correct target.
      </p>
      <p>
        Most Visual Dialogue systems proposed in the literature share the
encoder-decoder architecture [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and are evaluated using the task-success of the Guesser.
By using this metric, multiple components are evaluated at once: the ability of
the Questioner to ask informative questions, of the Oracle to answer them, of the
Encoder to produce a visually grounded representation of the dialogue history
and of the Guesser to select the most probable target object given the image
and the dialogue history.
      </p>
      <p>Fig. 1: GuessWhat?! human dialogues are short and show a clear division of
roles between the players. In the example, the Questioner asks: "1. Is it on a wooden
surface?" (Yes), "2. Is it red?", "3. Is it white?", "4. Is it a scissor?", and
"5. Is it the scissor on the left of the picture?" (Yes). Most of the last questions are
answered positively, are long, and contain details suitable for guessing the target
object.</p>
    </sec>
    <sec id="sec-5">
      <p>In this paper, we disentangle the compressed task-success evaluation and
focus on the ability of the Encoder to produce a dialogue hidden state
representation that encodes the information necessary for the Guesser to select the target
object. Therefore, we use the dialogue history generated by humans playing the
referential game, so as to be sure of the quality of the questions and of the answers.</p>
      <p>We run our analysis on GuessWhat?! since, as illustrated in Figure 1, its
dialogues are quite simple: a sequence of rather short questions answered by Yes
or No, containing on average 30.1 (SD 17.6) tokens per dialogue. The simplicity
of the dialogue structure makes the dataset suitable to be used as a diagnostic
dataset.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], the authors have shown that neural models are not sensitive to the
order of turns in dialogues and conclude that they do not use the history effectively.
In GuessWhat?! dialogues the order in which questions have been asked is not
crucial: we would be able to guess the target object even if the question-answer
pairs in Figure 1 were provided in the reversed order. Indeed, we are able to
use salient information independently of the turns where it occurs. We wonder
whether the same holds for neural models trained to solve the GuessWhat?! task.
As the example in the figure shows, the last question humans ask is usually quite
rich in detail about the target object and is answered positively. We exploit these
features of the dataset to run our in-depth analysis.
      </p>
      <p>We compare encoders with respect to the architecture (Recurrent Neural
Networks vs. Transformers), the input modalities (only language vs. language
and vision), and the model background knowledge (trained from scratch vs.
pre-trained and then fine-tuned on the downstream task). Our analysis shows that:
– the GuessWhat?! dataset can be used as a diagnostic dataset to scrutinize
models' performance: dialogue length mirrors the level of difficulty of the
game; most questions in the last turns are answered positively and are longer
than earlier ones;
– Transformers are less sensitive than Recurrent Neural Network based models
to the order in which QA pairs are provided;
– pre-trained Transformers detect salient information, within the dialogue
history, independently of the position in which it is provided.</p>
      <sec id="sec-5-1">
        <title>Related Work</title>
        <p>
          Scrutinizing Visual Dialogue Encoding. Interesting exploratory analysis has
been carried out to understand Visual Question Answering (VQA) systems and
highlight their weaknesses and strengths, e.g. [
          <xref ref-type="bibr" rid="ref11 ref12 ref25 ref28">11, 25, 28, 12</xref>
          ]. Less is known about
how well grounded conversational models encode the dialogue history.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], the authors study how neural dialogue models encode the dialogue
history when generating the next utterance. They show that neither recurrent
nor transformer based architectures are sensitive to perturbations in the dialogue
history and that Transformers are less sensitive than recurrent models to
perturbations that scramble the conversational structure; furthermore, their findings
suggest that models enhanced with attention mechanisms use more information
from the dialogue history than their vanilla counterpart. We take inspiration
from this study to understand how State-of-the-Art (SoA) models encode the
visually grounded dialogues generated by humans while playing the GuessWhat?!
game.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the authors show that in many reading comprehension datasets, that
presumably require the combination of both questions and passages to predict
the correct answer, models can achieve quite a good accuracy by using only part
of the information provided. We investigate the role of each turn in GuessWhat?!
human dialogues and to what extent models encode the strategy seen during
training.
        </p>
        <p>
          SoA LSTM-Based Models on GuessWhat?! After the introduction of the
supervised baseline model [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], several models have been proposed. They exploit
either some form of reinforcement learning [
          <xref ref-type="bibr" rid="ref21 ref22 ref34 ref35 ref36 ref37 ref6">22, 36, 37, 35, 6, 34, 21</xref>
          ] or
cooperative learning [
          <xref ref-type="bibr" rid="ref21 ref26">26, 21</xref>
          ]; in both cases, the model is first trained with the supervised
learning regime and then the new paradigm is applied. This two-step process has
been shown to reach higher task success than the supervised approach when the
Questioner and Oracle models are put to play together. Since our focus is on the
Guesser and we are evaluating it on human dialogues, we will compare models
that have undergone only the supervised training step. We compare these
recurrent models (based on LSTMs [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]) against models based on Transformers [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
Transformer-Based Models. Recent years have seen an increasing popularity of
transformer-based models trained on several tasks to reach task-agnostic
multimodal representations [
          <xref ref-type="bibr" rid="ref14 ref17 ref2 ref20 ref27 ref30">14, 17, 30, 2, 27, 20</xref>
          ]. ViLBERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] has been recently
extended by means of multi-task training involving 12 datasets which include
GuessWhat?! [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and has been fine-tuned to play the Answerer of VisDial [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
Among these universal multimodal models, we choose LXMERT [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
propose methods for directly analyzing the attention heads, aiming to understand
whether they specialize in some specific foundational aspect (like syntactic
relations) functional to the overall success of the model. We take inspiration from
their work to shed light on how the Transformers that we adapt to play the
GuessWhat?! game encode the dialogues.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Dataset</title>
        <p>
          The GuessWhat?! dataset was collected via Amazon Mechanical Turk by [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
It is an asymmetric game involving two human participants who see a real-world
image taken from the MS-COCO dataset [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. One of the participants
(the Oracle) is assigned a target object in the image and the other participant
(the Questioner) has to guess it by asking Yes/No questions to the Oracle. There
are no time constraints to play the game.
        </p>
        <p>The dataset contains 155K English dialogues about approximately 66K
different images. The answers are respectively 52.2% No, 45.6% Yes, and 2.2% N/A
(not applicable); the training set contains 108K datapoints and the validation
and test sets 23K each. Dialogues contain on average 5.2 question-answer (QA)
pairs and the vocabulary consists of around 4900 words; each game has at least 3
and at most 20 candidates. We evaluate models using human dialogues, selecting
only the games in which humans succeeded in finding the target and which contain at
most 10 turns (total number of dialogues used: 90K in training and around 18K
both in validation and testing).3</p>
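The selection step described above (keeping only human-successful games with at most 10 turns) can be sketched as follows. This is a hypothetical illustration: the field names ("status", "qas") follow the GuessWhat?! JSON release, but the exact schema should be checked against the downloaded files.

```python
# Hypothetical sketch of the evaluation-set filtering described above:
# keep only the games that humans completed successfully and that have
# at most 10 question-answer turns.

def filter_games(games, max_turns=10):
    """Return the games usable for evaluation on human dialogues."""
    return [
        g for g in games
        if g.get("status") == "success" and len(g.get("qas", [])) <= max_turns
    ]

# toy data with the assumed schema
games = [
    {"status": "success", "qas": [{"q": "Is it red?", "a": "No"}] * 4},
    {"status": "failure", "qas": [{"q": "Is it a cat?", "a": "Yes"}] * 2},
    {"status": "success", "qas": [{"q": "Is it big?", "a": "No"}] * 12},
]
print(len(filter_games(games)))  # only the first game survives
```
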
        <p>We run a careful analysis of the dataset aiming to find features useful to
better understand the performance of the models. Although the overall number of
Yes/No answers is balanced, the shorter the dialogues, the higher the percentage
of Yes answers: it goes from 75% in dialogues with 2 turns to 50% in
the 5-turn cluster to 35% in the 10-turn cluster. Interestingly, most of the
questions in the last turns obtain a positive answer and these questions are on
average longer than earlier ones (see Figure 1 for an example). A model that
encodes these questions well has almost all the information needed to guess the target
object without actually using the full dialogue history. Not all games are equally
difficult: in shorter dialogues the area of the target object is bigger than
that of target objects in longer dialogues, and their target object is quite often
a "person", the most common target in the dataset; moreover, the number of
distractors in longer dialogues is much higher. Hence, the length of a dialogue is
a good proxy of the level of difficulty of the game. Figure 2 reports the statistics
of the training set; similar ones characterize the validation and the test sets.</p>
        <p>The length of the dialogue is a good proxy of the level of di culty of the
game. Figure 3 shows that longer dialogues contain more distractors and in
particular more distractors of the same category of the target object, which are
supposed to be especially challenging for the models, since each candidate object
is represented simply by its category and coordinates. Moreover, the area
occupied by target objects is smaller in longer dialogues and the most representative
category among target objects ("person") is less frequent.</p>
        <p>We will exploit these features of the dataset to scrutinize the behaviour of
the models.
3 The dataset of human dialogues is available at https://guesswhat.ai/download.</p>
        <p>
          Fig. 2: Statistics of the training set (the test set has similar distributions).
Dialogue length refers to the number of turns. Up: The distribution of Yes/No
questions is very unbalanced across the clusters of games (the percentage of
Yes answers is much higher in shorter dialogues); Middle In the large majority
of games, the last question is answered positively; Bottom: The last questions
are always longer (length of questions per turn for the clusters with dialogues
having 3, 5, and 8 turns).
avg # distractors same category
12.0
0
avg # distractors
% of rare words
30
All the evaluated models share the Guesser module proposed in [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Candidate
objects are represented by the embeddings obtained via a Multi-Layer
Perceptron (MLP) starting from the category and spatial coordinates of each candidate
object. The representations so obtained are used to compute dot products with
the hidden dialogue state produced by an encoder. The scores of each candidate
object are given to a softmax classi er to choose the object with the highest
probability. The Guesser is trained in a supervised learning paradigm, receiving
the complete human dialogue history at once. The models we compare di er in
how the hidden dialogue state is computed. Figure 4 shows the shared skeleton.
        </p>
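The scoring step of the shared Guesser can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the embedding sizes, the MLP shape, and the random toy inputs are all illustrative assumptions; only the overall pipeline (MLP over category + coordinates, dot product with the dialogue state, softmax over candidates) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embed(category_onehot, coords, W1, W2):
    # Toy stand-in for the candidates' MLP: category one-hot plus 8 spatial
    # coordinates -> candidate embedding (all dimensions are illustrative).
    h = np.maximum(0.0, np.concatenate([category_onehot, coords]) @ W1)  # ReLU
    return h @ W2

def guesser_probs(dialogue_state, candidate_embs):
    """Dot product between each candidate embedding and the hidden dialogue
    state, followed by a softmax over the candidates."""
    logits = candidate_embs @ dialogue_state
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy example: 4 candidate objects, 90 COCO-style categories, embedding size 32.
W1 = rng.normal(size=(98, 64))
W2 = rng.normal(size=(64, 32))
candidates = np.stack([
    mlp_embed(np.eye(90)[c], rng.normal(size=8), W1, W2)
    for c in (3, 3, 17, 42)
])
probs = guesser_probs(rng.normal(size=32), candidates)
```

The predicted target is then simply `probs.argmax()`; at training time, a cross-entropy loss on this distribution supervises the whole pipeline.
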
        <p>Fig. 4: The shared skeleton of the models: an Encoder turns the dialogue
history (e.g. "Is it the cat? No. Is it the bottle? No. Is it the pc? Yes") into a hidden
dialogue state; the Guesser scores each candidate object (category and position)
against this state through a softmax.</p>
        <p>
          LSTM. As in [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], the representations of the candidates are fused with the last
hidden state obtained by an LSTM which processes only the dialogue history.
RoBERTa. In the architecture of the model described above, we replace the
LSTM with the robustly-optimized version of BERT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], RoBERTa, a SoA
universal transformer based encoder introduced in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].4 We use RoBERTa-BASE,
which has been pre-trained on 16GB of English text for 500K steps to
perform masked language modeling. It has 12 self-attention layers with 12 heads
each. It uses three special tokens: CLS, which is taken to be the
representation of the given sequence; SEP, which separates sequences; and EOS, which
denotes the end of the input. We give the output corresponding to the CLS
token to a linear layer and a tanh activation function to obtain the hidden state
which is given to the Guesser. To study the impact of the pre-training phase, we
have compared the publicly available pre-trained model, which we fine-tuned on
GuessWhat?! (RoBERTa), against its counterpart trained from scratch only
on the game (RoBERTa-S).
        </p>
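The pooling head described above (CLS output through a linear layer and a tanh) is a one-liner; a hedged NumPy sketch, with illustrative sizes only (the 768-to-512 projection matches the adaptation reported in the reproducibility details):

```python
import numpy as np

def dialogue_state_from_cls(cls_output, W, b):
    """Project the transformer's CLS output through a linear layer and a tanh
    to obtain the hidden dialogue state fed to the Guesser. W, b, and the
    sizes below are illustrative assumptions."""
    return np.tanh(cls_output @ W + b)

rng = np.random.default_rng(1)
# 768-dim CLS output scaled down to a 512-dim dialogue state
state = dialogue_state_from_cls(
    rng.normal(size=768),
    rng.normal(size=(768, 512)) * 0.01,
    np.zeros(512),
)
```

The same pooling is applied to both RoBERTa and LXMERT, so the Guesser is agnostic to which encoder produced the state.
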
        <sec id="sec-5-2-1">
          <title>Multimodal Encoders</title>
          <p>
            V-LSTM. We enhance the LSTM model described above with the visual modality
by concatenating the linguistic and visual representations and scaling the result
with an MLP; the result is passed through a linear layer and a tanh activation
function to obtain the hidden state which is used as input for the Guesser
module. We use a frozen ResNet-152 pre-trained on ImageNet [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] to extract the
visual vectors.
4 We have also tried BERT, but we obtained higher accuracy with RoBERTa.
LXMERT. To evaluate the performance of a universal multimodal encoder,
we employ LXMERT (Learning Cross-Modality Encoder Representations from
Transformers) [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]. It represents an image by the set of position-aware object
embeddings for the 36 most salient regions detected by a Faster R-CNN, and it
processes the text input by position-aware randomly-initialized word embeddings.
Both the visual and linguistic representations are processed by a specialized
transformer encoder based on self-attention layers; their outputs are then
processed by a cross-modality encoder which, through a cross-attention mechanism,
generates representations of each single modality (language and visual output)
enhanced with the other modality, as well as their joint representation
(cross-modality output). Like RoBERTa, LXMERT uses the special tokens CLS and
SEP. Differently from RoBERTa, LXMERT uses the special token SEP both to
separate sequences and to denote the end of the textual input. LXMERT has
been pre-trained on five tasks.5 It has 19 attention layers: 9 and 5 self-attention
layers in the language and visual encoders, respectively, and 5 cross-attention
layers. We process the output corresponding to the CLS token as in RoBERTa.
Similarly, we consider both the pre-trained version (LXMERT) and the one
trained from scratch (LXMERT-S).
          </p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Experiments</title>
        <p>We compare the models described above using human dialogues, aiming to shed
light on how the encoders capture the information that is salient for guessing the
target object.</p>
        <sec id="sec-5-3-1">
          <title>Task Success</title>
          <p>Dialogues produced by human players of the GuessWhat?! game are expected to
contain, together with the image they are about, the information necessary to
detect the target object among the candidates. We refer to them as Ground
Truth (GT) dialogues. As we can see in Table 1, the Guesser based on a blind
encoder (LSTM or RoBERTa, from scratch or pre-trained) obtains results higher
than or comparable with V-LSTM.6</p>
          <p>
            Table 2 reports the accuracy by clusters of games based on the dialogue
length. All models reach a very high and similar accuracy in short games and
differ more in longer ones. Most of the boost obtained by RoBERTa seems to
come in longer dialogues, where its from-scratch version (RoBERTa-S) performs
on a par with the other models.
5 Masked cross-modality language modeling, masked object prediction via RoI-feature
regression, masked object prediction via detected-label classification, cross-modality
matching, and image question answering.
6 The model proposed in [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] based on ViLBERT obtains an accuracy on GuessWhat?!
with human dialogues of 65.04% when trained together with the other 11 tasks and
62.81% when trained only on it.
          </p>
          <p>These results show that the human dialogue history alone is quite informative
for accomplishing the task. If we go back to the example in Figure 1, we realize it is
possible to succeed in that game if we are given the dialogue only and are asked
to select the target object (the scissor on the left) among candidates for which
we are told the category and the coordinates, as is the case for the Guesser.</p>
          <p>In the following, we run an in-depth analysis to understand whether
models are able to identify salient information independently of the position in
which it occurs.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>Are Models Sensitive to the Strategy Seen during Training?</title>
          <p>In Section 3, we have seen that human dialogues tend to share a specific strategy,
i.e. questions asked in the first turns are rather short whereas those in
the last turns provide relevant details about the most probable target object.
We wonder whether the models under analysis become sensitive to the
above-mentioned strategy and learn to focus on some turns more than others rather
than on the actual salient QA pair.</p>
          <p>
            Following [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], we perturb the dialogue history in the test set by reversing
the order of turns from the last to the first one (reversed). Differently from them,
given the nature of the GuessWhat?! dialogue history, we value positively
models that are robust to this change in the dialogue history order. Our experiment
(Table 1) shows that Transformers are less sensitive than LSTMs to the order
in which QA pairs are provided. Interestingly, the pre-training phase seems to
mitigate the effect of the change of order even more: while RoBERTa has
a drop of just 1.4 points, the accuracy of its from-scratch counterpart drops by 6.4.
          </p>
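The "reversed" perturbation itself is simple to state precisely: the list of question-answer turns is reversed while each turn is left intact. A minimal sketch (the dialogue content is the Figure 1 example; the tuple representation is an assumption):

```python
# Sketch of the "reversed" perturbation used above, following [23]:
# the order of turns is reversed, each QA pair staying intact.

def reverse_history(qa_pairs):
    """Return the dialogue history with its turns in reversed order."""
    return list(reversed(qa_pairs))

dialogue = [
    ("Is it on a wooden surface?", "Yes"),
    ("Is it red?", "No"),
    ("Is it the scissor on the left of the picture?", "Yes"),
]
reversed_dialogue = reverse_history(dialogue)
print(reversed_dialogue[0][0])  # the last human question now comes first
```

The reversed history is then re-encoded exactly as the GT one, so any accuracy drop is attributable to the order change alone.
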
          <p>In other words, (pre-trained) Transformers seem to be able to identify salient
information independently of the position in which it is provided within the
dialogue history.</p>
        </sec>
        <sec id="sec-5-3-4">
          <title>The role of the last question</title>
          <p>In Section 3 we have seen that the last turn is usually answered positively and is
quite informative for detecting the target object. We wonder whether this is reflected
in how models distribute their attention across turns within a dialogue. To this
end, we analyze how much each turn contributes to the overall self-attention
within a dialogue by summing the attention of each token within a turn. We run
this analysis for LXMERT and RoBERTa in their various versions: all models
put more attention on the last turn when the GT order of turns is given.</p>
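The per-turn aggregation described above (summing token-level attention within each turn) can be sketched as follows. This is an illustration under stated assumptions: how the per-token attention mass is obtained (which layer, head, or query position) is an implementation choice not specified here, and the toy numbers are invented.

```python
import numpy as np

def attention_per_turn(token_attention, turn_ids, n_turns):
    """Sum token-level attention mass within each turn.

    token_attention: per-token attention weights (e.g. averaged over heads);
    turn_ids: turn index of each token, with -1 marking special tokens
    (CLS/SEP), which are reported separately in Figure 5.
    """
    per_turn = np.zeros(n_turns)
    for att, t in zip(token_attention, turn_ids):
        if t >= 0:
            per_turn[t] += att
    return per_turn

# Toy example: 5 tokens, the first being CLS, over a 2-turn dialogue.
token_attention = np.array([0.10, 0.20, 0.30, 0.15, 0.25])
turn_ids = [-1, 0, 0, 1, 1]
per_turn = attention_per_turn(token_attention, turn_ids, n_turns=2)
```

Plotting `per_turn` for each dialogue, aggregated over the test set, yields attention-per-turn profiles like the one reported in Figure 5.
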
          <p>In Table 1, we have seen that Transformers are more robust than the other
models when the dialogue history is presented in the reversed order (the first
QA pair of the GT is presented as the last turn and the last QA pair is presented
as the first turn). Our analysis of the attention heads of RoBERTa and LXMERT
shows that these models, both in their from-scratch and pre-trained versions,
focus more on the question asked last also in the reversed test set, where it is
presented in the first position. This shows they are still able to identify the most
salient information. In Figure 5, we report the attention per turn of LXMERT-S
when receiving the GT and the reversed test set.</p>
          <p>Fig. 5: Attention assigned by LXMERT-S to each turn in a dialogue when the
dialogue history is given in the GT order (from QA1 to QA5) or in the
reversed order (from QA5 to QA1).</p>
        </sec>
        <sec id="sec-5-3-5">
          <title>Details for reproducibility</title>
          <p>We used the GuessWhat?! dataset in our experiments (http://guesswhat.ai/
download). The dataset contains 155,000 English dialogues about approximately
66,000 different images. The Train split contains 108,000 datapoints, the
Validation split 23,000 datapoints, and the Test split 23,000 datapoints. We considered
only the dialogues corresponding to games in which humans succeeded and having
at most 10 turns.</p>
          <p>For training LSTM-based models we adapted the source code available
at https://github.com/shekharRavi/Beyond-Task-Success-NAACL2019 and
at https://github.com/GuessWhatGame/guesswhat/. For training transformer-based
models we adapted the source code available at https://github.com/
huggingface/transformers. The scripts for all the experiments and the
modified models will be made available upon acceptance. For all models, we used
the same hyperparameters as the original works. When adapting Transformers
to the GuessWhat?! task, we scaled the representation of the CLS token from
768 to 512. We used PyTorch 1.0.1 for all models except for the LSTM, for which
we used TensorFlow 1.3. All models are trained with the Adam optimizer. For
transformer-based models we used a batch size of 16, a weight decay of
0.01, gradient clipping at 5, and a learning rate warmed up over
the first 10% of iterations to a peak value of 0.00001 and then linearly decayed.</p>
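The warmup-then-linear-decay schedule used for the transformer-based models can be sketched as a small function. The peak value and the 10% warmup fraction are taken from the text; treating the decay endpoint as zero is an assumption.

```python
def learning_rate(step, total_steps, peak=1e-5, warmup_frac=0.10):
    """Linear warmup over the first 10% of iterations to the peak value,
    then linear decay (assumed here to reach zero at the last step)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # warmup: grow linearly from peak/warmup_steps up to peak
        return peak * (step + 1) / warmup_steps
    # decay: shrink linearly over the remaining steps
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with `total_steps=1000` the rate reaches the peak of 1e-5 at step 99 and falls back to zero by step 1000.
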
          <p>Regarding the infrastructure, we used 1 Titan V GPU. LSTM based models
took about 15 hours for completing 100 training epochs. Transformer based
models took about 4 days for completing 25 training epochs. Each experiment
took about 10 minutes to evaluate the best trained models.</p>
          <p>Details on the best epoch, the validation accuracy, and the number of
parameters of each model are reported in Table 4.</p>
          <p>Our detailed analysis of the GuessWhat?! dataset has revealed features of its
games that we have exploited to run a diagnostic analysis of SoA models.</p>
          <p>Our comparative analysis has shown that Transformers are less sensitive than
LSTMs to the order in which QA pairs are provided and that their pre-trained
versions are even stronger at detecting salient information within the dialogue
history, independently of the position in which it is provided.</p>
          <p>Furthermore, our results show that RoBERTa is the encoder that provides
the Guesser with the most informative representation of the dialogue history.
Its advantage is particularly strong in longer dialogues. The dialogue already
contains all the information necessary to guess among the candidates: for both LSTM-
and transformer-based models, the blind version obtains results higher than or
comparable with its multimodal counterpart. We conjecture that this is due
to the fact that the Guesser has access to the category of the target object.
Important progress has been made on multimodal models since the introduction
of the GuessWhat?! game. It would be interesting to see how SoA models would
perform when they have to rely on visual information rather than the raw category.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>Acknowledgments</title>
        <p>We kindly acknowledge the support of NVIDIA Corporation with the donation
of the GPUs used in our research at the University of Trento. We acknowledge
SAP for sponsoring the work.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bard</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doherty</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrod</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowtko</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAllister</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sotillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The HCRC map task corpus</article-title>
          .
          <source>Language and Speech</source>
          <volume>34</volume>
          ,
          <fpage>351</fpage>
          -
          <lpage>366</lpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kholy</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , Cheng, Y.,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>UNITER: Learning universal image-text representations</article-title>
          (
          <year>2019</year>
          ), arXiv:1909.11740
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khandelwal</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>What does BERT look at? An analysis of BERT's attention</article-title>
          .
          <source>In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . pp.
          <fpage>276</fpage>
          –
          <lpage>286</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kottur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moura</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning</article-title>
          .
          <source>In: 2017 IEEE International Conference on Computer Vision</source>
          . pp.
          <fpage>2951</fpage>
          –
          <lpage>2960</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          –
          <lpage>4186</lpage>
          . Association for Computational Linguistics, Minneapolis, Minnesota (Jun
          <year>2019</year>
          ). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kholy</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Multi-step reasoning via recurrent dual attention for visual dialog</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>6463</fpage>
          –
          <lpage>6474</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Haber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Baumgartner, T.,
          <string-name>
            <surname>Takmaz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelderloos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The PhotoBook dataset: Building common ground through visually-grounded dialogue</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>1895</fpage>
          –
          <lpage>1910</lpage>
          (Jul
          <year>2019</year>
          ). https://doi.org/10.18653/v1/P19-1184, https://www.aclweb.org/anthology/P19-1184
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balakrishnan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eric</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings</article-title>
          .
          <source>In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>1766</fpage>
          –
          <lpage>1776</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>770</fpage>
          –
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ilinykh</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zarrieß</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlangen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Tell Me More: A Dataset of Visual Scene Description Sequences</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Natural Language Generation</source>
          . pp.
          <fpage>152</fpage>
          –
          <lpage>157</lpage>
          (
          <year>2019</year>
          ), https://www.aclweb.org/anthology/W19-8621
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hariharan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition</source>
          . arXiv:1612.06890 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kafle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>An analysis of visual question answering algorithms</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <fpage>1965</fpage>
          –
          <lpage>1973</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kaushik</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipton</surname>
            ,
            <given-names>Z.C.</given-names>
          </string-name>
          :
          <article-title>How much reading does reading comprehension require? A critical investigation of popular benchmarks</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>5010</fpage>
          –
          <lpage>5015</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yatskar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          :
          <article-title>VisualBERT: A simple and performant baseline for vision and language</article-title>
          (
          <year>2019</year>
          ), arXiv:1908.03557
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft COCO: Common objects in context</article-title>
          .
          <source>In: Proceedings of ECCV (European Conference on Computer Vision)</source>
          . pp.
          <fpage>740</fpage>
          –
          <lpage>755</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          (
          <year>2019</year>
          ), arXiv:1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>13</fpage>
          –
          <lpage>23</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goswami</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>12-in-1: Multi-task vision and language representation learning</article-title>
          .
          <source>In: Proceedings of CVPR</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Murahari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Large-scale pretraining for visual dialog: A simple state-of-the-art baseline</article-title>
          . arXiv preprint arXiv:1912.02379 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training</article-title>
          .
          <source>In: Proceedings of AAAI</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Visual dialogue state tracking for question generation</article-title>
          .
          <source>In: Proceedings of 34th AAAI Conference on Artificial Intelligence</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Large-scale answerer in questioner's mind for visual dialog question generation</article-title>
          .
          <source>In: Proceedings of International Conference on Learning Representations, ICLR</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Sankar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Do neural dialog systems use the conversation history effectively? An empirical study</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>32</fpage>
          –
          <lpage>37</lpage>
          (
          <year>2019</year>
          ), https://www.aclweb.org/anthology/P19-1004
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Shekhar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pezzelle</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimovich</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herbelot</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nabi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sangineto</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>FOIL it! Find one mismatch between image and language caption</article-title>
          .
          <source>In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <fpage>255</fpage>
          –
          <lpage>265</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Shekhar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venkatesh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Baumgartner, T.,
          <string-name>
            <surname>Bruni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          . pp.
          <fpage>2578</fpage>
          –
          <lpage>2587</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.18653/v1/N19-1265, https://www.aclweb.org/anthology/N19-1265
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>VL-BERT: Pre-training of generic visual-linguistic representations</article-title>
          .
          <source>In: ICLR</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Suhr</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artzi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A corpus of natural language for visual reasoning</article-title>
          .
          <source>In: Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>217</fpage>
          –
          <lpage>223</lpage>
          . Association for Computational Linguistics, Vancouver, Canada (Jul
          <year>2017</year>
          ), http://aclweb.org/anthology/P17-2034
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3104</fpage>
          –
          <lpage>3112</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>LXMERT: Learning cross-modality encoder representations from transformers</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . pp.
          <fpage>5103</fpage>
          –
          <lpage>5114</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Udagawa</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aizawa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A natural language corpus of common grounding under continuous and partially-observable context</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . vol.
          <volume>33</volume>
          , pp.
          <fpage>7120</fpage>
          –
          <lpage>7127</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>5998</fpage>
          –
          <lpage>6008</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>de Vries</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strub</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietquin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>GuessWhat?! Visual object discovery through multi-modal dialogue</article-title>
          .
          <source>In: 2017 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>5503</fpage>
          –
          <lpage>5512</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>Z.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Making history matter: History-advantage sequence training for visual dialog</article-title>
          .
          <source>In: Proceedings of the International Conference on Computer Vision</source>
          (ICCV) (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog</article-title>
          .
          <source>In: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue</source>
          . pp.
          <fpage>140</fpage>
          –
          <lpage>150</lpage>
          (
          <year>2018</year>
          ), https://www.aclweb.org/anthology/W18-5015
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van den Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Goal-oriented visual question generation via intermediate rewards</article-title>
          .
          <source>In: Proceedings of the European Conference of Computer Vision (ECCV)</source>
          . pp.
          <fpage>186</fpage>
          –
          <lpage>201</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tresp</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Improving goal-oriented visual dialog agents via advanced recurrent nets with tempered policy gradient</article-title>
          .
          <source>In: Proceedings of IJCAI</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>