=Paper=
{{Paper
|id=Vol-1926/paper4
|storemode=property
|title=Criteria for Human-Compatible AI in Two-Player Vision-Language Tasks
|pdfUrl=https://ceur-ws.org/Vol-1926/paper4.pdf
|volume=Vol-1926
|authors=Cheolho Han,Sang-Woo Lee,Yujung Heo,Wooyoung Kang,Jaehyun Jun,Byoung-Tak Zhang
|dblpUrl=https://dblp.org/rec/conf/ijcai/HanLHKJZ17
}}
==Criteria for Human-Compatible AI in Two-Player Vision-Language Tasks==
Cheolho Han¹,†, Sang-Woo Lee¹,†, Yujung Heo¹, Wooyoung Kang¹, Jaehyun Jun², Byoung-Tak Zhang¹,²
¹School of Computer Science and Engineering, Seoul National University
²Interdisciplinary Program in Neuroscience, Seoul National University
{chhan, slee, yjheo, wykang, jhjun, btzhang}@bi.snu.ac.kr
†These authors contributed equally.

Abstract

We propose rule-based search systems that outperform not only the state of the art but also human performance, measured in accuracy, in GuessWhat?!, a vision-language game in which either of the two players can be a human. Although these systems achieve high accuracy, they do not meet other requirements for an AI system that communicates effectively with humans. To clarify what they lack, we suggest the use of three criteria for effective communication with humans in vision-language tasks. These criteria also apply to other two-player vision-language tasks that require communication with humans, e.g., ReferIt.

Figure 1: An example of the GuessWhat?! game. The correct object is highlighted by a green mask.

1 Introduction

Recent advances in computer vision and natural language processing have drawn researchers' attention to the intersection of these two areas: vision-language tasks. An early task of this kind was image description [Kiros et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Johnson et al., 2016b; Mao et al., 2016]. In the image description task, an image is given, and the model is supposed to generate descriptions or captions of the image. However, generated descriptions have been difficult to evaluate, and results have not been directly related to how well the model understands the image.

To assess the comprehension of the model, visual question answering (VQA) was introduced [Antol et al., 2015; Johnson et al., 2016a; Agrawal et al., 2017; Fukui et al., 2016; Kim et al., 2016]. In VQA, an image and a question about the image are given, and the model is supposed to answer the question. However, the communication is one-way, and the model has only the passive role of answering questions.
Some vision-language tasks require active bidirectional communication between two (or possibly more) agents. To enable interactive communication over an image, visual dialogues were introduced [Lazaridou et al., 2016; Das et al., 2016; Mao et al., 2016; de Vries et al., 2017; Strub et al., 2017; Das et al., 2017]. In visual dialogues, an image is given, and two (or possibly more) agents communicate over the image. To date, visual dialogues involving agents with specific roles or tasks have mainly been studied.

The ReferIt game [Kazemzadeh et al., 2014] is an example of visual dialogue. It is a two-player game of referring to objects in images of natural scenes. One player is shown an image with a target object and has to describe the object so as to distinguish it from the others; what it says is called the referring expression. The other player is shown the same image and the referring expression written by the first player and guesses the target object.

The GuessWhat?! game is another example of visual dialogue (Fig. 1). It consists of question-answer dialogue about a given image between two players. Its goal is to locate an unknown object in a rich image scene by asking a sequence of questions. One player is randomly assigned an object in the image, and the other player has to locate the hidden object with a series of Yes/No questions.

In the two tasks mentioned above, each player can be either a human or an agent. If one player is a human and the other is an agent, the agent must generate meaningful dialogue with the human in natural, conversational language about a visual image. This aspect is crucial to solving the task successfully. Such tasks have been evaluated in terms of task-specific performance metrics such as accuracy or the success rate of the task. However, these are not the only criteria needed to enable efficient bidirectional communication with humans.

In this paper, we pose the problem that we need criteria beyond such metrics to measure and analyze the bidirectional communication between human and agent. To demonstrate this, we first tackle the GuessWhat?! game mentioned above and show that our proposed rule-based search systems outperform not only the state of the art but also the human performance measure, the success rate of the task. Then, we suggest the use of some criteria for efficient bidirectional communication between human and agent in vision-language tasks such as the GuessWhat?! game.

The rest of the paper is organized as follows. First, we review related work on vision-language tasks in Section 2, and we propose rule-based search systems that outperform state-of-the-art performance in Section 3. Then, we suggest the use of some criteria to measure and analyze the bidirectional communication between human and agent in Section 4. Finally, we discuss conclusions and future work in Section 5.

2 Related Works

2.1 Image Description

Automatic image description is a challenging problem that involves analyzing an image, reasoning about contextual information among the objects in the image, and generating textual descriptions. It was the first stage of research on vision-language grounding. [Vinyals et al., 2015] proposed the neural image caption (NIC) generator, inspired by advances in machine translation. They replaced the encoding step, which extracts abstract representations of the source sentence with an RNN, with a CNN fed the given image.
Encouraged by advances in employing attention in machine translation and object recognition, an attention mechanism was introduced by [Xu et al., 2015]. The mechanism attends to salient parts of the given image while generating its caption, and the learned alignments were shown to correspond very well to human intuition.

Many previous papers on image description have focused on describing the entire image. In contrast, [Johnson et al., 2016b] address a new task, dense captioning, which requires a model to predict a set of descriptions for regions of a given image. Understanding each object or part, not only the entire image, is important for high-level scene understanding. [Mao et al., 2016] also focused on generating an unambiguous description of a specific object or region in an image. They considered both description generation and description comprehension, and jointly modeled both tasks by combining a CNN with an RNN.

On these tasks, the models play only a passive role of generating descriptions of a given image; there is no bidirectional communication. Moreover, generated descriptions have been difficult to evaluate, and results have not been directly related to how well the model understands the image. Therefore, the task could be extended to bidirectional communication tasks such as the ReferIt or GuessWhat?! game, which need further consideration regarding evaluation and analysis.

2.2 ReferIt

ReferIt [Kazemzadeh et al., 2014; Lazaridou et al., 2016] is a two-player game of referring to an object in an image of natural scenes. One player is shown an image with a target object and has to describe the object so as to distinguish it from the others; what it says is called the referring expression. The other player is shown the same image and the referring expression written by the first player and guesses the target object. To succeed at this game, both agents should cooperate and learn the relation between vision and language. [Lazaridou et al., 2016] designed a ReferIt game between two agents, constituted the referring expression as a binary vector exchanged between the two agents, formulated the game as classification, and solved the classification problem with neural networks. On this task, the agents develop their own artificial language from the need to communicate in order to succeed at the game. The language showed some correlation with human language, but also some mismatches. If one player is an agent and the other is a human, the referring expression may not carry the exact meaning and may cause confusion between the players. For meaningful vision-language integration between human and agent, we argue that we need to analyze the details of these referring expressions beyond metrics such as the success rate of the game and accuracy.

2.3 GuessWhat?!

GuessWhat?! is a cooperative two-player guessing game proposed by [de Vries et al., 2017]. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. One player, called the Oracle, is randomly assigned an object in the image, and the other player, called the Questioner, does not know which object was assigned to the Oracle. The goal of the Questioner is to locate the hidden object with a series of Yes/No questions answered by the Oracle. If the Questioner selects the right object, we consider the game successful.

[de Vries et al., 2017] collected a large-scale human-played GuessWhat?! dataset consisting of 800K visual question-answering pairs on 66K images and proposed a baseline deep learning model. To solve the proposed task successfully, [de Vries et al., 2017] suggested that an agent requires higher-level image understanding, including spatial reasoning, visual properties, object taxonomy, and interaction. The authors also proposed that the agent should understand the relationships between objects and how they are expressed in natural language. The baseline model consists of three parts: the Oracle, the Guesser, and the Question generator. The Oracle is a simple neural network that is fed embedded inputs and classifies the answer as Yes, No, or N/A. The role of the Guesser is to predict the hidden object: it compares dot products of the embedded vectors of the image, the dialogue, and the information of the candidate objects, and selects the most probable object among the candidates. The Question generator produces questions reflecting the context of the previous question-answer pairs, based on the Hierarchical Recurrent Encoder-Decoder (HRED) model.

As follow-up research, [Strub et al., 2017] present an end-to-end reinforcement learning optimization of the question generation task to find the correct object efficiently. They define the GuessWhat?! game as a Markov Decision Process: a state x_t is the sequence of tokens generated in the dialogue until time t, and an action u_t selects a new word, with a zero-one reward depending on the Questioner's final choice. They train the question generator with policy gradient and obtain about a 17% improvement in accuracy compared with the baseline model.
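To make this MDP formulation concrete, here is a minimal policy-gradient sketch in the spirit of the above description. It is not the architecture of [Strub et al., 2017]: the bigram policy, the tiny vocabulary, and the stand-in reward are all illustrative assumptions of ours.

```python
import numpy as np

# Toy REINFORCE sketch of the MDP described above: a state is the
# dialogue so far, an action picks the next word, and a zero-one
# reward arrives at the end of the episode. Vocabulary and reward
# below are placeholders; word 0 doubles as a start token.
VOCAB = ["is", "it", "left", "right", "person", "red", "?", "<eos>"]
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(len(VOCAB), len(VOCAB)))  # bigram logits

def next_word_probs(prev):
    logits = theta[prev]
    p = np.exp(logits - logits.max())           # softmax policy pi(. | prev)
    return p / p.sum()

def sample_question(max_len=8):
    words, prev = [], 0
    for _ in range(max_len):
        prev = rng.choice(len(VOCAB), p=next_word_probs(prev))
        words.append(prev)
        if VOCAB[prev] == "<eos>":
            break
    return words

def game_reward(words):
    # Stand-in for the zero-one game reward: in the real game this is 1
    # only if the guesser picks the correct object after the dialogue.
    return 1.0 if "?" in (VOCAB[w] for w in words) else 0.0

LR = 0.1
for episode in range(300):                      # policy-gradient loop
    words = sample_question()
    r = game_reward(words)
    prev = 0
    for w in words:                             # REINFORCE update per action
        p = next_word_probs(prev)
        grad_log = -p
        grad_log[w] += 1.0                      # d log pi(w | prev) / d logits
        theta[prev] += LR * r * grad_log
        prev = w
```

In the actual system the policy is an RNN over the full dialogue and the reward comes from the Guesser's choice, but the update has this same REINFORCE shape.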
The GuessWhat?! game has so far been measured only by the success rate of the game. However, as shown in Section 3, a rule-based search system can attain not only state-of-the-art but also human performance. On this basis, we underline that it is not enough to measure bidirectional communication only by metrics such as the success rate of the game or accuracy, and we propose criteria for more meaningful evaluation in Section 4.

3 Rule-Based Search Systems

Table 1: Test accuracy on the GuessWhat?! dataset. Even though the image information is not used, the proposed method outperforms the state-of-the-art deep learning methods in two turns and exceeds human performance in four or five turns. We improved the system by tuning the first division of an image, utilizing statistics on the spatial information (denoted Fine-Tune). To explore the effect of segmentation, we stole a look at the segmentation information of the candidates (denoted Segment Info). The deep learning methods constructed the question generator with the hierarchical recurrent encoder-decoder (HRED) or a recurrent neural network (RNN) with reinforcement learning (RL).

Model                          | Accuracy
-------------------------------|---------
Baseline                       | 16.04
1 Question                     | 38.96
2 Questions                    | 56.25
3 Questions                    | 76.61
4 Questions                    | 85.85
5 Questions                    | 94.34
1 Question w/ Fine-Tune        | 39.82
2 Questions w/ Fine-Tune       | 59.40
1 Question w/ Segment Info     | 48.12
2 Questions w/ Segment Info    | 87.67
HRED [de Vries et al., 2017]   | 46.8
RNN w/ RL [Strub et al., 2017] | 52.3
Human [de Vries et al., 2017]  | 90.8
Human [Strub et al., 2017]     | 84.4

3.1 Methods

We constructed rule-based search systems that use only the spatial information of the target object for GuessWhat?!. We divide an image evenly into three parts by two vertical lines and then divide each part in turn by horizontal or vertical lines (Fig. 2). In the current region of interest divided by two vertical lines, the answer is "left" if the target is in the left-side region, "right" if it is in the right-side region, and "N/A" if it is in the middle region. Similarly, the answer is "top", "bottom", or "N/A" when a region is divided by horizontal lines. Through this protocol, or language, the players can ask and answer about the location of the center of the target object. The rule-based search systems use this simple language to locate the target object.

Figure 2: A sequence of divisions of an image by rule-based search systems.

We may also utilize statistics on the distribution of the spatial information. For the first turn, we may divide the image unevenly. We found that a middle band of width 0.18 covers 1/3 of the target objects while the remaining width of 0.82 covers the other 2/3, meaning the distribution of target objects is denser in the middle than an even division would suggest. Therefore, we set the first vertical lines at 0.41 and 0.59.
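The following is a minimal sketch of this protocol, assuming normalized image coordinates and an oracle that knows the center of the target object. The function names are ours, and the even three-way split is shown; the Fine-Tune variant would place the first vertical lines at 0.41 and 0.59 instead.

```python
def oracle_answer(target_x, target_y, region, axis):
    """Answer for the location of the target object's center."""
    x0, x1, y0, y1 = region
    if axis == "vertical":                      # region split by 2 vertical lines
        a, b = x0 + (x1 - x0) / 3, x0 + 2 * (x1 - x0) / 3
        return "left" if target_x < a else ("right" if target_x > b else "N/A")
    a, b = y0 + (y1 - y0) / 3, y0 + 2 * (y1 - y0) / 3
    return "top" if target_y < a else ("bottom" if target_y > b else "N/A")

def narrow(region, axis, answer):
    """Shrink the region of interest to the third named by the answer."""
    x0, x1, y0, y1 = region
    if axis == "vertical":
        w = (x1 - x0) / 3
        return {"left": (x0, x0 + w, y0, y1),
                "N/A": (x0 + w, x1 - w, y0, y1),
                "right": (x1 - w, x1, y0, y1)}[answer]
    h = (y1 - y0) / 3
    return {"top": (x0, x1, y0, y0 + h),
            "N/A": (x0, x1, y0 + h, y1 - h),
            "bottom": (x0, x1, y1 - h, y1)}[answer]

# One game: five alternating questions narrow the region of interest;
# the guesser then picks the candidate whose center lies inside it.
target_x, target_y = 0.7, 0.2                   # normalized target center
region = (0.0, 1.0, 0.0, 1.0)
for t in range(5):
    axis = "vertical" if t % 2 == 0 else "horizontal"
    region = narrow(region, axis, oracle_answer(target_x, target_y, region, axis))
print(region)
```

Each turn shrinks the region to one third of its size along one axis, so after five turns its area is at most (1/3)^5 of the image, which helps explain why a handful of questions suffices in Table 1.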
Given a segmentation model, we can further improve the system. To explore this case, we stole a look at the segmentation information of the candidates, which is supposed to be known by the time the guesser selects a candidate after a series of question-answer pairs. We can then implement a binary search based on the spatial information of the candidates (see the sketch at the end of this subsection). If a segmentation model gives segmentations similar to those of the candidates in the dataset, we obtain an algorithm close to binary search, which is optimal.

The proposed rule-based search systems do not break the rules of the GuessWhat?! game: spatial information is commonly used by humans, as the dataset shows. Moreover, we can substitute other features or properties for the spatial information. We can choose a real-valued feature without point mass, such as area or color, and the feature then yields another rule-based search algorithm. Such an algorithm is optimal (e.g., for the brightness of the center of the target object) or near-optimal (e.g., for the area, given an optimal segmentation model). We can also use real-valued features with point mass, or unordered (nominal or categorical) features, by choosing a feature that divides the candidates evenly, which gives a near-optimal algorithm.
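Below is a hedged sketch of the Segment Info idea, reduced to one dimension: assuming the candidates' center x-coordinates are known (e.g., from segmentation), each spatial yes/no question can halve the candidate set. The helper names and the toy oracle are ours.

```python
def binary_search_candidates(centers, oracle_is_left):
    """Halve the candidate set each turn with spatial yes/no questions.

    centers: candidate-object center x-coordinates, assumed known from
    segmentation. oracle_is_left(threshold) stands in for the oracle
    answering "is the target strictly left of <threshold>?".
    """
    candidates = sorted(centers)
    turns = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        if oracle_is_left(candidates[mid]):
            candidates = candidates[:mid]       # keep the left half
        else:
            candidates = candidates[mid:]       # keep the right half
        turns += 1
    return candidates[0], turns

# Eight candidates are resolved in log2(8) = 3 questions, consistent
# with the fast climb of the Segment Info rows in Table 1.
centers = [0.05, 0.12, 0.33, 0.41, 0.58, 0.66, 0.81, 0.93]
target = 0.58
obj, turns = binary_search_candidates(centers, lambda thr: target < thr)
print(obj, turns)   # -> 0.58 3
```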
3.2 Results

[de Vries et al., 2017] and [Strub et al., 2017] constructed the question generator with the hierarchical recurrent encoder-decoder (HRED) and a recurrent neural network (RNN) with reinforcement learning, respectively (Table 1). The oracle and the guesser were trained before interacting with the question generator; therefore, the oracle and the guesser do not benefit from interaction. In contrast, in the proposed rule-based search systems, the oracle and the questioner share a set of strict rules. The proposed systems outperformed the state-of-the-art accuracy in two turns and the human accuracy in four or five turns. This is a remarkable improvement in accuracy considering that [de Vries et al., 2017] and [Strub et al., 2017] generated 5 and 8 questions, respectively, to reach their accuracy, and that the minimum, mode, mean, and maximum of the number of questions asked by humans are 1, 3, 5.2, and 24, respectively. Furthermore, we improved the system by tuning the first division of an image, utilizing statistics on the spatial information (denoted Fine-Tune). To explore the effect of segmentation, we stole a look at the segmentation information of the candidates (denoted Segment Info).

4 More Criteria

The search system exploits low-level features and strictly follows a predefined set of rules without considering uncertainty or ambiguity. If either the oracle or the questioner is a human, the search system may not communicate with the human successfully. If the search system takes the part of the questioner, it may explain how it works at the beginning. If the search system plays the role of the oracle, it is unlikely to take the initiative, so it may not have a chance to show how it works. To develop more satisfying systems, we investigate the characteristics of an effective AI system for bidirectional communication with humans in vision-language tasks. We first present some considerations in designing such systems, then review criteria used in vision-language tasks and suggest criteria to be adopted in GuessWhat?!.

To develop effective AI systems for bidirectional communication with humans in vision-language tasks, we need to formulate a problem or task first. Many vision-language tasks have been proposed for this purpose. A task naturally comes with a main set of criteria, called objective functions and constraints in optimization. However, this main set may not be enough, and then we need to add more criteria. Several criteria have been used in vision-language tasks. We may group these criteria into subjective, task-specific, and similarity criteria.

Figure 3: The GuessWhat?! setting with a judge that complements or substitutes for human evaluation.

Subjective evaluation by humans is widely performed in many areas of AI research, including vision-language tasks. People observe or interact with systems and then evaluate how well, or how human-like, the systems behave. Subjective evaluation is not objective and so may not be considered scientific, but it is crucial because it tells how humans actually feel about the system. However, subjective evaluation is generally costly.

Task-specific criteria are essential to show how well the system performs on each task. In vision-language tasks, cross-modal classification and retrieval metrics have been used, including accuracy, median rank (mRank), and precision/recall at k (P/R@k) [Kent et al., 1955]. These criteria are measured on data collected before constructing the system, so they cost less than subjective criteria, which require additional human effort.

Similarity criteria evaluate how human-like systems behave. In language generation tasks such as machine translation and text summarization, language similarity metrics have been used, including bilingual evaluation understudy (BLEU) [Papineni et al., 2002], metric for evaluation of translation with explicit ordering (METEOR) [Banerjee and Lavie, 2005], recall-oriented understudy for gisting evaluation (ROUGE) [Lin, 2004], and consensus-based image description evaluation (CIDEr) [Vedantam et al., 2015]. These similarity criteria are computed automatically, so they cost less than subjective criteria.

Adversarial evaluation through neural networks [Bowman et al., 2015; Kannan and Vinyals, 2017; Li et al., 2017] has been suggested as a similarity criterion. Unlike the previous similarity criteria, adversarial evaluation is not fixed. Instead, it changes as a neural network called a discriminator learns whether the speaker is a human or not. After learning, the discriminator distinguishes the human from the agent on test data. Since the discriminator is a neural network, it may catch various complex patterns that other similarity metrics cannot capture. However, unlike the other similarity metrics, adversarial evaluation requires training the discriminator.
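As a hedged, minimal sketch of adversarial evaluation, the following trains a logistic-regression discriminator on bag-of-words features of utterances labeled human vs. agent; the toy data and featurization are our own illustration, not the setup of the cited works.

```python
import numpy as np

# Toy adversarial evaluation: a discriminator learns to tell
# human-written utterances from agent-generated ones.
VOCAB = {"is": 0, "it": 1, "the": 2, "left": 3, "red": 4, "one": 5, "?": 6}

def featurize(utterance):
    v = np.zeros(len(VOCAB))                    # bag-of-words counts
    for tok in utterance.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    return v

human = ["is it the red one ?", "is it the one on the left ?"]
agent = ["is it is it ?", "left left red red ?"]
X = np.array([featurize(u) for u in human + agent])
y = np.array([1.0] * len(human) + [0.0] * len(agent))   # 1 = human

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):                           # logistic-regression training
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # sigmoid
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# The discriminator's held-out accuracy is the criterion: the closer to
# chance (0.5), the more human-like the agent's utterances.
test = featurize("is it the red one ?")
print("P(human):", 1.0 / (1.0 + np.exp(-(test @ w + b))))
```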
Beyond accuracy, we need to adopt more criteria in GuessWhat?!. In general, the more criteria the better when evaluating vision-language tasks, partially because we do not have a unique criterion that everyone agrees on. However, resources are limited, so we have to choose some criteria among them. Accuracy is the obvious task-specific criterion, but people would not be satisfied with accuracy achieved by two AI players alone. We need to solve the following constrained optimization problem to develop systems that can communicate with humans in GuessWhat?!. Given an objective functional f (the accuracy in GuessWhat?!), a human oracle O_h, and a human questioner Q_h, the optimal oracle O* and questioner Q* are given by solving

    max_{O,Q} f(O, Q)
    s.t. O is compatible with Q_h                                        (1)
         Q is compatible with O_h

where, for a threshold t > 0,

    O is compatible with Q_h  if  f(O, Q_h) > (1 - t) · max_O f(O, Q_h)
    Q is compatible with O_h  if  f(O_h, Q) > (1 - t) · max_Q f(O_h, Q)  (2)

This is a joint optimization problem and is difficult to solve, because it involves two optimization functions, interaction with human oracles, and observation of human questioners. Instead, a two-phase greedy optimization has commonly been used in previous works. Like the joint problem, it involves the observation of human questioners, but it involves only one optimization function at each phase and interaction with an oracle model instead of human oracles:

    Ô = argmax_O f(O, Q_h)                                               (3)
    Q̂ = argmax_Q f(Ô, Q)                                                 (4)

However, Ô and Q̂ are only marginally optimal, and Q̂ may not be compatible with O_h. We may employ human oracles to determine whether Q̂ is compatible with O_h, even though this is costly. Under the assumption that a questioner Q similar to the human questioner Q_h is compatible with O_h, similarity criteria may complement or substitute for human evaluation. Criteria that distinguish the AI system from humans are necessary for this purpose. The adversarial metric is a promising criterion among them because it involves a neural network, which can learn complex patterns, as a discriminator or judge that determines whether the two players are human-like (Fig. 3).
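To make Eqs. (1)-(4) concrete, here is a small sketch of the two-phase greedy optimization and the compatibility check over an invented payoff table f[o][q]; the oracle/questioner names and toy accuracies are assumptions of ours for illustration.

```python
# Toy illustration of Eqs. (1)-(4): f[o][q] is the accuracy when oracle o
# plays with questioner q. "O_h"/"Q_h" stand for the human players.
f = {
    "O_rule":   {"Q_rule": 0.94, "Q_rl": 0.30, "Q_h": 0.20},
    "O_neural": {"Q_rule": 0.25, "Q_rl": 0.52, "Q_h": 0.45},
    "O_h":      {"Q_rule": 0.22, "Q_rl": 0.40, "Q_h": 0.84},
}

def compatible_with_human_oracle(q, t=0.2):
    """Eq. (2): f(O_h, q) must exceed (1 - t) times the best f(O_h, .)."""
    best = max(f["O_h"].values())
    return f["O_h"][q] > (1 - t) * best

# Two-phase greedy optimization (Eqs. (3) and (4)), searching over agents:
o_hat = max((o for o in f if o != "O_h"), key=lambda o: f[o]["Q_h"])
q_hat = max((q for q in f[o_hat] if q != "Q_h"), key=lambda q: f[o_hat][q])

print(o_hat, q_hat, compatible_with_human_oracle(q_hat))
# -> O_neural Q_rl False: the marginally optimal questioner Q-hat fails
#    the compatibility check against the human oracle O_h.
```

The toy numbers deliberately exhibit the failure mode noted above: Q̂ maximizes f with the model oracle Ô yet is not compatible with O_h.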
We reviewed criteria used in vision-language tasks and suggested criteria to be adopted in GuessWhat?!. We grouped the criteria into subjective, task-specific, and similarity criteria and gave some examples. In GuessWhat?!, we need more criteria than accuracy alone. If we can afford it, we can employ human evaluation. Otherwise, we may choose similarity criteria. We recommend adversarial evaluation as a promising similarity criterion.

5 Conclusion

We proposed rule-based search systems that use only the spatial information of the target object for the GuessWhat?! game. The protocol can be regarded as an artificial language between agents on the GuessWhat?! task. Our rule-based search system outperformed the state-of-the-art accuracy in two turns and the human accuracy in four or five turns. In view of the results, we argue that we need to measure the performance of a system and analyze the details of its results with more concrete criteria, not just task-specific metrics such as the success rate of the game and accuracy. We suggested the use of criteria for bidirectional communication between humans and agents in vision-language tasks. Adversarial evaluation can be considered a promising similarity criterion.

Acknowledgments

This work was supported by the Institute for Information & Communications Technology Promotion (2015-0-00310-SW.StarLab) and the Korea Evaluation Institute of Industrial Technology (10044009-HRI.MESSI, 10060086-RISF).

References

[Agrawal et al., 2017] Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi Parikh. C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243, 2017.

[Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72, 2005.

[Bowman et al., 2015] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[Das et al., 2016] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. arXiv preprint arXiv:1611.08669, 2016.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.

[de Vries et al., 2017] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[Fukui et al., 2016] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.

[Johnson et al., 2016a] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.

[Johnson et al., 2016b] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.

[Kannan and Vinyals, 2017] Anjuli Kannan and Oriol Vinyals. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198, 2017.

[Kazemzadeh et al., 2014] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.

[Kent et al., 1955] Allen Kent, Madeline M. Berry, Fred U. Luehrs, and James W. Perry. Machine literature searching VIII: Operational criteria for designing information retrieval systems. Journal of the Association for Information Science and Technology, 6(2):93–101, 1955.

[Kim et al., 2016] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual QA. In Advances in Neural Information Processing Systems, pages 361–369, 2016.

[Kiros et al., 2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Multimodal neural language models. In ICML, volume 14, pages 595–603, 2014.

[Lazaridou et al., 2016] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Towards multi-agent communication-based language learning. arXiv preprint arXiv:1605.07133, 2016.
[Li et al., 2017] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.

[Lin, 2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.

[Mao et al., 2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[Strub et al., 2017] Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423, 2017.

[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

[Vinyals et al., 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.