=Paper=
{{Paper
|id=Vol-1926/paper4
|storemode=property
|title=Criteria for Human-Compatible AI in Two-Player Vision-Language Tasks
|pdfUrl=https://ceur-ws.org/Vol-1926/paper4.pdf
|volume=Vol-1926
|authors=Cheolho Han,Sang-Woo Lee,Yujung Heo,Wooyoung Kang,Jaehyun Jun,Byoung-Tak Zhang
|dblpUrl=https://dblp.org/rec/conf/ijcai/HanLHKJZ17
}}
==Criteria for Human-Compatible AI in Two-Player Vision-Language Tasks==
Cheolho Han¹,†, Sang-Woo Lee¹,†, Yujung Heo¹, Wooyoung Kang¹, Jaehyun Jun², Byoung-Tak Zhang¹,²
¹School of Computer Science and Engineering, Seoul National University
²Interdisciplinary Program in Neuroscience, Seoul National University
{chhan, slee, yjheo, wykang, jhjun, btzhang}@bi.snu.ac.kr
†These authors contributed equally.

Abstract

We propose rule-based search systems that outperform not only the state of the art but also human performance, measured in accuracy, in GuessWhat?!, a vision-language game in which either of the two players can be a human. Although these systems achieve high accuracy, they do not meet other requirements for an AI system that communicates effectively with humans. To clarify what they lack, we suggest the use of three criteria for effective communication with humans in vision-language tasks. These criteria also apply to other two-player vision-language tasks that require communication with humans, e.g., ReferIt.

Figure 1: An example of the GuessWhat?! game. The correct object is highlighted by a green mask.

1 Introduction

Recent advances in computer vision and natural language processing have drawn researchers' attention to the intersection of these two areas: vision-language tasks. An early task of this kind was image description [Kiros et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Johnson et al., 2016b; Mao et al., 2016]. In the image description task, an image is given, and the model is supposed to generate descriptions or captions of the image. However, generated descriptions have been difficult to evaluate, and results have not been directly related to how well the model understands the image.

To assess the comprehension of the model, visual question answering (VQA) was introduced [Antol et al., 2015; Johnson et al., 2016a; Agrawal et al., 2017; Fukui et al., 2016; Kim et al., 2016]. In VQA, an image and a question about the image are given, and the model is supposed to answer the question. However, the communication is one-way, and the model has only the passive role of answering questions.
Some vision-language tasks require active bidirectional communication between two (or possibly more) agents. To enable interactive communication over an image, visual dialogues were introduced [Lazaridou et al., 2016; Das et al., 2016; Mao et al., 2016; de Vries et al., 2017; Strub et al., 2017; Das et al., 2017]. In visual dialogues, an image is given, and two (or possibly more) agents communicate over the image. To date, visual dialogues involving agents with specific roles or tasks have mainly been studied.

The ReferIt game [Kazemzadeh et al., 2014] is an example of visual dialogue. It is a two-player game of referring to objects in images of natural scenes. One player is shown an image with a target object and has to describe the object so as to distinguish it from the others; what it says is called the referring expression. The other player is shown the same image and the referring expression written by the first player and guesses the target object.

The GuessWhat?! game is another example of visual dialogue (Fig. 1). It consists of question-answer dialogue about a given image between two players. Its goal is to locate an unknown object in a rich image scene by asking a sequence of questions. One player is randomly assigned an object in the image, and the other player has to locate the hidden object with a series of Yes/No questions.

In the two tasks mentioned above, each player can be either a human or an agent. If one player is a human and the other is an agent, the agent must generate meaningful dialogue with the human in natural, conversational language about a visual image. This aspect is crucial to solving the task successfully. Such tasks have been evaluated in terms of task-specific performance metrics such as accuracy or the success rate of the task. However, these are not the only criteria needed to enable efficient bidirectional communication with humans.

In this paper, we pose the problem that we need criteria beyond such metrics to measure and analyze the bidirectional communication between human and agent. To demonstrate this, we first tackle the GuessWhat?! game mentioned above and show that our proposed rule-based search systems outperform not only the state of the art but also the human performance measure, the success rate of the task. Then, we suggest the use of some criteria for efficient bidirectional communication between human and agent in vision-language tasks such as the GuessWhat?! game.

The rest of the paper is organized as follows. First, we review related work on vision-language tasks in Section 2, and we propose rule-based search systems that outperform state-of-the-art performance in Section 3. Then, we suggest the use of some criteria to measure and analyze the bidirectional communication between human and agent in Section 4. Finally, we discuss conclusions and future work in Section 5.

2 Related Works

2.1 Image Description

Automatic image description is a challenging problem that involves analyzing an image, reasoning about contextual information among the objects in the image, and generating textual descriptions. It was the first stage of research on vision-language grounding. [Vinyals et al., 2015] proposed the neural image caption (NIC) generator, inspired by advances in machine translation. They replaced the encoding step, which extracts abstract representations of the source sentence with an RNN, with a CNN fed the given image.
Encouraged by advances in employing attention in machine translation and object recognition, an attention mechanism was introduced by [Xu et al., 2015]. The mechanism attends to salient parts of the given image while generating its caption, and the learned alignments were shown to correspond very well to human intuition.

Many previous papers on image description have focused on describing the entire image. In contrast, [Johnson et al., 2016b] address a new task, dense captioning, which requires a model to predict a set of descriptions for regions of a given image. Understanding each object or part, not only the entire image, is important for high-level scene understanding. [Mao et al., 2016] also focused on generating an unambiguous description of a specific object or region in an image. They considered both description generation and description comprehension, and jointly modeled both tasks by combining a CNN with an RNN.

On these tasks, the models play only a passive role of generating descriptions of a given image; there is no bidirectional communication. Moreover, generated descriptions have been difficult to evaluate, and results have not been directly related to how well the model understands the image. Therefore, the task could be extended to bidirectional communication tasks such as the ReferIt or GuessWhat?! game, which need further consideration regarding evaluation and analysis.

2.2 ReferIt

ReferIt [Kazemzadeh et al., 2014; Lazaridou et al., 2016] is a two-player game of referring to an object in an image of natural scenes. One player is shown an image with a target object and has to describe the object so as to distinguish it from the others; what it says is called the referring expression. The other player is shown the same image and the referring expression written by the first player and guesses the target object. To succeed at this game, both agents should cooperate and learn the relation between vision and language. [Lazaridou et al., 2016] designed a ReferIt game between two agents, constituted the referring expression as a binary vector exchanged between the two agents, formulated the game as classification, and solved the classification problem with neural networks. On this task, the agents develop their own artificial language from the need to communicate in order to succeed at the game. The language showed some correlation with human language, but also some mismatches. If one player is an agent and the other is a human, the referring expression may not carry the exact meaning and may cause confusion between the players. For meaningful vision-language integration between human and agent, we argue that we need to analyze the details of these referring expressions beyond metrics such as the success rate of the game and accuracy.

2.3 GuessWhat?!

GuessWhat?! is a cooperative two-player guessing game proposed by [de Vries et al., 2017]. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. One player, called the Oracle, is randomly assigned an object in the image, and the other player, called the Questioner, does not know which object was assigned to the Oracle. The goal of the Questioner is to locate the hidden object with a series of Yes/No questions answered by the Oracle. If the Questioner selects the right object, we consider the game successful.

[de Vries et al., 2017] collected a large-scale human-played GuessWhat?! dataset consisting of 800K visual question-answering pairs on 66K images and proposed a baseline deep learning model. To solve the proposed task successfully, [de Vries et al., 2017] suggested that an agent requires higher-level image understanding, including spatial reasoning, visual properties, object taxonomy, and interaction. The authors also proposed that the agent should understand the relationships between objects and how they are expressed in natural language. The baseline model consists of three parts: the Oracle, the Guesser, and the Question generator. The Oracle is a simple neural network that is fed embedded inputs and classifies the answer as Yes, No, or N/A. The role of the Guesser is to predict the hidden object: it compares dot products of the embedded vectors of the image, the dialogue, and the information of the candidate objects, and selects the most probable object among the candidates. The Question generator produces questions reflecting the context of the previous question-answer pairs, based on the Hierarchical Recurrent Encoder-Decoder (HRED) model.

As follow-up research, [Strub et al., 2017] present an end-to-end reinforcement learning optimization of the question generation task to find the correct object efficiently. They define the GuessWhat?! game as a Markov Decision Process: a state x_t is the sequence of tokens generated in the dialogue until time t, and an action u_t selects a new word, with a zero-one reward depending on the Questioner's final choice. They train the question generator with policy gradient and obtain about a 17% improvement in accuracy compared with the baseline model.
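To make this MDP formulation concrete, here is a minimal policy-gradient sketch in the spirit of the above description. It is not the architecture of [Strub et al., 2017]: the bigram policy, the tiny vocabulary, and the stand-in reward are all illustrative assumptions of ours.

```python
import numpy as np

# Toy REINFORCE sketch of the MDP described above: a state is the
# dialogue so far, an action picks the next word, and a zero-one
# reward arrives at the end of the episode. Vocabulary and reward
# below are placeholders; word 0 doubles as a start token.
VOCAB = ["is", "it", "left", "right", "person", "red", "?", "<eos>"]
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(len(VOCAB), len(VOCAB)))  # bigram logits

def next_word_probs(prev):
    logits = theta[prev]
    p = np.exp(logits - logits.max())           # softmax policy pi(. | prev)
    return p / p.sum()

def sample_question(max_len=8):
    words, prev = [], 0
    for _ in range(max_len):
        prev = rng.choice(len(VOCAB), p=next_word_probs(prev))
        words.append(prev)
        if VOCAB[prev] == "<eos>":
            break
    return words

def game_reward(words):
    # Stand-in for the zero-one game reward: in the real game this is 1
    # only if the guesser picks the correct object after the dialogue.
    return 1.0 if "?" in (VOCAB[w] for w in words) else 0.0

LR = 0.1
for episode in range(300):                      # policy-gradient loop
    words = sample_question()
    r = game_reward(words)
    prev = 0
    for w in words:                             # REINFORCE update per action
        p = next_word_probs(prev)
        grad_log = -p
        grad_log[w] += 1.0                      # d log pi(w | prev) / d logits
        theta[prev] += LR * r * grad_log
        prev = w
```

In the actual system the policy is an RNN over the full dialogue and the reward comes from the Guesser's choice, but the update has this same REINFORCE shape.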
The GuessWhat?! game has so far been measured only by the success rate of the game. However, as shown in Section 3, a rule-based search system can attain not only state-of-the-art but also human performance. On this basis, we underline that it is not enough to measure bidirectional communication only by metrics such as the success rate of the game or accuracy, and we propose criteria for more meaningful evaluation in Section 4.

3 Rule-Based Search Systems

Table 1: Test accuracy on the GuessWhat?! dataset. Even though the image information is not used, the proposed method outperforms the state-of-the-art deep learning methods in two turns and exceeds human performance in four or five turns. We improved the system by tuning the first division of an image, utilizing statistics on the spatial information (denoted Fine-Tune). To explore the effect of segmentation, we stole a look at the segmentation information of the candidates (denoted Segment Info). The deep learning methods constructed the question generator with the hierarchical recurrent encoder-decoder (HRED) or a recurrent neural network (RNN) with reinforcement learning (RL).

Model                          | Accuracy
-------------------------------|---------
Baseline                       | 16.04
1 Question                     | 38.96
2 Questions                    | 56.25
3 Questions                    | 76.61
4 Questions                    | 85.85
5 Questions                    | 94.34
1 Question w/ Fine-Tune        | 39.82
2 Questions w/ Fine-Tune       | 59.40
1 Question w/ Segment Info     | 48.12
2 Questions w/ Segment Info    | 87.67
HRED [de Vries et al., 2017]   | 46.8
RNN w/ RL [Strub et al., 2017] | 52.3
Human [de Vries et al., 2017]  | 90.8
Human [Strub et al., 2017]     | 84.4

3.1 Methods

We constructed rule-based search systems that use only the spatial information of the target object for GuessWhat?!. We divide an image evenly into three parts by two vertical lines and then divide each part in turn by horizontal or vertical lines (Fig. 2). In the current region of interest divided by two vertical lines, the answer is "left" if the target is in the left-side region, "right" if it is in the right-side region, and "N/A" if it is in the middle region. Similarly, the answer is "top", "bottom", or "N/A" when a region is divided by horizontal lines. Through this protocol, or language, the players can ask and answer about the location of the center of the target object. The rule-based search systems use this simple language to locate the target object.

Figure 2: A sequence of divisions of an image by rule-based search systems.

We may also utilize statistics on the distribution of the spatial information. For the first turn, we may divide the image unevenly. We found that a middle band of width 0.18 covers 1/3 of the target objects while the remaining width of 0.82 covers the other 2/3, meaning the distribution of target objects is denser in the middle than an even division would suggest. Therefore, we set the first vertical lines at 0.41 and 0.59.
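The following is a minimal sketch of this protocol, assuming normalized image coordinates and an oracle that knows the center of the target object. The function names are ours, and the even three-way split is shown; the Fine-Tune variant would place the first vertical lines at 0.41 and 0.59 instead.

```python
def oracle_answer(target_x, target_y, region, axis):
    """Answer for the location of the target object's center."""
    x0, x1, y0, y1 = region
    if axis == "vertical":                      # region split by 2 vertical lines
        a, b = x0 + (x1 - x0) / 3, x0 + 2 * (x1 - x0) / 3
        return "left" if target_x < a else ("right" if target_x > b else "N/A")
    a, b = y0 + (y1 - y0) / 3, y0 + 2 * (y1 - y0) / 3
    return "top" if target_y < a else ("bottom" if target_y > b else "N/A")

def narrow(region, axis, answer):
    """Shrink the region of interest to the third named by the answer."""
    x0, x1, y0, y1 = region
    if axis == "vertical":
        w = (x1 - x0) / 3
        return {"left": (x0, x0 + w, y0, y1),
                "N/A": (x0 + w, x1 - w, y0, y1),
                "right": (x1 - w, x1, y0, y1)}[answer]
    h = (y1 - y0) / 3
    return {"top": (x0, x1, y0, y0 + h),
            "N/A": (x0, x1, y0 + h, y1 - h),
            "bottom": (x0, x1, y1 - h, y1)}[answer]

# One game: five alternating questions narrow the region of interest;
# the guesser then picks the candidate whose center lies inside it.
target_x, target_y = 0.7, 0.2                   # normalized target center
region = (0.0, 1.0, 0.0, 1.0)
for t in range(5):
    axis = "vertical" if t % 2 == 0 else "horizontal"
    region = narrow(region, axis, oracle_answer(target_x, target_y, region, axis))
print(region)
```

Each turn shrinks the region to one third of its size along one axis, so after five turns its area is at most (1/3)^5 of the image, which helps explain why a handful of questions suffices in Table 1.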
Given a segmentation model, we can further improve the system. To explore this case, we stole a look at the segmentation information of the candidates, which is supposed to be known by the time the guesser selects a candidate after a series of question-answer pairs. We can then implement a binary search based on the spatial information of the candidates (see the sketch at the end of this subsection). If a segmentation model gives segmentations similar to those of the candidates in the dataset, we obtain an algorithm close to binary search, which is optimal.

The proposed rule-based search systems do not break the rules of the GuessWhat?! game: spatial information is commonly used by humans, as the dataset shows. Moreover, we can substitute other features or properties for the spatial information. We can choose a real-valued feature without point mass, such as area or color, and the feature then yields another rule-based search algorithm. Such an algorithm is optimal (e.g., for the brightness of the center of the target object) or near-optimal (e.g., for the area, given an optimal segmentation model). We can also use real-valued features with point mass, or unordered (nominal or categorical) features, by choosing a feature that divides the candidates evenly, which gives a near-optimal algorithm.
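Below is a hedged sketch of the Segment Info idea, reduced to one dimension: assuming the candidates' center x-coordinates are known (e.g., from segmentation), each spatial yes/no question can halve the candidate set. The helper names and the toy oracle are ours.

```python
def binary_search_candidates(centers, oracle_is_left):
    """Halve the candidate set each turn with spatial yes/no questions.

    centers: candidate-object center x-coordinates, assumed known from
    segmentation. oracle_is_left(threshold) stands in for the oracle
    answering "is the target strictly left of <threshold>?".
    """
    candidates = sorted(centers)
    turns = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        if oracle_is_left(candidates[mid]):
            candidates = candidates[:mid]       # keep the left half
        else:
            candidates = candidates[mid:]       # keep the right half
        turns += 1
    return candidates[0], turns

# Eight candidates are resolved in log2(8) = 3 questions, consistent
# with the fast climb of the Segment Info rows in Table 1.
centers = [0.05, 0.12, 0.33, 0.41, 0.58, 0.66, 0.81, 0.93]
target = 0.58
obj, turns = binary_search_candidates(centers, lambda thr: target < thr)
print(obj, turns)   # -> 0.58 3
```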
3.2 Results

[de Vries et al., 2017] and [Strub et al., 2017] constructed the question generator with the hierarchical recurrent encoder-decoder (HRED) and a recurrent neural network (RNN) with reinforcement learning, respectively (Table 1). The oracle and the guesser were trained before interacting with the question generator; therefore, the oracle and the guesser do not benefit from interaction. In contrast, in the proposed rule-based search systems, the oracle and the questioner share a set of strict rules. The proposed systems outperformed the state-of-the-art accuracy in two turns and the human accuracy in four or five turns. This is a remarkable improvement in accuracy considering that [de Vries et al., 2017] and [Strub et al., 2017] generated 5 and 8 questions, respectively, to reach their accuracy, and that the minimum, mode, mean, and maximum of the number of questions asked by humans are 1, 3, 5.2, and 24, respectively. Furthermore, we improved the system by tuning the first division of an image, utilizing statistics on the spatial information (denoted Fine-Tune). To explore the effect of segmentation, we stole a look at the segmentation information of the candidates (denoted Segment Info).

4 More Criteria

The search system exploits low-level features and strictly follows a predefined set of rules without considering uncertainty or ambiguity. If either the oracle or the questioner is a human, the search system may not communicate with the human successfully. If the search system takes the part of the questioner, it may explain how it works at the beginning. If the search system plays the role of the oracle, it is unlikely to take the initiative, so it may not have a chance to show how it works. To develop more satisfying systems, we investigate the characteristics of an effective AI system for bidirectional communication with humans in vision-language tasks. We first present some considerations in designing such systems, then review criteria used in vision-language tasks and suggest criteria to be adopted in GuessWhat?!.

To develop effective AI systems for bidirectional communication with humans in vision-language tasks, we need to formulate a problem or task first. Many vision-language tasks have been proposed for this purpose. A task naturally comes with a main set of criteria, called objective functions and constraints in optimization. However, this main set may not be enough, and then we need to add more criteria. Several criteria have been used in vision-language tasks. We may group these criteria into subjective, task-specific, and similarity criteria.

Figure 3: The GuessWhat?! setting with a judge that complements or substitutes for human evaluation.

Subjective evaluation by humans is widely performed in many areas of AI research, including vision-language tasks. People observe or interact with systems and then evaluate how well, or how human-like, the systems behave. Subjective evaluation is not objective and so may not be considered scientific, but it is crucial because it tells how humans actually feel about the system. However, subjective evaluation is generally costly.

Task-specific criteria are essential to show how well the system performs on each task. In vision-language tasks, cross-modal classification and retrieval metrics have been used, including accuracy, median rank (mRank), and precision/recall at k (P/R@k) [Kent et al., 1955]. These criteria are measured on data collected before constructing the system, so they cost less than subjective criteria, which require additional human effort.

Similarity criteria evaluate how human-like systems behave. In language generation tasks such as machine translation and text summarization, language similarity metrics have been used, including bilingual evaluation understudy (BLEU) [Papineni et al., 2002], metric for evaluation of translation with explicit ordering (METEOR) [Banerjee and Lavie, 2005], recall-oriented understudy for gisting evaluation (ROUGE) [Lin, 2004], and consensus-based image description evaluation (CIDEr) [Vedantam et al., 2015]. These similarity criteria are computed automatically, so they cost less than subjective criteria.

Adversarial evaluation through neural networks [Bowman et al., 2015; Kannan and Vinyals, 2017; Li et al., 2017] has been suggested as a similarity criterion. Unlike the previous similarity criteria, adversarial evaluation is not fixed. Instead, it changes as a neural network called a discriminator learns whether the speaker is a human or not. After learning, the discriminator distinguishes the human from the agent on test data. Since the discriminator is a neural network, it may catch various complex patterns that other similarity metrics cannot capture. However, unlike the other similarity metrics, adversarial evaluation requires training the discriminator.
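As a hedged, minimal sketch of adversarial evaluation, the following trains a logistic-regression discriminator on bag-of-words features of utterances labeled human vs. agent; the toy data and featurization are our own illustration, not the setup of the cited works.

```python
import numpy as np

# Toy adversarial evaluation: a discriminator learns to tell
# human-written utterances from agent-generated ones.
VOCAB = {"is": 0, "it": 1, "the": 2, "left": 3, "red": 4, "one": 5, "?": 6}

def featurize(utterance):
    v = np.zeros(len(VOCAB))                    # bag-of-words counts
    for tok in utterance.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    return v

human = ["is it the red one ?", "is it the one on the left ?"]
agent = ["is it is it ?", "left left red red ?"]
X = np.array([featurize(u) for u in human + agent])
y = np.array([1.0] * len(human) + [0.0] * len(agent))   # 1 = human

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):                           # logistic-regression training
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # sigmoid
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# The discriminator's held-out accuracy is the criterion: the closer to
# chance (0.5), the more human-like the agent's utterances.
test = featurize("is it the red one ?")
print("P(human):", 1.0 / (1.0 + np.exp(-(test @ w + b))))
```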
Beyond accuracy, we need to adopt more criteria in GuessWhat?!. In general, the more criteria the better when evaluating vision-language tasks, partially because we do not have a unique criterion that everyone agrees on. However, resources are limited, so we have to choose some criteria among them. Accuracy is the obvious task-specific criterion, but people would not be satisfied with accuracy achieved by two AI players alone. We need to solve the following constrained optimization problem to develop systems that can communicate with humans in GuessWhat?!. Given an objective functional f (the accuracy in GuessWhat?!), a human oracle O_h, and a human questioner Q_h, the optimal oracle O* and questioner Q* are given by solving

    max_{O,Q} f(O, Q)
    s.t. O is compatible with Q_h                                        (1)
         Q is compatible with O_h

where, for a threshold t > 0,

    O is compatible with Q_h  if  f(O, Q_h) > (1 - t) · max_O f(O, Q_h)
    Q is compatible with O_h  if  f(O_h, Q) > (1 - t) · max_Q f(O_h, Q)  (2)

This is a joint optimization problem and is difficult to solve, because it involves two optimization functions, interaction with human oracles, and observation of human questioners. Instead, a two-phase greedy optimization has commonly been used in previous works. Like the joint problem, it involves the observation of human questioners, but it involves only one optimization function at each phase and interaction with an oracle model instead of human oracles:

    Ô = argmax_O f(O, Q_h)                                               (3)
    Q̂ = argmax_Q f(Ô, Q)                                                 (4)

However, Ô and Q̂ are only marginally optimal, and Q̂ may not be compatible with O_h. We may employ human oracles to determine whether Q̂ is compatible with O_h, even though this is costly. Under the assumption that a questioner Q similar to the human questioner Q_h is compatible with O_h, similarity criteria may complement or substitute for human evaluation. Criteria that distinguish the AI system from humans are necessary for this purpose. The adversarial metric is a promising criterion among them because it involves a neural network, which can learn complex patterns, as a discriminator or judge that determines whether the two players are human-like (Fig. 3).
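To make Eqs. (1)-(4) concrete, here is a small sketch of the two-phase greedy optimization and the compatibility check over an invented payoff table f[o][q]; the oracle/questioner names and toy accuracies are assumptions of ours for illustration.

```python
# Toy illustration of Eqs. (1)-(4): f[o][q] is the accuracy when oracle o
# plays with questioner q. "O_h"/"Q_h" stand for the human players.
f = {
    "O_rule":   {"Q_rule": 0.94, "Q_rl": 0.30, "Q_h": 0.20},
    "O_neural": {"Q_rule": 0.25, "Q_rl": 0.52, "Q_h": 0.45},
    "O_h":      {"Q_rule": 0.22, "Q_rl": 0.40, "Q_h": 0.84},
}

def compatible_with_human_oracle(q, t=0.2):
    """Eq. (2): f(O_h, q) must exceed (1 - t) times the best f(O_h, .)."""
    best = max(f["O_h"].values())
    return f["O_h"][q] > (1 - t) * best

# Two-phase greedy optimization (Eqs. (3) and (4)), searching over agents:
o_hat = max((o for o in f if o != "O_h"), key=lambda o: f[o]["Q_h"])
q_hat = max((q for q in f[o_hat] if q != "Q_h"), key=lambda q: f[o_hat][q])

print(o_hat, q_hat, compatible_with_human_oracle(q_hat))
# -> O_neural Q_rl False: the marginally optimal questioner Q-hat fails
#    the compatibility check against the human oracle O_h.
```

The toy numbers deliberately exhibit the failure mode noted above: Q̂ maximizes f with the model oracle Ô yet is not compatible with O_h.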
We reviewed criteria used in vision-language tasks and suggested criteria to be adopted in GuessWhat?!. We grouped the criteria into subjective, task-specific, and similarity criteria and gave some examples. In GuessWhat?!, we need more criteria than accuracy alone. If we can afford it, we can employ human evaluation. Otherwise, we may choose similarity criteria. We recommend adversarial evaluation as a promising similarity criterion.

5 Conclusion

We proposed rule-based search systems that use only the spatial information of the target object for the GuessWhat?! game. The protocol can be regarded as an artificial language between agents on the GuessWhat?! task. Our rule-based search system outperformed the state-of-the-art accuracy in two turns and the human accuracy in four or five turns. In view of the results, we argue that we need to measure the performance of a system and analyze the details of its results with more concrete criteria, not just task-specific metrics such as the success rate of the game and accuracy. We suggested the use of criteria for bidirectional communication between humans and agents in vision-language tasks. Adversarial evaluation can be considered a promising similarity criterion.

Acknowledgments

This work was supported by the Institute for Information & Communications Technology Promotion (2015-0-00310-SW.StarLab) and the Korea Evaluation Institute of Industrial Technology (10044009-HRI.MESSI, 10060086-RISF).

References

[Agrawal et al., 2017] Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi Parikh. C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243, 2017.

[Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72, 2005.

[Bowman et al., 2015] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[Das et al., 2016] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. arXiv preprint arXiv:1611.08669, 2016.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.

[de Vries et al., 2017] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[Fukui et al., 2016] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.

[Johnson et al., 2016a] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.

[Johnson et al., 2016b] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.

[Kannan and Vinyals, 2017] Anjuli Kannan and Oriol Vinyals. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198, 2017.

[Kazemzadeh et al., 2014] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.

[Kent et al., 1955] Allen Kent, Madeline M. Berry, Fred U. Luehrs, and James W. Perry. Machine literature searching VIII: Operational criteria for designing information retrieval systems. Journal of the Association for Information Science and Technology, 6(2):93–101, 1955.

[Kim et al., 2016] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual QA. In Advances in Neural Information Processing Systems, pages 361–369, 2016.

[Kiros et al., 2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Multimodal neural language models. In ICML, volume 14, pages 595–603, 2014.

[Lazaridou et al., 2016] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Towards multi-agent communication-based language learning. arXiv preprint arXiv:1605.07133, 2016.
[Li et al., 2017] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.

[Lin, 2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.

[Mao et al., 2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[Strub et al., 2017] Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423, 2017.

[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

[Vinyals et al., 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.