=Paper=
{{Paper
|id=Vol-2769/38
|storemode=property
|title=Grounded and Ungrounded Referring Expressions in Human Dialogues: Language Mirrors Different Grounding Conditions
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_38.pdf
|volume=Vol-2769
|authors=Eleonora Gualdoni,Raffaella Bernardi,Raquel Fernández,Sandro Pezzelle
|dblpUrl=https://dblp.org/rec/conf/clic-it/GualdoniBFP20
}}
==Grounded and Ungrounded Referring Expressions in Human Dialogues: Language Mirrors Different Grounding Conditions==
Eleonora Gualdoni, Raffaella Bernardi (University of Trento)
eleonora.gualdoni@studenti.unitn.it, raffaella.bernardi@unitn.it

Raquel Fernández, Sandro Pezzelle (University of Amsterdam)
raquel.fernandez@uva.nl, s.pezzelle@uva.nl
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We study how language use differs between dialogue partners in a visually grounded reference task when a referent is mutually identifiable by both interlocutors vs. when it is only available to one of them. In the latter case, the addressee needs to disconfirm a proposed description – a skill largely neglected by both the theoretical and the computational linguistics communities. We consider a number of linguistic features that we expect to vary across conditions. We then analyze their effectiveness in distinguishing between the two conditions by means of statistical tests and a feature-based classifier. Overall, we show that language mirrors different grounding conditions, paving the way to future deeper investigation of referential disconfirmation.

grounded condition
L: i have grapefruit with carrots and celery
F: yep me too might be a blood orange though really dark

non-grounded condition
L: what about a guy in a suit and black hat holding a blue plaid umbrella with more of them around him
F: i do not have that one

Figure 1: Examples of dialogue segments where the image referent is visible to both leader and follower (grounded condition) or only visible to the leader (non-grounded condition).

1 Introduction

Communication is a joint activity in which interlocutors share or synchronize aspects of their private mental states and act together in the world. To understand what our minds indeed do during communication, Brennan et al. (2010) highlight the need to study language in interpersonal coordination scenarios. When a conversation focuses on objects, interlocutors have to reach the mutual belief that the addressee has identified the discussed referent by means of visual grounding. In this frame, Clark and Wilkes-Gibbs (1986) have pointed to referring as a collaborative process that requires action and coordination by both speakers and addressees, and that needs to be studied with a collaborative model. Clark and Wilkes-Gibbs (1986), in fact, have highlighted that – in order to refer to an object in the world – speakers must believe that the referent is mutually identifiable to them and their addressees. This is an important skill that human speakers leverage to succeed in communication.

However, humans are not only able to identify an object described by the interlocutor – that is, to ground a referring expression – but also to understand that such an object is not in the scene and, therefore, cannot be grounded. It can happen, indeed, that a referent is not mutually identifiable by the speakers because they are in different grounding conditions. In this case, the addressee is able to disconfirm a description stated by the interlocutor by communicating that he/she does not see it (as in Figure 1). This is a crucial skill of human speakers, yet it is often neglected in the computational modelling of conversational agents.

We conjecture that the participants' visual grounding conditions have an impact on the linguistic form and structure of their utterances. If confirmed, our hypothesis would lead to the claim that mature AI dialogue systems should learn to
master their language with the flexibility shown by humans. In particular, their language use should differ depending on whether the referred object is mutually identifiable or not. It has been shown that current AI multimodal systems are not able to decide whether a visual question is answerable (Bhattacharya et al., 2019), and that they fail to identify whether the entity to which an expression refers is present in the visual scene (Shekhar et al., 2017b; Shekhar et al., 2017a). We believe models can acquire this skill if they learn to play the "language game" properly.

In this paper, we investigate how the language of human conversational partners changes when they are in a mutually grounded setting (they both see the image they are speaking about) or a non-mutually grounded one (one sees the image while the other does not).

We find that, indeed, there are statistically significant differences along various linguistic dimensions, including utterance length, parts of speech, and the degree of concreteness of the words used. Moreover, a simple SVM classifier based on these same features is shown to be able to distinguish between the two conditions with relatively high performance.

2 Dataset

We take the PhotoBook dataset (Haber et al., 2019) as our testbed: two participants play a game where each sees a different grid with six images showing everyday scenes.1 Some of the images are common to both players, while others are only displayed to one of them. In each grid, three of the images are highlighted. By chatting with their dialogue partner, each player needs to decide whether each of the three highlighted images is also visible to their partner.

A full game consists of five rounds, and the players can decide to move to the next round when they are confident about their decisions. As the game progresses, some images may reappear in subsequent rounds. The corpus is divided into dialogue segments: the consecutive utterances that, as a whole, discuss a given target image and include expressions referring to it. From the set of all segments in PhotoBook, we create our dataset by focusing on segments belonging to the first round of a game (since at that point all images are new to the participants) and where a single image is being discussed.2 This results in a dataset composed of 3,777 segments, each paired with a given image referent and an action label indicating whether the referent is visible to both participants or only to one. The annotated dataset, together with other relevant materials, is available at: https://dmg-photobook.github.io/

The PhotoBook task does not impose a specific role on the players, unlike, for example, the Map Task corpus (Anderson et al., 1991), where there are predefined information-giver and information-follower roles. In PhotoBook, the dialogues typically follow this scheme: one of the participants spontaneously decides to describe one of the images highlighted in their grid, and the other participant indicates whether they also have it in their own grid. We call the former player the leader and the latter the follower.3 We refer to situations where the follower also sees the image described by the leader as the grounded condition, and to those where the follower does not see the image as the non-grounded condition. Naturally, the leader always sees the referent image.

Out of the 3,777 dialogue segments in our dataset, 1,624 belong to the grounded condition and 2,153 to the non-grounded one.

1 The images used in the PhotoBook task are taken from the MS COCO 2014 Trainset (Lin et al., 2014).
2 We discard segments that refer to more than one image, as well as those labelled with the wrong image by the original heuristics (Haber et al., 2019).
3 We use simple heuristics to assign these roles a posteriori: when the image is not in common, we label as the follower the participant who does not see the image; when the image is visible to both participants, we consider the follower to be the player who produces the last utterance of the segment. We manually corrected the classification of the few segments that did not follow this general rule.
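To make this selection concrete, the sketch below shows one way the filtered dataset could be assembled in Python. The segment fields (round_nr, target_images, visible_to_both, utterances) are hypothetical placeholders, not the actual PhotoBook data format.

```python
# Hypothetical sketch of the segment-filtering step described above.
# Field names are illustrative placeholders, not the PhotoBook schema.

def build_dataset(segments):
    """Keep first-round segments that discuss a single image and label
    them by grounding condition."""
    dataset = []
    for seg in segments:
        if seg["round_nr"] != 1:             # first round only
            continue
        if len(seg["target_images"]) != 1:   # a single image discussed
            continue
        dataset.append({
            "image": seg["target_images"][0],
            "utterances": seg["utterances"],
            "condition": "grounded" if seg["visible_to_both"]
                         else "non-grounded",
        })
    return dataset
```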
3 Linguistic Features

We hypothesize that the language used by the dialogue participants will differ in the grounded vs. non-grounded condition. To test this hypothesis, we first identify several linguistic features that we expect to vary across conditions.

Length. We expect that the length of the utterances and of the overall dialogue segments may depend on the players' possibility to see the referent. For example, in the non-grounded condition, more utterances may be needed to conclude that the follower does not see the referent (thus leading to longer segments). Furthermore, not seeing the referred image could limit the expressivity of the utterances by the non-grounded follower (thus leading to shorter utterances).
We compute utterance length as the number of tokens per utterance, and segment length both as the number of tokens per segment and as the number of utterances per segment.
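As a minimal sketch (assuming NLTK's word_tokenize as the tokenizer, which the paper does not specify), these features can be computed as follows:

```python
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' data

def length_features(utterances):
    """Length features for one dialogue segment: average tokens per
    utterance, total tokens, and number of utterances."""
    token_counts = [len(word_tokenize(u)) for u in utterances]
    return {
        "tokens_per_utterance": sum(token_counts) / len(token_counts),
        "tokens_per_segment": sum(token_counts),
        "utterances_per_segment": len(utterances),
    }
```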
Word frequency. Frequency effects are key in psycholinguistics. Word frequency is one of the strongest predictors of processing efficiency (Monsell et al., 1989), and experiments have confirmed its link to memory performance (Yonelinas, 2002). It is plausible that different grounding conditions lead to different word choices, and that word frequency turns out to be a key aspect of this linguistic variation.

To estimate word frequency, we use off-the-shelf lemma frequency scores (frequency per million tokens) from the British National Corpus (Leech et al., 2014).4 For each segment in our dataset, we compute the average word frequency by first lemmatizing the words in the segment and then calculating the average frequency score over all lemma types in the segment.5

4 Available at http://ucrel.lancs.ac.uk/bncfreq/flists.html
5 Lemmas not present in the BNC lists are ignored.
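A sketch of this computation is given below; bnc_freq stands for a lemma-to-frequency dictionary built from the BNC lists, and lemmatize for any lemmatizer, both assumptions on our part.

```python
def average_frequency(tokens, lemmatize, bnc_freq):
    """Average BNC frequency (per million tokens) over the lemma types
    of a segment; lemmas missing from the BNC lists are ignored
    (footnote 5)."""
    lemma_types = {lemmatize(tok) for tok in tokens}  # types, not tokens
    scores = [bnc_freq[lem] for lem in lemma_types if lem in bnc_freq]
    return sum(scores) / len(scores) if scores else None
```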
Concreteness. Concreteness is fundamental to human language processing, since it helps to clearly convey information about the world (Hill and Korhonen, 2014). We use the concreteness scores by Brysbaert et al. (2014), which cover 40K English word lemmas and were collected via crowd-sourcing: participants were asked to rate word concreteness on a 5-point scale ranging from abstract to concrete. We compute the average word concreteness by first lemmatizing the words in the segment and then calculating the average score over all lemma types in the segment, without repetitions, broken down by part-of-speech (POS).6

6 Lemmas not present in the concreteness norms are ignored.
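The per-POS averaging can be sketched as follows, assuming conc_norms is a lemma-to-rating dictionary built from the Brysbaert et al. (2014) norms and tagged_lemmas pairs each lemma with its POS tag:

```python
from collections import defaultdict

def concreteness_by_pos(tagged_lemmas, conc_norms):
    """Average concreteness per POS over lemma types (no repetitions);
    lemmas missing from the norms are ignored (footnote 6)."""
    ratings = defaultdict(list)
    for lemma, pos in set(tagged_lemmas):  # drop repeated (lemma, POS)
        if lemma in conc_norms:
            ratings[pos].append(conc_norms[lemma])
    return {pos: sum(r) / len(r) for pos, r in ratings.items()}
```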
Parts of Speech distributions. Different POS differ in their function and descriptive power. We thus expect their distribution to vary between the grounded and non-grounded conditions. For example, we expect nouns and adjectives to be more likely in visually grounded referential acts, while determiners may signal whether the referent is in common ground or not (the vs. a) and give clues about the polarity of the context where they are used (any vs. each).

We extract POS distributions by first POS-tagging the utterances in the dataset7 and then computing the proportion of words per segment that are nouns, adjectives, verbs, or determiners, respectively. Given the different functions of different determiners, we break down this class and independently compute proportions for each of the following determiners: a/an, the, that, those, this, these, some, all, each, any, half, both.

7 We use the NLTK Python library (Bird et al., 2009) with its "universal" tagset.
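Using NLTK's universal tagset (footnote 7), a sketch of the feature extraction might look as follows; the feature names are our own.

```python
import nltk  # requires the 'averaged_perceptron_tagger' and tagset data

DETERMINERS = ["a", "an", "the", "that", "those", "this", "these",
               "some", "all", "each", "any", "half", "both"]

def pos_features(tokens):
    """Proportions of the POS classes of interest in one segment, plus
    one proportion per determiner of interest."""
    tagged = nltk.pos_tag(tokens, tagset="universal")
    n = len(tagged)
    feats = {pos.lower(): sum(tag == pos for _, tag in tagged) / n
             for pos in ("NOUN", "ADJ", "VERB", "DET")}
    words = [w.lower() for w, _ in tagged]
    for det in DETERMINERS:
        feats["det_" + det] = words.count(det) / n
    return feats
```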
4 Statistical Analysis

To test our hypothesis that the language used by the participants differs in the grounded vs. non-grounded condition, we perform a statistical analysis on our data. We compare: (1) the utterances by the leaders in the grounded and non-grounded conditions, and (2) the utterances by the followers in the grounded and non-grounded conditions. We evaluate the statistical significance of these comparisons with a Mann-Whitney U test, which does not assume that the data fits any specific distribution. Below we report the results of each of these comparisons. Unless otherwise specified, statistical significance is tested at p < 0.001.
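In SciPy, such a comparison can be sketched as follows, assuming a per-segment list of feature values for each condition:

```python
from scipy.stats import mannwhitneyu

def compare_conditions(grounded_values, non_grounded_values):
    """Two-sided Mann-Whitney U test between the two conditions for a
    single feature (e.g., utterances per segment)."""
    return mannwhitneyu(grounded_values, non_grounded_values,
                        alternative="two-sided")
```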
Length. Followers use significantly fewer words, while leaders use significantly more words, in the non-grounded condition than in the grounded condition. This trend is also illustrated in the example in Figure 1. Although followers use fewer words in the non-grounded condition, they produce a significantly higher number of utterances per segment, while no reliable differences are observed for the leaders (see Figure 2a and 2e, respectively). These findings indicate that establishing that a referring expression cannot be commonly grounded requires more evidence and more information than resolving the expression.

Frequency. Followers use significantly more high-frequency words in the grounded condition than in the non-grounded condition, in particular for nouns and conjunctions. This is consistent with the production of more utterances per segment in the non-grounded condition reported above, and suggests that the non-grounded follower uses those utterances to talk about fine-grained details described by low-frequency words. In contrast, high-frequency verbs are reliably more common in the non-grounded condition (see Figure 2b).

[Figure 2 appears here: eight bar plots. Followers (top row): Length in utterances, SR = ***; Frequency: verbs, SR = ***; Concreteness: adjectives, SR = ***; Verbs, SR = ***. Leaders (bottom row): Length in utterances, SR = NR; Frequency: verbs, SR = NR; Concreteness: adjectives, SR = *; Verbs, SR = *.]

Figure 2: From left to right, difference between the grounded and non-grounded conditions for: (a/e) number of utterances per segment; (b/f) frequency of used verbs; (c/g) concreteness of used adjectives; (d/h) proportion of verbs. Top: followers; bottom: leaders. We use *** to refer to statistical significance at p < 0.001; ** for p < 0.01; * for p < 0.05; . for p < 0.1. Best viewed in color.
For example, note the high-frequency verbs do and have used by the non-grounded follower in Figure 1. The language of leaders, in contrast, shows marginally reliable or no differences across conditions regarding word frequency (see, e.g., the case of verbs in Figure 2f), except for high-frequency nouns and conjunctions, which are reliably more common in the grounded condition (p < 0.01).

Concreteness. Somewhat counterintuitively, followers use overall significantly more concrete words in the non-grounded than in the grounded condition. However, the opposite pattern is found for adjectives, which usually describe the colors of the objects in the scene (see Figure 2c). This latter result is in line with our intuitions: in the non-grounded condition, followers do not have direct access to the specific perceptual properties of the entities in the image and hence use less concrete adjectives. As for the leaders, while nouns are reliably different, for the other POS there is either no or only a marginally reliable difference between the two conditions (see adjectives in Figure 2g, as well as adverbs, conjunctions, and numerals). This is expected, since their language is always visually grounded.

Parts of speech. Followers use significantly more nouns and the determiners a/an, the, each in the grounded condition, while in the non-grounded condition they use significantly more verbs (see Figure 2d) and the determiners all and any. That is, the grounded condition leads followers to more directly describe what they see by focusing on a specific object, as in the grounded example in Figure 1. In contrast, the non-grounded condition elicits utterances with more 'confirmation' verbs such as do and have, and a vaguer language signalled by the use of quantifiers, e.g., "I don't have any of a cake". As for the leaders, we observe a mixed pattern of results, though, overall, there are fewer reliable differences between the two conditions compared to the followers (see the case of verbs in Figure 2h).
5 Automatic Classification

To more formally investigate the effectiveness of our selected features in distinguishing between the two grounding conditions, we feed them into an SVM classifier that predicts the condition of each segment: grounded (GFC) or non-grounded (NGFC). We run two SVM models: one for leaders, SVM leaders, and one for followers, SVM followers.8 Our hypothesis is that SVM leaders should not be very effective in the binary classification task, since the language of the leaders differs only in a few aspects, and less reliably, between the two conditions compared to the followers'. In contrast, we expect SVM followers to achieve a good performance in the task, given the significant differences observed between the two conditions.

Starting from all our linguistic features (see above), we excluded those that turned out to be multicollinear according to a Variance Inflation Factor (VIF) test.9 The resulting N features (27 for the leaders, 28 for the followers) were used to build, for each datapoint, an N-dimensional feature vector that was fed into the classifier. We performed 10-fold cross-validation on the entire dataset.

8 We use the scikit-learn Python library (Pedregosa et al., 2011) for C-Support Vector Classification, with the default Radial Basis Function (RBF) kernel. Setting the parameter C to 100 gives the best results.
9 The VIF test indicates whether there is a strong linear association between a predictor and the others (Pituch and Stevens, 2016). When the VIF index exceeded 10, we performed variable deletion (Myers, 1990).
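The pipeline can be sketched as follows with scikit-learn and statsmodels. The iterative highest-VIF deletion is one common variant; the paper states only that features with VIF above 10 were deleted (footnote 9), while the SVM settings follow footnote 8.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_multicollinear(X, threshold=10.0):
    """Iteratively drop the feature with the highest VIF until all
    remaining VIFs are below the threshold; returns kept column indices."""
    keep = list(range(X.shape[1]))
    while True:
        vifs = [variance_inflation_factor(X[:, keep], i)
                for i in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            return keep
        del keep[worst]

def evaluate(X, y):
    """10-fold CV accuracy of an RBF-kernel SVM with C=100 (footnote 8)."""
    clf = SVC(kernel="rbf", C=100)
    return cross_val_score(clf, X[:, drop_multicollinear(X)], y, cv=10).mean()
```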
Table 1 reports the accuracy, precision, recall, and F1-score of the two SVM models. While SVM leaders is at chance level, SVM followers achieves a fairly high performance in the binary classification task. This indicates that our linguistic features are effective in distinguishing between the two conditions in the followers' segments. These results confirm that the language of speakers in the follower role is affected by their grounding condition, and that a well-informed model is able to capture this by means of the linguistic features of their language.

Table 2 reports the confusion matrices produced by our SVM models after 10-fold cross-validation. We can notice that SVM leaders wrongly labels GFC datapoints as NGFC in 1,381 cases, thus misclassifying a large share of the grounded segments. This does not happen with SVM followers, which is overall more accurate.
               Accuracy    Precision             Recall                F1-score
                           GFC   NGFC   Av.      GFC   NGFC   Av.      GFC   NGFC   Av.
SVM leaders    0.57        0.15  0.89   0.40     0.50  0.58   0.55     0.23  0.70   0.50
SVM followers  0.80        0.77  0.79   0.78     0.73  0.82   0.78     0.75  0.80   0.78

Table 1: Accuracy, Precision, Recall, and F1-score of our SVM models, computed per class on a 10-fold cross-validation, with the corresponding weighted averages (Av.). Since our two classes (GFC and NGFC) are not balanced, chance level is 0.57.
         SVM leaders       SVM followers
         GFC     NGFC      GFC      NGFC
GFC      243     1,381     1,245    379
NGFC     242     1,911     461      1,692

Table 2: The confusion matrices produced by our SVM models on a 10-fold cross-validation. Rows: true class; columns: predicted class.
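Below is a sketch of how such tables can be produced with scikit-learn, pooling out-of-fold predictions from the 10-fold cross-validation; this is one plausible way to derive them, as the paper does not detail this step.

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def report(X, y):
    """Per-class precision/recall/F1 (cf. Table 1) and the confusion
    matrix (cf. Table 2) from pooled 10-fold CV predictions."""
    y_pred = cross_val_predict(SVC(kernel="rbf", C=100), X, y, cv=10)
    print(classification_report(y, y_pred, digits=2))
    # scikit-learn convention: rows are true classes, columns predicted
    print(confusion_matrix(y, y_pred))
```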
6 Related Work

Current multimodal systems are trained to process and relate modalities by capturing correspondences between "sensory" information (Baltrusaitis et al., 2017). It has been shown that they have trouble deciding whether a question is answerable (Bhattacharya et al., 2019). Moreover, they fail to identify whether the entity to which an expression refers is present in the visual scene (Shekhar et al., 2017b; Shekhar et al., 2017a). Connected to this weakness is the limitation they encounter when put to work as dialogue systems, where they fail to build common ground from minimally-shared information (Udagawa and Aizawa, 2019). To be successful in communication, speakers are supposed to attribute mental states to their interlocutors even when these are different from their own (Rabinowitz et al., 2018; Chandrasekaran et al., 2017). In multimodal situations, this can happen when the visual scene is only partially shared between them. AI models have difficulties in such conditions (Udagawa and Aizawa, 2019).

We study how the language of conversational partners changes when (i) speakers refer to an image their interlocutor does not see and (ii) neither of the two is aware of this unshared visual ground. Though the idea that the grounding conditions of addressees can affect their interlocutor's language is not new in psycholinguistics (Brennan et al., 2010; Brown and Dell, 1987; Lockridge and Brennan, 2002; Bard and Aylett, 2000), our approach differs from previous ones in that it proposes a computational analysis of visual dialogues. Moreover, differently from other computational approaches (Bhattacharya et al., 2019; Gurari et al., 2018), we investigate scenarios where the disconfirmation of a referent's presence is the answer, rather than a case of unanswerability.

7 Conclusion

Our findings confirm that, in a visually grounded dialogue, speakers employ different linguistic strategies depending on their grounding conditions. Our statistical analyses reliably indicate that several aspects of the language used in the conversation mirror whether the referred image is – or is not – mutually shared by the interlocutors. Moreover, the effectiveness of a simple feature-based classifier in distinguishing between the two followers' conditions further indicates that the language used by the speakers differs along several dimensions. We believe this capability of humans to flexibly tune their language underpins their success in communication. We suggest that efforts should be put into developing conversational AI systems that are capable of mastering language with a similar flexibility. This could be achieved, for example, by exposing models to one or the other condition during training, to encourage them to encode the relevant linguistic features. Alternatively, models should first understand whether the grounded information that is referred to is available to them or not. These are open challenges that we plan to tackle in future work.

Acknowledgments

EG carried out part of the work while being an ERASMUS+ visiting student at the University of Amsterdam. SP and RF are funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819455 awarded to RF).
References

Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC Map Task corpus. Language and Speech, 34(4):351–366.

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal machine learning: A survey and taxonomy. CoRR, abs/1705.09406.

Ellen G. Bard and M. P. Aylett. 2000. Accessibility, duration, and modeling the listener in spoken dialogue. In Proceedings of the Götalog 2000 Fourth Workshop on the Semantics and Pragmatics of Dialogue.

Nilavra Bhattacharya, Qing Li, and Danna Gurari. 2019. Why does a visual question have different answers? In Proceedings of the IEEE International Conference on Computer Vision, pages 4271–4280.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Susan E. Brennan, Alexia Galati, and Anna K. Kuhlen. 2010. Two minds, one dialog: Coordinating speaking and understanding. In Psychology of Learning and Motivation, volume 53, pages 301–344. Elsevier.

Paula M. Brown and Gary S. Dell. 1987. Adapting production to comprehension: The explicit mention of instruments. Cognitive Psychology, 19(4):441–472.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, and Devi Parikh. 2017. It takes two to tango: Towards theory of AI's mind. CoRR, abs/1704.00717.

Herbert H. Clark and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition, 22:1–39.

Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, and Raquel Fernández. 2019. The PhotoBook dataset: Building common ground through visually-grounded dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1895–1910.

Felix Hill and Anna Korhonen. 2014. Concreteness and subjectivity as dimensions of lexical meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 725–731.

Geoffrey Leech, Paul Rayson, et al. 2014. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Routledge.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Calion Lockridge and Susan Brennan. 2002. Addressees' needs influence speakers' early syntactic choices. Psychonomic Bulletin & Review, 9:550–557.

Stephen Monsell, Michael C. Doyle, and Patrick N. Haggard. 1989. Effects of frequency on visual word recognition tasks: Where are they? Journal of Experimental Psychology: General, 118(1):43.

Raymond H. Myers. 1990. Classical and Modern Regression with Applications. Duxbury, Boston, MA, 2nd edition.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Keenan A. Pituch and James P. Stevens. 2016. Applied Multivariate Statistics for the Social Sciences. Routledge, 6th edition.

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. 2018. Machine theory of mind. In International Conference on Machine Learning, pages 4218–4227.

Ravi Shekhar, Sandro Pezzelle, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017a. Vision and language integration: Moving beyond objects. In IWCS 2017 – 12th International Conference on Computational Semantics – Short Papers.

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017b. FOIL it! Find one mismatch between image and language caption. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 255–265.

Takuma Udagawa and Akiko Aizawa. 2019. A natural language corpus of common grounding under continuous and partially-observable context. CoRR, abs/1907.03399.

Andrew P. Yonelinas. 2002. The nature of recollection and familiarity: A review of 30 years of research. Journal of Memory and Language, 46(3):441–517.