CEUR-WS Vol-2769, paper 38. PDF: https://ceur-ws.org/Vol-2769/paper_38.pdf. DBLP: https://dblp.org/rec/conf/clic-it/GualdoniBFP20
 Grounded and Ungrounded Referring Expressions in Human Dialogues:
          Language Mirrors Different Grounding Conditions

       Eleonora Gualdoni, Raffaella Bernardi         Raquel Fernández, Sandro Pezzelle
                University of Trento             University of Amsterdam
         eleonora.gualdoni@studenti.unitn.it                           raquel.fernandez@uva.nl
               raffaella.bernardi@unitn.it                                 s.pezzelle@uva.nl



                        Abstract

    We study how language use differs between dialogue partners in a visually grounded reference task when a referent is mutually identifiable by both interlocutors vs. when it is only available to one of them. In the latter case, the addressee needs to disconfirm a proposed description – a skill largely neglected by both the theoretical and the computational linguistics communities. We consider a number of linguistic features that we expect to vary across conditions. We then analyze their effectiveness in distinguishing between the two conditions by means of statistical tests and a feature-based classifier. Overall, we show that language mirrors different grounding conditions, paving the way to a deeper future investigation of referential disconfirmation.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                 grounded condition
    L: i have grapefruit with carrots and celery
    F: yep me too might be a blood orange though really dark

                 non-grounded condition
    L: what about a guy in a suit and black hat holding a blue plaid umbrella with more of them around him
    F: i do not have that one

Figure 1: Examples of dialogue segments where the image referent is visible to both leader and follower (grounded condition) or only visible to the leader (non-grounded condition).

1   Introduction

Communication is a joint activity in which interlocutors share or synchronize aspects of their private mental states and act together in the world. To understand what our minds actually do during communication, Brennan et al. (2010) highlight the need to study language in interpersonal coordination scenarios. When a conversation focuses on objects, interlocutors have to reach the mutual belief that the addressee has identified the discussed referent by means of visual grounding. In this frame, Clark and Wilkes-Gibbs (1986) have pointed to referring as a collaborative process that requires action and coordination by both speakers and their interlocutors, and that needs to be studied with a collaborative model. Clark and Wilkes-Gibbs (1986), in fact, have highlighted that – in order to refer to an object in the world – speakers must believe that the referent is mutually identifiable to them and their addressees. This is an important skill that human speakers leverage to succeed in communication.

   However, humans are not only able to identify an object described by the interlocutor – that is, to ground a referring expression – but also to understand that such an object is not in the scene and, therefore, that it cannot be grounded. It can happen, indeed, that a referent is not mutually identifiable by the speakers because they are in different grounding conditions. In this case, the addressee is able to disconfirm a description stated by the interlocutor by communicating that he/she does not see it (as in Figure 1). This is a crucial skill of human speakers. However, it is often neglected in the computational modelling of conversational agents.

   We conjecture that the participants' visual grounding conditions have an impact on the linguistic form and structure of their utterances. If confirmed, our hypothesis would lead to the claim that mature AI dialogue systems should learn to
master their language with the flexibility shown by humans. In particular, their language use should differ when the referred object is mutually identifiable or not. It has been shown that current AI multimodal systems are not able to decide whether a visual question is answerable (Bhattacharya et al., 2019), and that they fail to identify whether the entity to which an expression refers is present in the visual scene (Shekhar et al., 2017b; Shekhar et al., 2017a). We believe models can acquire this skill if they learn to play the "language game" properly.

   In this paper, we investigate how the language of human conversational partners changes when they are in a mutually grounded setting (they both see the image they are speaking about) or a non-mutually grounded setting (one sees the image while the other does not).

   We find that, indeed, there are statistically significant differences along various linguistic dimensions, including utterance length, parts of speech, and the degree of concreteness of the words used. Moreover, a simple SVM classifier based on these same features is shown to be able to distinguish between the two conditions with relatively high performance.

2   Dataset

We take the PhotoBook dataset (Haber et al., 2019) as our testbed: two participants play a game where each sees a different grid with six images showing everyday scenes.¹ Some of the images are common to both players, while others are only displayed to one of them. In each grid, three of the images are highlighted. By chatting with their dialogue partner, each player needs to decide whether each of the three highlighted images is also visible to their partner or not.

   A full game consists of five rounds, and the players can decide to move to the next round when they are confident about their decisions. As the game progresses, some images may reappear in subsequent rounds. The corpus is divided into dialogue segments: the consecutive utterances that, as a whole, discuss a given target image and include expressions referring to it. From the set of all segments in PhotoBook, we create our dataset by focusing on segments belonging to the first round of a game (since at that point all images are new to the participants) and where a single image is being discussed.² This results in a dataset composed of 3,777 segments, each paired with a given image referent and an action label indicating whether the referent is visible to both participants or only to one. The annotated dataset, together with other relevant materials, is available at: https://dmg-photobook.github.io/

   The PhotoBook task does not impose a specific role on the players, unlike for example the Map Task corpus (Anderson et al., 1991), where there are predefined information giver and information follower roles. In PhotoBook, the dialogues typically follow this scheme: one of the participants spontaneously decides to describe one of the images highlighted in their grid, and the other participant indicates whether they also have it in their own grid or not. We call the former player the leader and the latter the follower.³ We refer to situations where the follower also sees the image described by the leader as the grounded condition, and to those where the follower does not see the image as the non-grounded condition. Naturally, the leader always sees the referent image.

   Out of the 3,777 dialogue segments in our dataset, 1,624 belong to the grounded condition and 2,153 to the non-grounded one.

3   Linguistic Features

We hypothesize that the language used by the dialogue participants will differ in the grounded vs. non-grounded condition. To test this hypothesis, we first identify several linguistic features that we expect to vary across conditions.

Length. We expect that the length of the utterances and of the overall dialogue segments may depend on the players' possibility to see the referent. For example, in the non-grounded condition more utterances may be needed to conclude that the follower does not see the referent (thus leading to longer segments). Furthermore, not seeing the referred image could limit the expressivity of the utterances by the non-grounded follower (thus leading to shorter utterances).

   ¹ The images used in the PhotoBook task are taken from the MS COCO 2014 Trainset (Lin et al., 2014).
   ² We discard segments that refer to more than one image as well as those labelled with the wrong image by the original heuristics (Haber et al., 2019).
   ³ We use simple heuristics to assign these roles a posteriori: when the image is not in common, we label as the follower the participant who does not see the image, while when the image is visible to both participants we consider the follower the player who produces the last utterance of the segment. We manually corrected the classification of the few segments that did not follow this general rule.
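As an illustrative sketch of the three length features just described (not the authors' code: the list-of-strings segment format and the whitespace tokenization are assumptions; the paper does not specify its tokenizer):

```python
def length_features(segment):
    """Length features for one dialogue segment.

    `segment` is assumed to be a list of utterance strings. We use
    whitespace tokenization for simplicity.
    Returns (avg. tokens per utterance, tokens per segment,
    utterances per segment).
    """
    tokens_per_utterance = [len(utterance.split()) for utterance in segment]
    tokens_per_segment = sum(tokens_per_utterance)
    utterances_per_segment = len(segment)
    return (tokens_per_segment / utterances_per_segment,
            tokens_per_segment,
            utterances_per_segment)

# Example: a two-utterance segment (7 + 6 tokens)
print(length_features(["what about a guy in a suit",
                       "i do not have that one"]))
# (6.5, 13, 2)
```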
   We compute utterance length as the number of tokens per utterance, and segment length both as the number of tokens per segment and as the number of utterances per segment.

Word frequency. Frequency effects are key in psycholinguistics. Word frequency is one of the strongest predictors of processing efficiency (Monsell et al., 1989), and experiments have confirmed its link to memory performance (Yonelinas, 2002). It is plausible that different grounding conditions lead to different word choices, and that word frequency turns out to be a key aspect of this linguistic variation.

   To estimate word frequency, we use off-the-shelf lemma frequency scores (frequency per million tokens) from the British National Corpus (Leech et al., 2014).⁴ For each segment in our dataset, we compute the average word frequency by first lemmatizing the words in the segment and then calculating the average frequency score over all lemma types in the segment.⁵

Concreteness. Concreteness is fundamental to human language processing, since it helps to clearly convey information about the world (Hill and Korhonen, 2014). We use the concreteness scores by Brysbaert et al. (2014), corresponding to 40K English word lemmas and collected via crowd-sourcing, where participants were requested to evaluate word concreteness on a 5-point rating scale ranging from abstract to concrete. We compute the average word concreteness by first lemmatizing the words in the segment and then calculating the average score over all lemma types in the segment without repetitions, divided by part of speech (POS).⁶

Parts-of-speech distributions. Different POS differ in their function and descriptive power. We thus expect that their distribution will vary between the grounded and non-grounded conditions. For example, we expect nouns and adjectives to be more likely in visually grounded referential acts, while determiners may signal whether the referent is in common ground or not (the vs. a) and give clues about the polarity of the context where they are used (any vs. each).

   We extract POS distributions by first POS-tagging the utterances in the dataset⁷ and then computing the proportion of words per segment that are nouns, adjectives, verbs, or determiners, respectively. Given the different functions of different determiners, we break down this class and independently compute proportions for each of the following determiners: a/an, the, that, those, this, these, some, all, each, any, half, both.

4   Statistical Analysis

To test our hypothesis that the language used by the participants differs in the grounded vs. non-grounded condition, we perform a statistical analysis on our data. We compare: (1) the utterances by the leaders in the grounded and non-grounded conditions, and (2) the utterances by the followers in the grounded and non-grounded conditions. We evaluate the statistical significance of these comparisons with a Mann-Whitney U test, which does not assume that the data fit any specific distribution. Below we report the results of each of these comparisons. Unless otherwise specified, statistical significance is tested at p < 0.001.

Length. Followers use significantly fewer words, while leaders use significantly more words, in the non-grounded condition than in the grounded condition. This trend is also illustrated in the example in Figure 1. Although followers use fewer words in the non-grounded condition, they produce a significantly higher number of utterances per segment, while no reliable differences are observed for the leaders (see Figures 2a and 2e, respectively). These findings indicate that establishing that a referring expression cannot be commonly grounded requires more evidence and more information than resolving the expression.

Frequency. Followers use significantly more high-frequency words in the grounded condition than in the non-grounded condition, in particular for nouns and conjunctions. This is consistent with the reported production of more utterances per segment in the non-grounded condition, and suggests that the non-grounded follower uses them to talk about fine-grained details described by low-frequency words. In contrast, high-frequency verbs are reliably more common in the non-grounded condition (see Figure 2b).

   ⁴ Available at http://ucrel.lancs.ac.uk/bncfreq/flists.html
   ⁵ Lemmas not present in the BNC lists are ignored.
   ⁶ Lemmas not present in the corpus are ignored.
   ⁷ We use the NLTK Python library (Bird et al., 2009) in its "universal" tagset version.
[Figure 2 plots. Panel titles and significance ratings: (a) Followers, length in utterances, SR = ***; (b) Followers, frequency of verbs, SR = ***; (c) Followers, concreteness of adjectives, SR = ***; (d) Followers, proportion of verbs, SR = ***; (e) Leaders, length in utterances, SR = NR; (f) Leaders, frequency of verbs, SR = NR; (g) Leaders, concreteness of adjectives, SR = *; (h) Leaders, proportion of verbs, SR = *. Y-axes: average length; average word frequency (per million tokens); average concreteness; average verbs / utterance length.]

Figure 2: From left to right, difference between grounded and non-grounded condition for: (a/e) number
of utterances per segment; (b/f) frequency of used verbs; (c/g) concreteness of used adjectives; (d/h)
proportion of verbs. Top: followers; bottom: leaders. We use *** to refer to statistical significance at
p < 0.001; ** for p < 0.01; * for p < 0.05; . for p < 0.1. Best viewed in color.
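The POS-proportion features of Section 3 could be sketched as follows (illustrative only: the example segment is pre-tagged by hand to stay self-contained, whereas the paper tags utterances with NLTK's universal tagset; the dictionary layout is an assumption):

```python
from collections import Counter

# A segment assumed already POS-tagged with the universal tagset,
# e.g. via nltk.pos_tag(tokens, tagset="universal").
tagged_segment = [
    ("i", "PRON"), ("have", "VERB"), ("grapefruit", "NOUN"),
    ("with", "ADP"), ("carrots", "NOUN"), ("and", "CONJ"),
    ("celery", "NOUN"),
]

# The determiners tracked individually in the paper (a/an split in two).
TRACKED_DETERMINERS = ["a", "an", "the", "that", "those", "this",
                       "these", "some", "all", "each", "any", "half", "both"]

def pos_proportions(tagged):
    """Proportions of nouns, adjectives, verbs, and determiners per
    segment, plus a per-word breakdown of the tracked determiners."""
    n = len(tagged)
    tag_counts = Counter(tag for _, tag in tagged)
    props = {tag: tag_counts[tag] / n for tag in ("NOUN", "ADJ", "VERB", "DET")}
    det_counts = Counter(w.lower() for w, tag in tagged if tag == "DET")
    props.update({f"det:{d}": det_counts[d] / n for d in TRACKED_DETERMINERS})
    return props

props = pos_proportions(tagged_segment)
print(props["NOUN"])  # 3 nouns out of 7 tokens
```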


   For example, note the high-frequency verbs do and have used by the non-grounded follower in Figure 1. The language of leaders, in contrast, shows marginally reliable or no difference across conditions regarding word frequency (see, e.g., the case of verbs in Figure 2f), except for high-frequency nouns and conjunctions, which are reliably more common in the grounded condition (p < 0.01).

Concreteness. Somewhat counterintuitively, followers use overall significantly more concrete words in the non-grounded than in the grounded condition. However, an opposite pattern is found for adjectives, which usually describe the colors of the objects in the scene (see Figure 2c). This latter result is in line with our intuitions: in the non-grounded condition, followers do not have direct access to the specific perceptual properties of the entities in the image and hence use less concrete adjectives. As for the leaders, while nouns are reliably different, for the other POS there is either no or only a marginally reliable difference between the two conditions (see adjectives in Figure 2g, as well as adverbs, conjunctions, and numerals). This is expected, since their language is always visually grounded.

Parts of speech. Followers use significantly more nouns and the determiners a/an, the, each in the grounded condition, while in the non-grounded condition they use significantly more verbs (see Figure 2d) and the determiners all and any. That is, the grounded condition leads followers to more directly describe what they see by focusing on a specific object, as in the grounded example in Figure 1. In contrast, the non-grounded condition elicits utterances with more 'confirmation' verbs such as do and have, and a vaguer language signalled by the use of quantifiers, e.g., "I don't have any of a cake". As for the leaders, we observe a mixed pattern of results, though, overall, there are fewer reliable differences between the two
conditions compared to the followers (see the case of verbs in Figure 2h).

5   Automatic Classification

To more formally investigate the effectiveness of our selected features in distinguishing between the two grounding conditions, we feed them into an SVM classifier which predicts GFC or NGFC. We run two SVM models: one for leaders, SVM leaders, and one for followers, SVM followers.⁸ Our hypothesis is that SVM leaders should not be very effective in the binary classification task, since the language of the leaders differs only on a few aspects, and less reliably between the two conditions compared to the followers'. In contrast, we expect SVM followers to achieve a good performance in the task, given the significant differences observed between the two conditions.

   Starting from all our linguistic features (see above), we excluded those that turned out to be multicollinear in a Variance Inflation Factor (VIF) test.⁹ The resulting N features (27 for the leaders, 28 for the followers) were used to build, for each datapoint, an N-dimensional vector of features that was fed into the classifier. We performed 10-fold cross-validation on the entire dataset.

   Table 1 reports the accuracy, precision, recall and F1-score of the two SVM models. While SVM leaders is at chance level, SVM followers achieves a fairly high performance in the binary classification task. This indicates that our linguistic features are effective in distinguishing between the two conditions in the followers' segments. These results confirm that the language of the speakers in the follower role is affected by their grounding con-

6   Related Work

Current multimodal systems are trained to process and relate modalities by capturing correspondences between "sensory" information (Baltrusaitis et al., 2017). It has been shown that they have trouble deciding whether a question is answerable (Bhattacharya et al., 2019). Moreover, they fail to identify whether the entity to which an expression refers is present in the visual scene (Shekhar et al., 2017b; Shekhar et al., 2017a). Connected to this weakness is the limitation they encounter when put to work as dialogue systems, where they fail to build common ground from minimally-shared information (Udagawa and Aizawa, 2019). To be successful in communication, speakers are supposed to attribute mental states to their interlocutors even when they are different from their own (Rabinowitz et al., 2018; Chandrasekaran et al., 2017). This, in multimodal situations, can happen when the visual scene is only partially common between them. AI models have difficulties in such conditions (Udagawa and Aizawa, 2019).

   We study how the language of conversational partners changes when (i) speakers refer to an image their interlocutor does not see and (ii) neither of the two is aware of this unshared visual ground. Though the idea that the grounding conditions of the addressees can affect their interlocutor's language is not new in psycholinguistics (Brennan et al., 2010; Brown and Dell, 1987; Lockridge and Brennan, 2002; Bard and Aylett, 2000), our approach differs from previous ones since it proposes a computational analysis of visual dialogues. Moreover, differently from other computational approaches (Bhattacharya et
dition, and that a well-informed model is able to               al., 2019; Gurari et al., 2018), we investigate sce-
capture that by means of their language’s linguis-              narios where the disconfirmation of a referent’s
tic features.                                                   presence is the answer instead of suggesting a case
   Table 2 reports the confusion matrices produced              of unanswerability.
by our SVM models after 10-fold cross-validation.
We can notice that SVM leaders wrongly labels                   7   Conclusion
NGFC datapoints as GFC in 1,381 cases, thus pro-
                                                                Our findings confirm that, in a visually-grounded
ducing a high number of false positives. This does
                                                                dialogue, different linguistic strategies are em-
not happen with SVM followers, which is overall
                                                                ployed by speakers based on different grounding
more accurate.
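The classification setup described above can be sketched as follows. Only the classifier settings (C-Support Vector Classification with the default RBF kernel and C = 100, see footnote 8) and the 10-fold cross-validation come from the text; the feature vectors and labels here are random stand-ins for the real per-segment feature vectors and GFC/NGFC annotations.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Toy stand-ins for the real data: one N-dimensional feature vector
# per dialogue segment, labelled GFC (1) or NGFC (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 28))    # e.g., N = 28 features for the followers
y = rng.integers(0, 2, size=200)  # gold grounding condition

# C-Support Vector Classification with the default RBF kernel;
# C = 100 is the value reported in footnote 8.
clf = SVC(kernel="rbf", C=100)

# Out-of-fold predictions from a 10-fold cross-validation
# over the entire dataset.
pred = cross_val_predict(clf, X, y, cv=10)

print(classification_report(y, pred, target_names=["NGFC", "GFC"]))
print(confusion_matrix(y, pred))
```

Note that scikit-learn's default is C = 1.0; C = 100 reproduces the paper's reported best setting.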
                                                                conditions. Our statistical analyses reliably indi-
    8
      We experiment with the scikit-learn Python li-            cate that several aspects of the language used in the
brary (Pedregosa et al., 2011) for C-Support Vector Classi-
fication. We use the default Radial Basis Function (rbf) ker-   conversation mirror whether the referred image is
nel. Parameter C set to 100 gives the best results.             – or not – mutually shared by the interlocutors.
    9
      The VIF test indicates whether there is a strong linear   Moreover, the effectiveness of a simple feature-
association between a predictor and the others (Pituch and
Stevens, 2016). When the VIF index exceeded 10, we per-         based classifier to distinguish between the two fol-
formed a variable deletion (Myers, 1990).                       lowers’ conditions further indicates that the lan-
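The VIF-based feature pruning can be sketched in plain NumPy as below. This is a minimal sketch under one assumption: the paper only states that variables with VIF above 10 were deleted, so the iterative drop-the-worst-then-recompute loop is a common strategy filled in here, not necessarily the authors' exact procedure.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column of X.

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from an OLS regression
    of feature i on all remaining features (plus an intercept).
    """
    vifs = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ beta).var() / y.var()
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs

def prune_multicollinear(X, threshold=10.0):
    """Indices of the columns kept after iterative VIF-based deletion."""
    keep = list(range(X.shape[1]))
    while True:
        vifs = vif(X[:, keep])
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            return keep
        del keep[worst]

# Demo: the fourth column is (almost) the sum of the first two, so one
# of the collinear columns exceeds the threshold and gets deleted.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
X_demo = np.column_stack([base, base[:, 0] + base[:, 1] + 1e-3 * rng.normal(size=200)])
kept = prune_multicollinear(X_demo)
```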
                       Accuracy            Precision                   Recall                   F1-score
                                    GFC     NGFC       Av.     GFC     NGFC      Av.    GFC      NGFC      Av.

        SVM leaders       0.57      0.15     0.89      0.40     0.50     0.58    0.55    0.23     0.70     0.50
       SVM followers      0.80      0.77     0.79      0.78     0.73     0.82    0.78    0.75     0.80     0.78

Table 1: Accuracy, Precision, Recall, and F1-score of our SVM models, computed per class on a 10-
fold cross-validation, with the corresponding weighted averages (Av.). Since our two classes (GFC and
NGFC) are not balanced, chance level is 0.57.


              SVM leaders         SVM followers                 Philippe Morency.     2017.    Multimodal ma-
                                                                chine learning: A survey and taxonomy. CoRR,
             GFC       NGFC       GFC      NGFC                 abs/1705.09406.

   GFC         243     1381       1245      379               Ellen G Bard and MP Aylett. 2000. Accessibility, du-
   NGFC        242     1911        461      1692                 ration, and modeling the listener in spoken dialogue.
                                                                 In Proceedings of the Götalog 2000 Fourth Work-
                                                                 shop on the Semantics and Pragmatics of Dialogue.
Table 2: The confusion matrices produced by our
SVM models on a 10-fold cross-validation.                     Nilavra Bhattacharya, Qing Li, and Danna Gurari.
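As a sanity check, the per-class scores in Table 1 can be recomputed from the confusion matrices in Table 2. The orientation below (rows as predicted labels, columns as gold labels) is an inferred reading, chosen because it is the one under which the reported precision, recall, and accuracy come out right:

```python
import numpy as np

# SVM leaders' confusion matrix from Table 2:
# rows = predicted label, columns = gold label (order: GFC, NGFC).
cm = np.array([[243, 1381],
               [242, 1911]])

def per_class_scores(cm, names=("GFC", "NGFC")):
    scores = {}
    for i, name in enumerate(names):
        tp = cm[i, i]
        precision = tp / cm[i, :].sum()  # among items predicted as this class
        recall = tp / cm[:, i].sum()     # among items truly in this class
        f1 = 2 * precision * recall / (precision + recall)
        scores[name] = (precision, recall, f1)
    return scores

scores = per_class_scores(cm)
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 2))  # 0.57, the SVM leaders' accuracy in Table 1
print({k: tuple(round(v, 2) for v in t) for k, t in scores.items()})
# {'GFC': (0.15, 0.5, 0.23), 'NGFC': (0.89, 0.58, 0.7)}
```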
                                                                2019. Why does a visual question have different an-
                                                                swers? In Proceedings of the IEEE International
guage used by the speakers differs along several                Conference on Computer Vision, pages 4271–4280.
dimensions. We believe this capability of humans              Steven Bird, Ewan Klein, and Edward Loper. 2009.
to flexibly tune their language underpins their suc-             Natural Language Processing with Python: Analyz-
cess in communication. We suggest that efforts                   ing text with the natural language toolkit. O’Reilly
should be put in developing conversational AI sys-               Media, Inc.
tems that are capable to master language with a               Susan E Brennan, Alexia Galati, and Anna K Kuhlen.
similar flexibility. This could be achieved, for ex-            2010. Two minds, one dialog: Coordinating speak-
ample, by exposing models to one or the other                   ing and understanding. In Psychology of learning
                                                                and motivation, volume 53, pages 301–344. Else-
condition during training to encourage them en-
                                                                vier.
code the relevant linguistic features. Alternatively,
they should first understand whether the grounded             Paula M Brown and Gary S Dell. 1987. Adapting
information which is referred to is available to                production to comprehension: The explicit mention
                                                                of instruments. Cognitive Psychology, 19(4):441 –
them or not. These are open challenges that we                  472.
plan to tackle in future work.
                                                              Marc Brysbaert, Amy Beth Warriner, and Victor Ku-
Acknowledgments                                                perman. 2014. Concreteness ratings for 40 thou-
                                                               sand generally known English word lemmas. Be-
EG carried out part of the work while being an                 havior research methods, 46(3):904–911.
ERASMUS+ visiting student at the University of                Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chat-
Amsterdam. SP and RF are funded by the Eu-                      topadhyay, Viraj Prabhu, and Devi Parikh. 2017. It
ropean Research Council (ERC) under the Euro-                   takes two to tango: Towards theory of AI’s mind.
pean Union’s Horizon 2020 research and inno-                    CoRR, abs/1704.00717.
vation programme (grant agreement No. 819455                  Herbert H. Clark and Deanna Wilkes-Gibbs. 1986. Re-
awarded to RF).                                                 ferring as a collaborative process. Cognition, 22:1–
                                                                39.

References                                                    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo,
                                                                Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P
Anne H Anderson, Miles Bader, Ellen Gurman Bard,                Bigham. 2018. Vizwiz grand challenge: Answer-
  Elizabeth Boyle, Gwyneth Doherty, Simon Garrod,               ing visual questions from blind people. In Proceed-
  Stephen Isard, Jacqueline Kowtko, Jan McAllister,             ings of the IEEE Conference on Computer Vision
  Jim Miller, et al. 1991. The HCRC map task corpus.            and Pattern Recognition, pages 3608–3617.
  Language and speech, 34(4):351–366.
                                                              Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-                  Gelderloos, Elia Bruni, and Raquel Fernández.
  2019. The PhotoBook dataset: Building common             Takuma Udagawa and Akiko Aizawa. 2019. A nat-
  ground through visually-grounded dialogue. In Pro-         ural language corpus of common grounding under
  ceedings of the 57th Annual Meeting of the Asso-           continuous and partially-observable context. CoRR,
  ciation for Computational Linguistics, pages 1895–         abs/1907.03399.
  1910.
                                                           Andrew P Yonelinas. 2002. The nature of recollection
Felix Hill and Anna Korhonen. 2014. Concreteness             and familiarity: A review of 30 years of research.
  and subjectivity as dimensions of lexical meaning.         Journal of Memory and Language, 46(3):441–517.
  In Proceedings of the 52nd Annual Meeting of the
  Association for Computational Linguistics (Volume
  2: Short Papers), pages 725–731.
Geoffrey Leech, Paul Rayson, et al. 2014. Word fre-
  quencies in written and spoken English: Based on
  the British National Corpus. Routledge.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James
  Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
  and C Lawrence Zitnick. 2014. Microsoft COCO:
  Common objects in context. In European Confer-
  ence on Computer Vision, pages 740–755. Springer.
Calion Lockridge and Susan Brennan. 2002. Ad-
  dressees’ needs influence speakers’ early syntactic
  choices. Psychonomic bulletin & review, 9:550–7,
  10.
Stephen Monsell, Michael C Doyle, and Patrick N Hag-
   gard. 1989. Effects of frequency on visual word
   recognition tasks: Where are they? Journal of Ex-
   perimental Psychology: General, 118(1):43.
Raymond H Myers. 1990. Classical and modern re-
  gression with applications. Duxbury, Boston, MA,
  2nd edition.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gram-
  fort, Vincent Michel, Bertrand Thirion, Olivier
  Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
  Weiss, Vincent Dubourg, et al. 2011. Scikit-learn:
  Machine learning in Python. Journal of Machine
  Learning Research, 12:2825–2830.
Keenan A Pituch and James P Stevens. 2016. Applied
  Multivariate Statistics for the Social Sciences. Rout-
  ledge, 6th edition.
Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan
  Zhang, SM Ali Eslami, and Matthew Botvinick.
  2018. Machine theory of mind. In International
  Conference on Machine Learning, pages 4218–
  4227.
Ravi Shekhar, Sandro Pezzelle, Aurélie Herbelot, Moin
  Nabi, Enver Sangineto, and Raffaella Bernardi.
  2017a. Vision and language integration: Moving
  beyond objects. In IWCS 2017—12th International
  Conference on Computational Semantics—Short pa-
  pers.
Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich,
  Aurélie Herbelot, Moin Nabi, Enver Sangineto, and
  Raffaella Bernardi. 2017b. FOIL it! find one mis-
  match between image and language caption. In Pro-
  ceedings of the 55th Annual Meeting of the Associa-
  tion for Computational Linguistics (Volume 1: Long
  Papers), pages 255–265.