<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Grounded and Ungrounded Referring Expressions in Human Dialogues: Language Mirrors Different Grounding Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleonora Gualdoni</string-name>
          <email>eleonora.gualdoni@studenti.unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaella Bernardi</string-name>
          <email>raffaella.bernardi@unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raquel Fernández</string-name>
          <email>raquel.fernandez@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Pezzelle</string-name>
          <email>s.pezzelle@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Trento and University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We study how language use differs between dialogue partners in a visually grounded reference task when a referent is mutually identifiable by both interlocutors vs. when it is only available to one of them. In the latter case, the addressee needs to disconfirm a proposed description - a skill largely neglected by both the theoretical and the computational linguistics communities. We consider a number of linguistic features that we expect to vary across conditions. We then analyze their effectiveness in distinguishing between the two conditions by means of statistical tests and a feature-based classifier. Overall, we show that language mirrors the different grounding conditions, paving the way for deeper future investigation of referential disconfirmation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Communication is a joint activity in which
interlocutors share or synchronize aspects of their
private mental states and act together in the world.
To understand what our minds actually do during
communication, Brennan et al. (2010) highlight
the need to study language in interpersonal
coordination scenarios. When a conversation focuses
on objects, interlocutors have to reach the
mutual belief that the addressee has identified the
discussed referent by means of visual grounding. In
this framework, Clark and Wilkes-Gibbs (1986) have
pointed to referring as a collaborative process that
requires action and coordination by both
speakers and addressees, and that needs to be
studied with a collaborative model.</p>
      <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>[Figure 1: Example dialogue segments in the two conditions. Grounded condition. L: “i have grapefruit with carrots and celery”; F: “yep me too might be a blood orange though really dark”. Non-grounded condition. L: “what about a guy in a suit and black hat holding a blue plaid umbrella with more of them around him”; F: “i do not have that one”.]</p>
      <p>Clark and Wilkes-Gibbs (1986), in fact, have highlighted that – in
order to refer to an object in the world – speakers
must believe that the referent is mutually
identifiable to them and their addressees. This is an
important skill that human speakers leverage to
succeed in communication.</p>
      <p>However, humans are not only able to identify
an object described by the interlocutor – that is,
to ground a referring expression – but also to
understand that such an object is not in the scene
and, therefore, it cannot be grounded. It can
happen, indeed, that a referent is not mutually
identifiable because the interlocutors are
in different grounding conditions. In this case, the
addressee is able to disconfirm a description stated
by the interlocutor by communicating that they
do not see it (as in Figure 1). This is a crucial
skill of human speakers. However, it is often
neglected in the computational modelling of
conversational agents.</p>
      <p>
        We conjecture that the participants’ visual
grounding conditions have an impact on the
linguistic form and structure of their utterances. If
confirmed, our hypothesis would lead to the claim
that mature AI dialogue systems should learn to
master their language with the flexibility shown by
humans. In particular, their language use should
differ depending on whether the referred object is
mutually identifiable or not. It has been shown that current AI
multimodal systems are not able to decide if a visual
question is answerable or not
        <xref ref-type="bibr" rid="ref4">(Bhattacharya et al.,
2019)</xref>
        , and they fail to identify whether the entity
to which an expression refers is present in the visual
scene or not
        <xref ref-type="bibr" rid="ref23 ref22">(Shekhar et al., 2017b; Shekhar et al.,
2017a)</xref>
        . We believe models can acquire this skill if
they learn to play the “language game” properly.
      </p>
      <p>In this paper, we investigate how the language
of human conversational partners changes when
they are in a mutually grounded (they both see the
image they are speaking about) or non-mutually
grounded setting (one sees the image while the
other does not).</p>
      <p>We find that, indeed, there are statistically
significant differences along various linguistic
dimensions, including utterance length, parts of
speech, and the degree of concreteness of the
words used. Moreover, a simple SVM classifier
based on these same features is shown to be able
to distinguish between the two conditions with a
relatively high performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Dataset</title>
      <p>We take the PhotoBook dataset <xref ref-type="bibr" rid="ref12">(Haber et al.,
2019)</xref> as our testbed: two participants play a game
where each sees a different grid with six images
showing everyday scenes.1 Some of the images
are common to both players, while others are only
displayed to one of them. In each grid, three of the
images are highlighted. By chatting with their
dialogue partner, each player needs to decide whether
each of the three highlighted images is also visible
to their partner or not.</p>
      <p>
        A full game consists of five rounds, and the
players can decide to move to the next round when
they are confident about their decisions. As the
game progresses, some images may reappear in
subsequent rounds. The corpus is divided into
dialogue segments: the consecutive utterances that,
as a whole, discuss a given target image and
include expressions referring to it. From the set of
all segments in PhotoBook, we create our dataset
by focusing on segments belonging to the first
round of a game (since at that point all images
are new to the participants) and where a single
image is being discussed.2 This results in a
dataset composed of 3,777 segments paired with
a given image referent and an action label
indicating whether the referent is visible to both
participants or only to one. The annotated dataset,
together with other relevant materials, is available at:
https://dmg-photobook.github.io/
      </p>
      <p>
        1The images used in the PhotoBook task are taken from
the MS COCO 2014 Trainset
        <xref ref-type="bibr" rid="ref15">(Lin et al., 2014)</xref>
        .
      </p>
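      <p>For illustration, the filtering step above can be sketched as follows (a minimal sketch: the JSON layout with games, rounds, segments, targets, and common_images is hypothetical; the released dataset comes with its own materials):</p>
      <preformat>
import json

def load_segments(path):
    """First-round segments that discuss a single target image."""
    with open(path) as f:
        games = json.load(f)
    segments = []
    for game in games:
        for seg in game["rounds"][0]["segments"]:  # first round only
            if len(seg["targets"]) != 1:           # single target image only
                continue
            segments.append({
                "utterances": seg["utterances"],
                "image": seg["targets"][0],
                # grounded = target visible to both participants
                "grounded": seg["targets"][0] in game["common_images"],
            })
    return segments
      </preformat>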
      <p>
        The PhotoBook task does not impose a specific
role on the players, unlike for example the
MapTask corpus
        <xref ref-type="bibr" rid="ref1">(Anderson et al., 1991)</xref>
        , where there
are predefined information giver and information
follower roles. In PhotoBook, the dialogues
typically follow this scheme: one of the participants
spontaneously decides to describe one of the
images highlighted in their grid and the other
participant indicates whether they also have it in their
own grid or not. We call the former player the
leader and the latter the follower.3 We refer to
situations where the follower also sees the image
described by the leader as the grounded
condition and those where the follower does not see the
image as the non-grounded condition. Naturally,
the leader always sees the referent image.
      </p>
      <p>Out of the 3,777 dialogue segments in our
dataset, 1,624 belong to the grounded condition
and 2,153 to the non-grounded one.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Linguistic Features</title>
      <p>We hypothesize that the language used by the
dialogue participants will differ in the grounded vs.
non-grounded condition. To test this hypothesis,
we first identify several linguistic features that we
expect to vary across conditions.</p>
      <p>Length. We expect that the length of the
utterances and the overall dialogue segments may
depend on whether the players can see the
referent. For example, in the non-grounded condition
more utterances may be needed to conclude that
the follower does not see the referent (thus leading
to longer segments). Furthermore, not seeing the
referred image could limit the expressivity of the
utterances by the non-grounded follower (thus leading
to shorter utterances).</p>
      <p>2We discard segments that refer to more than one image
as well as those labelled with the wrong image by the original
heuristics (Haber et al., 2019).</p>
      <p>3We use simple heuristics to assign these roles a
posteriori: when the image is not in common, we label as the
follower the participant who does not see the image, while
when the image is visible to both participants we consider
the follower the player who produces the last utterance of the
segment. We manually corrected the classification of the few
segments that did not follow this general rule.</p>
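      <p>A minimal sketch of this role-assignment heuristic (the segment representation, with speaker ids A/B and a player_without_image field, is assumed for illustration):</p>
      <preformat>
def assign_roles(segment, grounded):
    """Return (leader, follower) speaker ids for one dialogue segment."""
    if not grounded:
        # Image not in common: the follower is the player who cannot see it.
        follower = segment["player_without_image"]
    else:
        # Image visible to both: the follower is the player who produces
        # the last utterance of the segment.
        follower = segment["utterances"][-1]["speaker"]
    leader = "B" if follower == "A" else "A"
    return leader, follower
      </preformat>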
      <p>We compute utterance length as number of
tokens per utterance and segment length as both
number of tokens per segment and number of
utterances per segment.</p>
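      <p>In code, these length features amount to the following (a sketch assuming utterances are raw strings, tokenized here with NLTK):</p>
      <preformat>
from nltk.tokenize import word_tokenize

def length_features(utterances):
    """Average tokens per utterance, plus tokens and utterances per segment."""
    token_counts = [len(word_tokenize(u)) for u in utterances]
    return {
        "avg_utterance_length": sum(token_counts) / len(token_counts),
        "segment_length_tokens": sum(token_counts),
        "segment_length_utterances": len(utterances),
    }
      </preformat>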
      <p>
        Word frequency. Frequency effects are key in
psycholinguistics. Word frequency is one of
the strongest predictors of processing efficiency
        <xref ref-type="bibr" rid="ref17">(Monsell et al., 1989)</xref>
        and experiments have
confirmed its link to memory performance
        <xref ref-type="bibr" rid="ref25">(Yonelinas, 2002)</xref>
        . It is plausible that different grounding
conditions lead to different word choices, and that
word frequency turns out to be a key aspect of this
linguistic variation.
      </p>
      <p>
        To estimate word frequency, we use
off-the-shelf lemma frequency scores (frequency per
million tokens) from the British National Corpus
        <xref ref-type="bibr" rid="ref14">(Leech et al., 2014)</xref>
        .4 For each segment in our
dataset, we compute the average word frequency
by first lemmatizing the words in the segment and
then calculating the average frequency score for
all lemma types in the segment.5
Concreteness. Concreteness is fundamental to
human language processing since it helps to
clearly convey information about the world
        <xref ref-type="bibr" rid="ref13">(Hill
and Korhonen, 2014)</xref>
        . We use the concreteness
scores collected by <xref ref-type="bibr" rid="ref8">Brysbaert et al. (2014)</xref> for
40K English word lemmas via crowd-sourcing:
participants were asked to rate word concreteness
on a 5-point scale ranging from abstract to
concrete. We compute the average word concreteness
by first lemmatizing the words in the segment and
then calculating the average score for all lemma
types in the segment without repetitions, divided
by part-of-speech (POS).6
      </p>
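      <p>Both averages can be computed with the same routine (a sketch: bnc_freq and concreteness stand for dictionaries loaded from the two resources cited above; we lemmatize with NLTK's WordNetLemmatizer as one possible choice, and omit the per-POS breakdown used for concreteness):</p>
      <preformat>
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def average_score(tokens, scores):
    """Average a lexical score over the lemma types in a segment."""
    lemma_types = {lemmatizer.lemmatize(t.lower()) for t in tokens}
    # Lemmas missing from the resource are ignored (footnotes 5 and 6).
    covered = [scores[lemma] for lemma in lemma_types if lemma in scores]
    return sum(covered) / len(covered) if covered else None

# usage: average_score(tokens, bnc_freq)      # BNC lemma frequencies
#        average_score(tokens, concreteness)  # Brysbaert et al. (2014) ratings
      </preformat>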
      <p>Parts of Speech distributions. Different POS
differ in their function and descriptive power.
We thus expect that their distribution will vary
between grounded and non-grounded conditions.
For example, we expect nouns and adjectives to be
more likely in visually grounded referential acts,
while determiners may signal whether the referent
is in common ground or not (the vs. a) and give
clues about the polarity of the context where they
are used (any vs. each).</p>
      <p>4Available at http://ucrel.lancs.ac.uk/bncfreq/flists.html
5Lemmas not present in the BNC lists are ignored.
6Lemmas not present in the corpus are ignored.</p>
      <p>We extract POS distributions by first
POS-tagging the utterances in the dataset7 and then
computing the proportion of words per segment
that are nouns, adjectives, verbs, or determiners,
respectively. Given the different functions of
different determiners, we break down this class and
independently compute proportions for each of the
following determiners: a/an, the, that, those, this,
these, some, all, each, any, half, both.</p>
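      <p>A sketch of this extraction step with NLTK's universal tagset (for simplicity, a and an are counted separately here rather than as a single a/an class):</p>
      <preformat>
import nltk

DETERMINERS = ["a", "an", "the", "that", "those", "this",
               "these", "some", "all", "each", "any", "half", "both"]

def pos_features(tokens):
    """Proportions of selected POS tags and determiners in a segment."""
    tagged = nltk.pos_tag(tokens, tagset="universal")
    n = len(tagged)
    feats = {pos: sum(tag == pos for _, tag in tagged) / n
             for pos in ("NOUN", "ADJ", "VERB", "DET")}
    for det in DETERMINERS:
        feats["det_" + det] = sum(w.lower() == det for w, _ in tagged) / n
    return feats
      </preformat>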
    </sec>
    <sec id="sec-5">
      <title>4 Statistical Analysis</title>
      <p>To test our hypothesis that the language used by
the participants differs in the grounded vs.
non-grounded condition, we perform a statistical
analysis on our data. We compare: (1) the utterances
by the leaders in the grounded and non-grounded
conditions, and (2) the utterances by the followers
in the grounded and non-grounded conditions. We
evaluate the statistical significance of these
comparisons with a Mann-Whitney U Test, which does
not assume the data fits any specific distribution
type. Below we report the results of each of these
comparisons. Unless otherwise specified,
statistical significance is tested at p &lt; 0.001.</p>
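      <p>Each comparison can be run with SciPy (a sketch; grounded and non_grounded stand for lists of per-segment feature values for one speaker role):</p>
      <preformat>
from scipy.stats import mannwhitneyu

def compare_conditions(grounded, non_grounded, alpha=0.001):
    """Two-sided Mann-Whitney U test; no distributional assumption."""
    stat, p = mannwhitneyu(grounded, non_grounded, alternative="two-sided")
    return stat, p, p &lt; alpha
      </preformat>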
      <p>Length. Followers use significantly fewer words
while leaders use significantly more words in the
non-grounded condition than in the grounded
condition. This trend is also illustrated in the example
in Figure 1. Although followers use fewer words
in the non-grounded condition, they produce a
significantly higher number of utterances per
segment, while no reliable differences are observed
for the leaders (see Figure 2a and 2e, respectively).
These findings indicate that establishing that a
referring expression cannot be commonly grounded
requires more evidence and more information than
resolving the expression.</p>
      <p>Frequency. Followers use significantly more
high-frequency words in the grounded condition
than the non-grounded condition, in particular for
nouns and conjunctions. This is consistent with
the reported production of more utterances per
segment in the non-grounded condition, and
suggests that the non-grounded follower uses these
additional utterances to talk about fine-grained details
described by low-frequency words. In contrast,
high-frequency verbs are reliably more common in the
non-grounded condition (see Figure 2b).</p>
      <p>
        7We use the NLTK Python library
        <xref ref-type="bibr" rid="ref5">(Bird et al., 2009)</xref>
        in its
“universal” tagset version.
      </p>
      <p>[Figure 2: bar plots per condition (grounded vs. non-grounded) for followers (panels a–d) and leaders (panels e–h), including average length in utterances and average word frequency (per million tokens) for verbs; each panel title reports whether the difference is statistically reliable (SR = ***) or not (SR = NR).]</p>
      <p>For example, note the high-frequency verbs do
and have used by the non-grounded follower in
Figure 1. The language of leaders, in contrast,
shows marginally reliable or no difference across
conditions regarding word frequency (see, e.g.,
the case of verbs in Figure 2f), except for
high-frequency nouns and conjunctions, which are
reliably more common in the grounded condition
(p &lt; 0.01).</p>
      <p>Concreteness. Somewhat counterintuitively,
followers use overall significantly more concrete
words in the non-grounded than in the grounded
condition. However, an opposite pattern is found
for adjectives, which usually describe the colors of
the objects in the scene (see Figure 2c). This
latter result is in line with our intuitions: in the
non-grounded condition, followers do not have direct
access to the specific perceptual properties of the
entities in the image and hence use less concrete
adjectives. As for the leaders, while nouns are
reliably different, for the other POS there is either
no or marginally reliable difference (see adjectives
in Figure 2g, adverbs, conjunctions, and numerals)
between the two conditions. This is expected since
their language is always visually grounded.
Parts of speech. Followers use significantly
more nouns and the determiners a/an, the, each
in the grounded condition, while in the
non-grounded condition they use significantly more
verbs (see Figure 2d) and determiners all and any.
That is, the grounded condition leads followers to
more directly describe what they see by focusing
on a specific object, as in the grounded example in
Figure 1. In contrast, the non-grounded condition
elicits utterances with more ‘confirmation’ verbs
such as do and have, and vaguer language
signalled by the use of quantifiers, e.g., “I don’t
have any of a cake”. As for the leaders, we
observe a mixed pattern of results, though, overall,
there are fewer reliable differences between the two
conditions compared to the followers (see the case
of verbs in Figure 2h).</p>
    </sec>
    <sec id="sec-6">
      <title>5 Automatic Classification</title>
      <p>To more formally investigate the effectiveness of
our selected features in distinguishing between
the two grounding conditions, we feed them into
an SVM classifier which predicts the condition of
each segment: grounded (GFC) or non-grounded (NGFC).
We run two SVM models: one for leaders, SVM
leaders, and one for followers, SVM followers.8
Our hypothesis is that SVM leaders should not be
very effective in the binary classification task since
the language of the leaders differs only on few
aspects, and less reliably between the two conditions
compared to the followers’. In contrast, we expect
SVM followers to achieve a good performance in
the task, given the significant differences observed
between the two conditions.</p>
      <p>Starting from all our linguistic features (see
above), we excluded those that turned out to be
multicollinear in a Variance Inflation Factor test
(VIF).9 The resulting N features (27 for the
leaders, 28 for the followers) were used to build, for
each datapoint, an N-dimensional vector of
features that was fed into the classifier. We performed
10-fold cross-validation on the entire dataset.</p>
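      <p>A minimal sketch of this pipeline (the one-feature-at-a-time deletion loop is one possible reading of the VIF procedure in footnote 9; X and y stand for the feature matrix and the condition labels):</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_multicollinear(X, threshold=10.0):
    """Iteratively delete the feature with the highest VIF above threshold."""
    X = np.asarray(X, dtype=float)
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
        if max(vifs) &lt;= threshold:
            break
        X = np.delete(X, int(np.argmax(vifs)), axis=1)
    return X

# usage (X: per-segment feature vectors, y: GFC/NGFC labels); settings
# from footnote 8: RBF kernel, C=100, 10-fold cross-validation
# scores = cross_val_score(SVC(kernel="rbf", C=100), drop_multicollinear(X), y, cv=10)
      </preformat>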
      <p>Table 1 reports the accuracy, precision, recall
and F1-score of the two SVM models. While SVM
leaders is at chance level, SVM followers achieves
a fairly high performance in the binary
classification task. This indicates that our linguistic features
are effective in distinguishing among the two
conditions in the followers’ segments. These results
confirm that the language of the speakers in the
follower role is affected by their grounding
condition, and that a well-informed model is able to
capture this by means of the linguistic features
of their language.</p>
      <p>Table 2 reports the confusion matrices produced
by our SVM models after 10-fold cross-validation.
We observe that SVM leaders wrongly labels
NGFC datapoints as GFC in 1,381 cases, thus
producing a high number of false positives. This does
not happen with SVM followers, which is overall
more accurate.</p>
      <p>
        8We experiment with the scikit-learn Python
library
        <xref ref-type="bibr" rid="ref19">(Pedregosa et al., 2011)</xref>
        for C-Support Vector
Classification. We use the default Radial Basis Function (rbf)
kernel. Parameter C set to 100 gives the best results.
      </p>
      <p>
        9The VIF test indicates whether there is a strong linear
association between a predictor and the others
        <xref ref-type="bibr" rid="ref20">(Pituch and
Stevens, 2016)</xref>
        . When the VIF index exceeded 10, we
performed a variable deletion
        <xref ref-type="bibr" rid="ref18">(Myers, 1990)</xref>
        .
      </p>
    </sec>
    <sec id="sec-6a">
      <title>6 Related Work</title>
      <p>
        Current multimodal systems are trained to process
and relate modalities capturing correspondences
between “sensory” information
        <xref ref-type="bibr" rid="ref2">(Baltrusaitis et al.,
2017)</xref>
        . It has been shown that they have trouble
deciding if a question is answerable or not
        <xref ref-type="bibr" rid="ref4">(Bhattacharya et al., 2019)</xref>
        . Moreover, they fail to
identify whether the entity to which an expression
refers is present in the visual scene or not
        <xref ref-type="bibr" rid="ref23 ref22">(Shekhar
et al., 2017b; Shekhar et al., 2017a)</xref>
        . Connected
to this weakness is the limitation they encounter
when put to work as dialogue systems, where
they fail to build common ground from
minimally shared information
        <xref ref-type="bibr" rid="ref24">(Udagawa and Aizawa, 2019)</xref>
        .
To be successful in communication, speakers are
supposed to attribute mental states to their
interlocutors even when they are different from their
own
        <xref ref-type="bibr" rid="ref21 ref9">(Rabinowitz et al., 2018; Chandrasekaran et
al., 2017)</xref>
        . This, in multimodal situations, can
happen when the visual scene is only partially
shared between them. AI models have difficulties in
such conditions
        <xref ref-type="bibr" rid="ref24">(Udagawa and Aizawa, 2019)</xref>
        .
      </p>
      <p>
        We study how the language of conversational
partners changes when (i) speakers refer to an
image their interlocutor does not see and (ii)
neither of the two is aware of this unshared
visual ground. Though the idea that the
grounding conditions of the addressees can affect their
interlocutor’s language is not new in
psycholinguistics
        <xref ref-type="bibr" rid="ref16 ref3 ref6 ref7">(Brennan et al., 2010; Brown and Dell,
1987; Lockridge and Brennan, 2002; Bard and
Aylett, 2000)</xref>
        , our approach differs from previous
ones since it proposes a computational analysis
of visual dialogues. Moreover, differently from
other computational approaches
        <xref ref-type="bibr" rid="ref11 ref4">(Bhattacharya et
al., 2019; Gurari et al., 2018)</xref>
        , we investigate
scenarios where the disconfirmation of a referent’s
presence is the answer, rather than a case
of unanswerability.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>Our findings confirm that, in a visually-grounded
dialogue, different linguistic strategies are
employed by speakers based on different grounding
conditions. Our statistical analyses reliably
indicate that several aspects of the language used in the
conversation mirror whether the referred image is
– or not – mutually shared by the interlocutors.
Moreover, the effectiveness of a simple
feature-based classifier to distinguish between the two
followers’ conditions further indicates that the
language used by the speakers differs along several
dimensions. We believe this capability of humans
to flexibly tune their language underpins their
success in communication. We suggest that efforts
should be put in developing conversational AI
systems that are capable of mastering language with a
similar flexibility. This could be achieved, for
example, by exposing models to one or the other
condition during training to encourage them to
encode the relevant linguistic features. Alternatively,
they should first understand whether the grounded
information which is referred to is available to
them or not. These are open challenges that we
plan to tackle in future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>EG carried out part of the work while being an
ERASMUS+ visiting student at the University of
Amsterdam. SP and RF are funded by the
European Research Council (ERC) under the
European Union’s Horizon 2020 research and
innovation programme (grant agreement No. 819455
awarded to RF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Anne H Anderson</surname>
          </string-name>
          , Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko,
          <string-name>
            <surname>Jan</surname>
            <given-names>McAllister</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jim</given-names>
            <surname>Miller</surname>
          </string-name>
          , et al.
          <year>1991</year>
          .
          <article-title>The HCRC map task corpus</article-title>
          .
          <source>Language and speech</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ):
          <fpage>351</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Tadas</given-names>
            <surname>Baltrusaitis</surname>
          </string-name>
          , Chaitanya Ahuja, and
          <string-name>
            <given-names>Louis-Philippe</given-names>
            <surname>Morency</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          . CoRR, abs/1705.09406.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Ellen G Bard and MP Aylett</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Accessibility, duration, and modeling the listener in spoken dialogue</article-title>
          .
          <source>In Proceedings of the Götalog 2000 Fourth Workshop on the Semantics and Pragmatics of Dialogue.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Nilavra</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qing</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Danna</given-names>
            <surname>Gurari</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Why does a visual question have different answers</article-title>
          ?
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>4271</fpage>
          -
          <lpage>4280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Processing with Python: Analyzing text with the natural language toolkit.</article-title>
          O'Reilly Media, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Susan E Brennan</surname>
          </string-name>
          , Alexia Galati, and
          <string-name>
            <given-names>Anna K</given-names>
            <surname>Kuhlen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Two minds, one dialog: Coordinating speaking and understanding</article-title>
          .
          <source>In Psychology of learning and motivation</source>
          , volume
          <volume>53</volume>
          , pages
          <fpage>301</fpage>
          -
          <lpage>344</lpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Paula M Brown and Gary S Dell</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Adapting production to comprehension: The explicit mention of instruments</article-title>
          .
          <source>Cognitive Psychology</source>
          ,
          <volume>19</volume>
          (
          <issue>4</issue>
          ):
          <fpage>441</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Marc</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          , Amy Beth Warriner, and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Kuperman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Concreteness ratings for 40 thousand generally known English word lemmas</article-title>
          .
          <source>Behavior research methods</source>
          ,
          <volume>46</volume>
          (
          <issue>3</issue>
          ):
          <fpage>904</fpage>
          -
          <lpage>911</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Arjun</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          , Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, and
          <string-name>
            <given-names>Devi</given-names>
            <surname>Parikh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>It takes two to tango: Towards theory of AI's mind</article-title>
          .
          <source>CoRR</source>
          , abs/1704.00717.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Herbert H.</given-names>
            <surname>Clark</surname>
          </string-name>
          and
          <string-name>
            <given-names>Deanna</given-names>
            <surname>Wilkes-Gibbs</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Referring as a collaborative process</article-title>
          .
          <source>Cognition</source>
          ,
          <volume>22</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Danna</given-names>
            <surname>Gurari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qing</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Abigale J Stangl</surname>
          </string-name>
          , Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham.
          <year>2018</year>
          .
          <article-title>Vizwiz grand challenge: Answering visual questions from blind people</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>3608</fpage>
          -
          <lpage>3617</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Janosch</given-names>
            <surname>Haber</surname>
          </string-name>
          , et al.
          <year>2019</year>
          .
          <article-title>The PhotoBook dataset: Building common ground through visually-grounded dialogue</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>1895</fpage>
          -
          <lpage>1910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Felix</given-names>
            <surname>Hill</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Concreteness and subjectivity as dimensions of lexical meaning</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , pages
          <fpage>725</fpage>
          -
          <lpage>731</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Leech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>Rayson</surname>
          </string-name>
          , et al.
          <year>2014</year>
          .
          <article-title>Word frequencies in written and spoken English: Based on the British National Corpus</article-title>
          . Routledge.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          , Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
          <string-name>
            <given-names>C Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Microsoft COCO: Common objects in context</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          , pages
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Calion</given-names>
            <surname>Lockridge</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan</given-names>
            <surname>Brennan</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Addressees' needs influence speakers' early syntactic choices</article-title>
          .
          <source>Psychonomic bulletin &amp; review</source>
          ,
          <volume>9</volume>
          :
          <fpage>550</fpage>
          -
          <lpage>7</lpage>
          ,
          <fpage>10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Monsell</surname>
          </string-name>
          , Michael C Doyle, and
          <string-name>
            <surname>Patrick N Haggard</surname>
          </string-name>
          .
          <year>1989</year>
          .
          <article-title>Effects of frequency on visual word recognition tasks: Where are they</article-title>
          ?
          <source>Journal of Experimental Psychology: General</source>
          ,
          <volume>118</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Raymond H Myers</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Classical and modern regression with applications</article-title>
          .
          <source>Duxbury</source>
          , Boston, MA, 2nd edition.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <surname>Vincent Dubourg</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Keenan A Pituch and James P Stevens</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <source>Applied Multivariate Statistics for the Social Sciences</source>
          . Routledge, 6th edition.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Neil</given-names>
            <surname>Rabinowitz</surname>
          </string-name>
          , Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Botvinick</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Machine theory of mind</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          , pages
          <fpage>4218</fpage>
          -
          <lpage>4227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Ravi</given-names>
            <surname>Shekhar</surname>
          </string-name>
          , Sandro Pezzelle, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and
          <string-name>
            <given-names>Raffaella</given-names>
            <surname>Bernardi</surname>
          </string-name>
          .
          <year>2017a</year>
          .
          <article-title>Vision and language integration: Moving beyond objects</article-title>
          .
          <source>In IWCS 2017-12th International Conference on Computational Semantics-Short papers.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Ravi</given-names>
            <surname>Shekhar</surname>
          </string-name>
          , Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and
          <string-name>
            <given-names>Raffaella</given-names>
            <surname>Bernardi</surname>
          </string-name>
          .
          <year>2017b</year>
          .
          <article-title>FOIL it! find one mismatch between image and language caption</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>255</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Takuma</given-names>
            <surname>Udagawa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Akiko</given-names>
            <surname>Aizawa</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A natural language corpus of common grounding under continuous and partially-observable context</article-title>
          .
          <source>CoRR</source>
          , abs/1907.03399.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Andrew P Yonelinas</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>The nature of recollection and familiarity: A review of 30 years of research</article-title>
          .
          <source>Journal of Memory and Language</source>
          ,
          <volume>46</volume>
          (
          <issue>3</issue>
          ):
          <fpage>441</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>