Probability Distributions as a Litmus Test to Inspect NNs Grounding Skills

Alex J. Lucassen (1), Alberto Testoni (2) and Raffaella Bernardi (1,2)
(1) CIMeC, University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
(2) DISI, University of Trento, Via Sommarive 9, 38123 Povo (TN), Italy

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [1]
Contacts: alex.lucassen@studenti.unitn.it (A. J. Lucassen); alberto.testoni@unitn.it (A. Testoni); raffaella.bernardi@unitn.it (R. Bernardi)
ORCID: 0000-0002-1388-0444 (A. J. Lucassen); 0000-0002-4130-0056 (A. Testoni); 0000-0002-3423-1208 (R. Bernardi)

Abstract
Today's AI systems are ultimately trained with a classifier to perform a downstream task and are mostly evaluated on the task success they reach. Not enough attention is given to how the classifier distributes the probabilities among the candidates out of which the target with the highest probability is selected. We propose to take the probability distribution as a litmus test to inspect models' grounding skills. We take a visually grounded referential guessing game as a test-bed and use the probability distribution to evaluate whether question-answer pairs are well grounded by the model. To this end, we propose a method to obtain such soft labels automatically and show that they correlate well with human uncertainty about the grounded interpretation of the QA pair. Our results show that higher task accuracy does not necessarily correspond to a more meaningful probability distribution; we do not consider trustworthy the models that fail our litmus test.

Keywords: Referential Guessing Games, Soft Labels, Interpretable and Trustworthy Agents

1. Introduction

Classification tasks represent the backbone of most AI systems. These systems are trained to assign probabilities to a set of labels, while the underlying model learns to reach a suitable representation of the given input. The evaluation usually consists of comparing the probability distribution of the model against the ground-truth label. Instead, we conjecture it is important to look into the probability distribution over the whole set of candidate labels, and not only at the one that receives the highest probability. In our work, we bring evidence that modelling uncertainty is a crucial step for building trustworthy AI systems.

Probabilities are crucial during problem-solving tasks. In Cognitive Science, a typical test to study humans' problem-solving strategies is the 20Q game, in which a Questioner has to ask a sequence of questions to guess which target object the other player, the Oracle, has been assigned. These studies focus on how the Questioner's conjectures about the target drive the sequence of questions he/she asks. Models have been evaluated on tasks such as the 20Q game, but again the evaluation has mostly focused on task success or on the linguistic quality of the generated sequence of questions. We believe models should also be compared on how sound and interpretable the probabilities they assign to the various candidates are.

Figure 1: Although in both games the target object is guessed correctly (it receives the highest probability) at the last turn, the trust in the model decreases when looking at the probability distribution at the first turn.
When challenged with tasks in which the agent aims to incrementally gain confidence in its hypothesis and therefore arrives at a well-considered decision, the probability assigned to the candidates at a given step should impact the reasoning process and guide the natural language generation. Therefore, assigning meaningful probabilities is also crucial for the efficacy and coherence of the dialogue structure. We take GuessWhat?! [2], a grounded version of the 20Q game, as a test-bed and shed light on how the probability distribution of models serves as a litmus test for evaluating the extent to which their task success can be trusted.

Figure 1 illustrates two cases where a state-of-the-art (SOTA) model correctly predicts the target object based on the full dialogue generated by humans playing the game. However, if we look at the probabilities assigned after the first turn, we see that they are uninterpretable. In the example on the left, one would expect the first question-answer (QA) pair (the category question: 'Is it an animal?' 'No') to lead to low probabilities for objects #0 and #2, which are cats, and higher ones for the other two candidates, the vase and the potted plant. Contrary to these expectations, the cats receive higher probabilities than the other two objects, and one cat seems more probable than the other, clearly due to some data bias. In the example on the right, after the first QA pair (a spatial question: 'is it in right?' 'Yes'), object #0 is assigned a very high probability, much higher than object #1, which is actually further to the right in the image.

In this paper, we call attention to this crucial aspect of neural network models and propose a method to evaluate their probability distributions: we take human disagreement as a proxy for human uncertainty and design rule-based systems that simulate such disagreement by leveraging the human annotation; these systems can then be used to automatically annotate larger datasets. Given the nature of the GuessWhat?! dialogues, at each turn we can identify the set of objects that, given the information received up to that point, could still be the target, and those which have instead been excluded. We call these two sets the reference set and the complement set. Each turn brings an update of the reference set: when the question is positively (vs. negatively) answered, the reference set keeps all the candidates that have (vs. do not have) the property asked about in the question. The candidates that are not kept in the reference set move to the complement set. For the QA pair to be well grounded, we would expect the probabilities assigned to members of these two sets to differ, and those of the complement set to be close to zero. Obviously, the cut between the two sets is not always crystal clear, and the members of the sets might have a different status: for some members of the reference (vs. complement) set, the model could be more confident than for others, that is, it should assign them a higher (vs. lower) probability; they should receive different soft labels.
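To make the reference/complement distinction concrete, the following is a minimal sketch of how the two sets could be maintained after one turn. The object representation and the has_property predicate are hypothetical placeholders, not part of the GuessWhat?! release.

```python
def update_sets(reference_set, asked_property, answer_is_yes, has_property):
    """Split the current reference set after one question-answer pair.

    reference_set:  iterable of candidate object ids still compatible
                    with the dialogue so far.
    asked_property: the property asked about (e.g. 'animal', 'left').
    answer_is_yes:  True if the Oracle answered 'Yes'.
    has_property:   hypothetical predicate has_property(obj_id, prop) -> bool.
    """
    new_reference, new_complement = [], []
    for obj in reference_set:
        # Keep the candidate if its property value matches the answer.
        keep = has_property(obj, asked_property) == answer_is_yes
        (new_reference if keep else new_complement).append(obj)
    return new_reference, new_complement
```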
The difficulty is how to obtain such labels so as to properly evaluate the model's grounding and reasoning skills. We contribute to this challenge in the following way. First of all, we propose a method for obtaining soft-label distributions automatically. It consists of two phases: 1) a data collection step with human subjects on a small sample, and 2) an automatic annotation of the full dataset through a rule-based system, after having verified its correlation with the human data on the sample set. Concretely, we take the first turns of GuessWhat?! games as a case study and collect human annotations. Our data collection experiment shows that humans highly agree on grounding category and object questions ("Is it an animal?" / "Is it a dog?") but less so on grounding spatial questions, in particular absolute questions ("Is it on the left?"). We take humans' disagreement as a proxy for humans' uncertainty and obtain human-based soft labels for a sample of the GuessWhat?! dataset. We implement rule-based systems that simulate such disagreement/uncertainty and use the human annotation data to verify that the resulting soft labels correlate well with those derived from human disagreement on the sample dataset. We use such systems to automatically annotate the first turns in the GuessWhat?! test set containing category and spatial questions. Secondly, we report how much the models' probability distributions correlate with the soft labels obtained automatically for the full test set. Finally, we propose metrics that quantify to what extent the models' probability distribution reflects the intuition that candidates in the complement set should have a probability as close as possible to zero, or at least lower than a certain threshold, while all candidates in the reference set should receive a probability not lower than a given threshold. Since grounding negatively answered questions has been shown to be harder than grounding positively answered ones, both for models and for humans [3], we report our results by comparing turns containing a negative vs. a positive answer.

2. Related Work

GuessWhat?! Over the years, a number of different approaches have been proposed for the GuessWhat?! task [4]. [2] proposed a baseline model for the Oracle, the Questioner, and the Guesser, the latter being the model that guesses the target object after the dialogue has been concluded. [5] introduced a Deep Reinforcement Learning model for building a multi-modal goal-directed dialogue system. [6] sought to mitigate the issue of dialogue agents focusing on simple utterances and introduced a class of temperature-based extensions for policy gradient methods called Tempered Policy Gradients (TPGs). This method was used to create an improved Guesser model based on Deep Reinforcement Learning, TPGs, and Memory-Attention (two-hop attention) [6, 4]. [7] proposed a grounded dialogue state encoder (GDSE), which combines training the Question Generator and the Guesser by learning both in a multi-task fashion. More recently, [8] proposed new models for the GuessWhat?! task based on ViLBERT, thereby allowing the models to take advantage of a pre-trained vision-and-language model; their models for the Oracle, Guesser, and Questioner all outperformed state-of-the-art models. [9] adapted LXMERT [10], a multimodal universal encoder, to act as the Oracle, reaching an important improvement over the baseline. [11] used GuessWhat?! as a way to evaluate the performance of two pre-trained Transformers, LXMERT and RoBERTa [12], and [3] used Guesser models with LXMERT and RoBERTa as encoders. In our work, we take GDSE as a baseline model and LXMERT as a representative instance of pre-trained transformer-based models.
None of these works, however, has studied how to assign probabilities to the different candidates. Even [8], in which the Guesser was for the first time trained incrementally turn by turn, neglected this analysis of the probability distribution. [13] proposed Confirm-it, a cognitively inspired beam-search re-ranking strategy for the GuessWhat?! Questioner agent that exploits the probabilities assigned by the Guesser to guide the Question Generator. However, the authors rely on the probabilities of the Guesser without investigating them in detail. So far, there has been no careful study of how the Guesser should distribute the probability over candidates.

Evaluation of Visually Grounded Models Through the study of the probability distributions generated by the model, our work also relates to research on natural language understanding in vision and language (V&L) models. While V&L models are often evaluated by looking at performance on V&L tasks, recent studies have focused on models' understanding of specific linguistic structures [14, 15]. [14] proposed a new framework for investigating a model's fine-grained understanding of linguistic structures in spatial expressions. They used the visual dialogue dataset OneCommon and evaluated the model's understanding through reference resolution. [15] proposed VALSE (Vision and Language Structured Evaluation), a benchmark for evaluating the grounding capabilities of pre-trained V&L models on specific linguistic phenomena: models have to distinguish real captions from foils (captions altered to differ on some linguistic phenomenon). Our work differs from these studies by proposing a metric to evaluate the grounding of the reference set for a single QA pair, rather than focusing on specific linguistic structures or phenomena.

Soft labels There has been a growing interest in soft labels, defined as probability distributions over annotator labels. Soft labels have been shown to help build more robust representations. [16] show that models trained with soft labeling are better at generalizing to unseen data. [17] use soft labels as an auxiliary task in a multi-task learning setting with a main task based on hard labels, and show that this improves performance on the main task. These works build on earlier lines of research that moved the training objective from the most likely label to the full distribution over labels [16]. Alternatives to one-hot encodings have been proposed also for image classification [18, 19, 16]. Most previous works on soft labels aim at improving task accuracy. Instead, we propose to use soft labels as a test-bed to evaluate the trustworthiness of computational models. We conjecture that generating probability distributions inspired by soft labels should come naturally as a by-product of effective model training, without the need to explicitly provide this information at training time. As an upper bound, we also train a model to replicate soft-label distributions using a KL divergence loss [20].

3. Dataset and soft-label annotation

In this section, we describe the data collection procedure carried out on a sample dataset with human annotators. Then, we present how we design the automatic annotation of the GuessWhat?! test set. The human annotation allows us to study human disagreement and verify the reliability of the automatic annotation.

3.1. Human annotation collected for a sample set
From the training set of GuessWhat?! human-human dialogues, we selected a subset of first questions (i.e., questions asked at the first turn of the dialogue), focusing on the most frequent question types, namely category, object, color, and absolute spatial questions. The distribution of such a subset is given in Table 1; we initially started with 25 questions per question type and then expanded the two question types with lower agreement, namely color questions and absolute spatial questions.

Table 1: Composition of the human annotation dataset
  Question type       n
  Category            25
  Object              25
  Color               40
  Absolute Spatial    85

In order to collect humans' judgments, participants were asked to use the website makesense.ai. As illustrated in Figure 2, they were given an image with colored bounding boxes for each candidate object, a list of MS-COCO labels corresponding to such objects, and a QA pair ('Is it at the bottom of the picture?' 'No'). The participants had to select the set of possible target objects for the given QA pair (the reference set, i.e., all the objects that are not considered to be at the bottom of the picture). Participants were also informed that there were no correct or wrong answers. We collected human annotation data for 175 QA pairs, each annotated by three participants. In total, we recruited nine people, most of whom were university students.

Figure 2: An example of the setup seen by human annotators in makesense.ai.

3.2. Automatic soft-label annotation for a large-scale dataset

Starting from the sample of human-annotated data, we designed methods to automatically annotate the first QA pairs containing category and absolute spatial questions. We focus on these two question types because the GuessWhat?! dataset contains, for each candidate object in the image, its category label and its spatial coordinates; we can thus easily exploit these two sources of information for the automatic annotation. We believe similar strategies could be devised for other question types so as to apply rule-based systems that leverage human annotations.

Automatic annotation methods For the category questions, we simply rely on the MS-COCO labels: we divide the probability mass uniformly among the objects in the reference set and assign a probability of 0 to the candidates in the complement set. An example of the resulting probability distribution is given in Figure 3 (left). Here, given the QA pair 'Is it a person?' 'Yes', the probability is uniformly distributed among the two candidate objects (#1 and #3) which are associated with the label 'person'.

Figure 3: Examples of how the rule-based systems assign soft labels for category (left) and spatial (right) questions. Category: the only candidates labelled as "person" are #1 and #3, hence the probability of 1 is divided equally between them. Spatial: the three systems agree in considering candidates #0 and #2 as not being on the left, hence they receive a probability score of 0. The probability of 1 is divided among the other candidates proportionally to the number of rules that consider the object to be on the left, resulting in 0.4286 assigned to #1 and #3 (3 out of 3 rules consider them to be on the left) and 0.1429 assigned to #4 (only 1 out of 3 rules).

For the absolute spatial questions, we designed a rule-based method. We implemented three systems that simulate three different ways of grounding spatial questions: two of them based on the coordinates of the bounding boxes of the candidate objects, and one on the centroid of the bounding box of a given candidate. These three systems are meant to simulate different human interpretations of the same question and are explained in detail in Appendix A.
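The category annotation described above admits a direct implementation. As a hedged illustration, the minimal sketch below (the function name is ours, and we assume each candidate object comes with its MS-COCO category label) makes it concrete before we turn to how the three spatial systems are combined.

```python
def category_soft_labels(candidate_categories, asked_category, answer_is_yes):
    """Uniform soft labels for a first-turn category question.

    candidate_categories: list of MS-COCO labels, one per candidate object.
    asked_category:       the category asked about (e.g. 'person').
    answer_is_yes:        the Oracle's answer.

    Candidates in the reference set share the probability mass uniformly;
    candidates in the complement set receive 0. Assumes at least one
    candidate remains in the reference set.
    """
    in_reference = [
        (cat == asked_category) == answer_is_yes for cat in candidate_categories
    ]
    n_ref = sum(in_reference)
    return [1.0 / n_ref if kept else 0.0 for kept in in_reference]

# e.g. if only candidates #1 and #3 carry the label 'person' (as in Figure 3,
# left, for 'Is it a person?' 'Yes'), they receive 0.5 each and the others 0.
```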
We then combine the three systems so as to generate soft-label probabilities over the candidates. In particular, for each rule system, the objects in the reference set are assigned a 1 and the objects in the complement set are assigned a 0. We then sum the results of the three rules for each candidate object, so that an object that is in the reference set for all three rules receives a 3. We now have a list of numbers from 0 to 3, reflecting how often each candidate object is in the reference set. Finally, we normalize this list by the sum of the counts to turn it into a probability distribution, so that the probability mass is divided among the candidates proportionally to the number of systems that place them in the reference set. Figure 3 (right) shows the soft-label probabilities generated out of the answers given by our three rule-based systems. The three rules agree in considering #0 and #2 not to be on the left, hence their probability is 0, and the probability of 1 is divided among the other candidates proportionally to the number of rules that consider them to be on the left: #1 and #3 receive 0.4286 each, while #4 receives 0.1429.

Training and testing data We extracted the first-turn QA pairs containing either a category or an absolute spatial question from the training and test sets. From the training set, we obtained 59,518 QA pairs containing a category question and 1,274 containing an absolute spatial question. From the test set, we obtained 9,421 QA-image pairs (4,774 Yes and 4,647 No) and 235 (104 Yes and 131 No) for category and absolute spatial questions, respectively. We automatically annotated these first-turn QA pairs in the training and test sets with soft labels using the methods described above.

4. Models and Metrics

Below we describe the models we evaluate throughout the paper. In the setting we consider, models receive as input the dialogues or QA pairs generated by the Amazon Mechanical Turk workers who played the GuessWhat?! games, i.e., the dialogues/QA pairs in the dataset released by [2].

4.1. Models

Baseline We use the model provided by [13] for the GuessWhat?! task, which is based on the GDSE architecture [7]. The images are encoded using a ResNet-152 network [21], with an LSTM being used to encode the dialogue history. From these, a multi-modal shared representation is generated, which is used to train both the Question Generator and the Guesser module in a joint multi-task learning setup. The original model uses a cross-entropy (CE) loss against the ground-truth target object; we experiment with both CE and a Kullback-Leibler (KL) divergence loss [20] against a probability distribution over the candidate objects. We refer to these single-task variants as STL-CE and STL-KL.

Upper-bound We train the GDSE model in a multi-task setting. For the main task, we train the model to predict a probability distribution that is a one-hot encoding of the target object, so that the Guesser is still trained to guess the target object. For the auxiliary task, we train the Guesser to predict a probability distribution for a QA pair by using soft labels as the target probability distribution. In both cases, a KL-divergence loss is used to calculate the distance between the predicted probability distribution and the target probability distribution (a one-hot encoding of the target object for complete dialogues and soft labels for single QA pairs). We derived the soft labels from the rule-based systems described above. We refer to this model as MTL-KL.

LXMERT We compare the models described above with the Guesser model trained by [3], which is based on a multimodal Transformer-based model, LXMERT [10].
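As a hedged illustration of the KL-divergence objective used for STL-KL and MTL-KL, here is a minimal sketch in PyTorch (not the released training code; the tensor shapes, names, and the guesser interface in the usage comment are assumptions). The same objective covers both the one-hot target used for complete dialogues and the soft labels used for single QA pairs.

```python
import torch
import torch.nn.functional as F

def guesser_kl_loss(logits, target_probs):
    """KL-divergence loss between predicted and target distributions.

    logits:       (batch, n_candidates) unnormalized scores from the Guesser.
    target_probs: (batch, n_candidates) target distribution, either a one-hot
                  vector for the ground-truth object or soft labels for a QA pair.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_probs, target_probs, reduction="batchmean")

# Hypothetical usage:
# logits = guesser(image_features, dialogue_encoding)   # assumed interface
# loss = guesser_kl_loss(logits, soft_labels)
```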
4.2. Metrics

Besides looking at the task accuracy of the model in identifying the target object at the end of the dialogue, we propose the following metrics to evaluate to what extent the models are trustworthy. They focus on the probability distributions generated by the models and serve as a litmus test of a model's grounding skills: when a model performs poorly on one or more of these metrics, it calls into question the reliability of the model. The metrics are all based on the automatic soft labels we obtain on the test set with the rule-based systems, which we use to identify the members of the reference and complement sets. As a preliminary step, we calculate Pearson's correlation coefficient between the soft labels obtained from human disagreement and those assigned by the rule-based system on the sample set. Given the high correlation we obtain (see Section 5), we can use the metrics below.

• Correlation with rule-based soft labels: We calculate Pearson's correlation coefficient between the models' probability distributions and the soft labels assigned by the rule-based systems.

• Percentage of well-grounded QA pairs: We define a QA pair to be well-grounded when all candidates in the reference (vs. complement) set have been assigned a probability above (vs. below) a certain threshold θ. Given an image I, a question-answer pair QA, a set of candidate objects O = {o_1, ..., o_n} with corresponding probabilities P(o), a reference set RS ⊂ O and a complement set CS ⊂ O such that RS ∪ CS = O and RS ∩ CS = ∅, we define two criteria for a question-answer pair to be considered well-grounded. Well-grounded against RS: given a threshold θ, we consider a game to be well-grounded if ∀x ∈ RS, P(x) > θ. Well-grounded against CS: given a threshold θ, we consider a game to be well-grounded if ∀x ∈ CS, P(x) < θ.

• Average probability of the complement set: We consider a model to be trustworthy if all candidates in the complement set receive a probability close to zero. To this end, for each game, we calculate the average probability of the candidate entities in the complement set and the standard deviation. The lower this probability, the higher our trust in the model.

5. Experiments

5.1. From human disagreement to automatic soft-labels

Figure 4: Humans almost fully agree on object and category questions, whereas there is some disagreement on color and absolute questions.

Human disagreement on the sample set Figure 4 shows the percentage of games on which human annotators fully agree in identifying the reference set. The comparison of the human annotations of the reference sets reveals that humans highly agree with each other on grounding category and object questions (91% and 92%, respectively), whereas they disagree on 25% of the color questions and 30% of the absolute spatial questions. Figure 5 illustrates examples of spatial questions on which humans disagree. On the left, we can see that human participant 2 also included objects #0 and #5 in the reference set, showing the ambiguity of this kind of spatial question when dealing with objects located towards the centre of the image. In the image on the right, instead, there is another source of ambiguity, related to the word 'middle': annotators seem to interpret this question as referring either to the 'middle' of the image or to the 'middle' of the group of the most salient objects appearing in it. Our aim is to model the disagreement between annotators so as to obtain reliable probability distributions that capture this uncertainty.
Figure 5: Human disagreement. Left: the annotators disagree about the objects located near the middle of the image. Right: human disagreement appears to come from a relational reading of the question.

Quality of the automatic soft-label annotation To evaluate the quality of the rule-based system annotation, we compute Pearson's correlation between the soft labels derived from human disagreement and those produced by the rule-based systems. For category questions, the correlation is almost perfect. This is due to the fact that for these questions (appearing in the first turn of GuessWhat?! dialogues) there is no uncertainty about their interpretation, as also revealed by the high agreement between human annotators and between the automatic annotation based on MS-COCO labels and the human annotation. Pearson's correlation also shows that the ensemble of rule-based systems successfully captures humans' disagreement on absolute spatial questions: its soft labels have a Pearson's correlation of 0.94 with those derived from the human data. These results provide first evidence of the feasibility of the method we propose, namely to implement rule-based systems that simulate human disagreement on a small sample of data and then use such systems to automatically annotate large-scale datasets. Of course, it remains to be seen whether the method can be extended to other question types, and whether having such soft labels for first questions could help models properly assign probabilities throughout the dialogue, incrementally.

Table 2: Test set: task accuracy (TA) obtained by the models when receiving the full human-generated dialogues, Pearson's correlation (r) between the soft labels assigned by the rule-based systems and the models for category (CQ) and spatial (SQ) questions in the first turn of the test set games, and the average probability (prob) assigned to entities in the complement set after the first turn containing a category question, for positively answered (Y) and negatively answered (N) questions.

            TA       r-CQ    r-SQ    prob-Y (SD)     prob-N (SD)
  STL-CE    61.2%    0.80    0.68    0.25% (3.41)    0.82% (4.55)
  STL-KL    60.6%    0.80    0.64    0.28% (4.06)    0.69% (4.63)
  LXMERT    70.2%    0.73    0.66    0.19% (2.20)    0.08% (0.88)
  MTL-KL    60.0%    0.97    0.68    0.19% (3.06)    0.34% (3.99)

5.2. Soft-label evaluation on the test set

In this experiment, we compute the metrics introduced above to evaluate the probability distributions assigned by the models on the test set. In particular, we are interested in assessing the ability of the models to generate probability distributions over candidate objects that effectively mirror the uncertainty of GuessWhat?! games with an incomplete dialogue history.

Pearson's correlation Given the reliability of our proposed rule-based systems in simulating humans' uncertainty/disagreement, we claim they can be used to automatically annotate the GuessWhat?! test set, focusing on the first turns of the games which contain category questions. Table 2 shows the task accuracy of the different models together with their Pearson's correlation (r) with the soft labels described above. We can observe some interesting properties emerging from the interplay between these two sets of metrics. First of all, LXMERT outperforms the other models to a large extent in accurately identifying the target object at the end of the dialogue. The other models (STL and MTL, regardless of the loss function used) show a similar accuracy.
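The exact aggregation behind the reported correlation coefficients is not spelled out above; the sketch below shows one plausible operationalization (concatenating the per-candidate probabilities across games before correlating), using NumPy and SciPy. The function name and the input format are ours.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_with_soft_labels(model_dists, rule_dists):
    """Pearson's r between model probabilities and rule-based soft labels.

    model_dists, rule_dists: lists of 1-D arrays, one per game, each of length
    equal to the number of candidate objects in that game. One plausible way
    to obtain a single coefficient is to concatenate the per-candidate values
    across games before correlating.
    """
    model_flat = np.concatenate([np.asarray(d, dtype=float) for d in model_dists])
    rule_flat = np.concatenate([np.asarray(d, dtype=float) for d in rule_dists])
    r, _p_value = pearsonr(model_flat, rule_flat)
    return r
```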
However, if we look at the ability of the models to generate probability distributions that mirror human uncertainty, we can see that the pre-trained transformer-based model LXMERT is far from the upper-bound MTL-KL model (explicitly trained to replicate these probabilities) but, surprisingly, also below the baseline model. This result shows the importance of our evaluation criterion in shedding light on the behaviour of computational models.

Table 3: Percentage of well-grounded QA pairs in the first turn when it contains a category question or an absolute spatial question, when looking at the complement set (left, with θ = 0.4%) and at the reference set (right, with θ = 0.1%).

  Left: complement set (θ = 0.4%)
            Category          Absolute Spatial
            YES     NO        YES     NO
  STL-CE    96.6    78.2      31.7    12.0
  STL-KL    97.5    80.9      30.8    13.0
  LXMERT    97.9    95.1      57.7    30.5
  MTL-KL    95.6    95.5      -       -

  Right: reference set (θ = 0.1%)
            Category          Absolute Spatial
            YES     NO        YES     NO
  STL-CE    95.88   80.00     76.72   87.50
  STL-KL    93.62   77.26     82.76   78.47
  LXMERT    57.02   47.72     59.48   57.64
  MTL-KL    99.98   99.98     -       -

Well-grounded QA pairs We compare models based on how well they ground the QA pairs by analysing the probabilities assigned to the members of the reference set vs. those assigned to the members of the complement set. To verify whether negatively answered questions are harder to ground, as attested in the literature [3], we report results separately for the Yes- and No-first turns containing category or absolute spatial questions. Table 3 reports the percentage of well-grounded QA pairs for the complement set and the reference set. Given the small number of spatial questions, for MTL-KL we focus only on category questions. Remember that this metric captures the ability of the model to assign a probability below (vs. above) a given threshold to all objects belonging to the complement set (vs. reference set). We distinguish between questions that receive positive or negative answers. Overall, we can see that the proposed MTL-KL model effectively manages to include and exclude objects from the reference/complement set and to assign soft-label probabilities when dealing with uncertainty. Looking at the complement set (Table 3, left), for category questions we can see that all models except STL-CE effectively assign a low probability to objects belonging to the complement set when dealing with positive answers. However, if we look at questions that receive negative answers, only LXMERT and MTL-KL do not show a degradation in performance. The advantage of LXMERT is even more prominent when looking at spatial questions, both for positive and negative answers. Combining these results with the ones reported in Table 2, we can conclude that LXMERT effectively assigns low probabilities to candidate objects belonging to the complement set, but it fails at generating probability distributions that mimic human uncertainty. Table 2 also shows the average probability assigned to entities in the complement set for category questions. We can see that LXMERT and MTL-KL effectively handle category questions with both positive and negative answers, while the single-task models struggle with the latter, assigning a much higher probability to entities in the complement set (we evaluated the models with different θ values and observed that this only affects the results on negatively answered questions). Moving to the reference set, Table 3 (right) reports the corresponding percentage of well-grounded games.
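For concreteness, the two well-grounded criteria of Section 4.2 and the average complement-set probability reported in Table 2 can be computed per game as in the following minimal sketch; the function names and the input format are ours, and the thresholds in the usage comment are the percentages used in Table 3 converted to probabilities.

```python
import numpy as np

def well_grounded_against_cs(probs, complement_set, theta):
    """True if every candidate in the complement set receives P(x) < theta."""
    return all(probs[i] < theta for i in complement_set)

def well_grounded_against_rs(probs, reference_set, theta):
    """True if every candidate in the reference set receives P(x) > theta."""
    return all(probs[i] > theta for i in reference_set)

def avg_complement_probability(probs, complement_set):
    """Average probability assigned to candidates in the complement set."""
    return float(np.mean([probs[i] for i in complement_set]))

# With the thresholds used in Table 3 (given there as percentages):
# well_grounded_against_cs(probs, cs, theta=0.004)  # 0.4%
# well_grounded_against_rs(probs, rs, theta=0.001)  # 0.1%
```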
LXMERT is by far the worst model at assigning a probability above a threshold as low as 0.1% to all objects belonging to the reference set, while MTL-KL reaches an almost perfect performance on category questions. Despite the high task accuracy reached by the model, LXMERT is shown to lack the ability to generate trustworthy probabilities.

6. Conclusion

Classification tasks represent the core component of most AI systems. Assigning reliable probability distributions across labels represents a crucial step toward building trustworthy systems. In our work, we take a referential visual dialogue task, GuessWhat?!, as a test bed and scrutinize the probability distributions assigned by different models to the set of possible referents. We experimented with just the QA pair in the first turn of the dialogue. We focused our attention on category and spatial questions and designed a set of heuristics that mimic the human annotation we collected for these question types. While category questions show a high agreement between human annotators, spatial questions show an intrinsic uncertainty and lead to human disagreement. Our approach of combining three different rule-based systems effectively takes human annotators' disagreement into account when generating probability distributions over candidate objects. The well-grounded metric we propose shows that computational models struggle to generate reliable probability distributions regardless of their architecture or pre-training regime. In line with previous work, we show that models perform worse when dealing with negative answers to polar questions. As an upper bound, we used a model trained to replicate these probabilities using a KL divergence loss. We show that the model that performs best in task accuracy (LXMERT, a pre-trained transformer-based model) does not generate trustworthy probabilities over candidate objects. Our work shows the importance of scrutinizing the ability of computational models to assign reliable probability distributions at inference time.

Acknowledgments

This work was supported by the University of Trento. The first author is a student of the Erasmus Mundus European Masters Program in Language and Communication Technologies. We also thank the anonymous reviewers for their valuable comments.

References

[1] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
[2] H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, A. Courville, GuessWhat?! Visual object discovery through multi-modal dialogue, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5503–5512.
[3] A. Testoni, C. Greco, R. Bernardi, Artificial intelligence models do not ground negation, humans do. GuessWhat?! dialogues as a case study, Frontiers in Big Data 4 (2021).
[4] G. M. Elshamy, M. Alfonse, M. M. Aref, A GuessWhat?! game for goal-oriented visual dialog: A survey, in: 2021 Tenth International Conference on Intelligent Computing and Information Systems (ICICIS), IEEE, 2021, pp. 116–123.
[5] F. Strub, H. De Vries, J. Mary, B. Piot, A. Courville, O. Pietquin, End-to-end optimization of goal-driven and visually grounded dialogue systems, arXiv preprint arXiv:1703.05423 (2017).
[6] R. Zhao, V. Tresp, Learning goal-oriented visual dialog via tempered policy gradient, in: 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2018, pp. 868–875.
[7] R. Shekhar, A. Venkatesh, T. Baumgärtner, E. Bruni, B. Plank, R. Bernardi, R. Fernández, Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat, arXiv preprint arXiv:1809.03408 (2018).
[8] T. Tu, Q. Ping, G. Thattai, G. Tur, P. Natarajan, Learning better visual dialog agents with pretrained visual-linguistic representation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5622–5631.
[9] A. Testoni, C. Greco, T. Bianchi, M. Mazuecos, A. Marcante, L. Benotti, R. Bernardi, They are not all alike: Answering different spatial questions requires different grounding strategies, in: Proceedings of the Third International Workshop on Spatial Language Understanding, 2020, pp. 29–38.
[10] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[11] C. Greco, A. Testoni, R. Bernardi, Grounding dialogue history: Strengths and weaknesses of pre-trained transformers, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2020, pp. 263–279.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[13] A. Testoni, R. Bernardi, Looking for confirmations: An effective and human-like visual dialogue strategy, arXiv preprint arXiv:2109.05312 (2021).
[14] T. Udagawa, T. Yamazaki, A. Aizawa, A linguistic analysis of visually grounded dialogues based on spatial expressions, arXiv preprint arXiv:2010.03127 (2020).
[15] L. Parcalabescu, M. Cafagna, L. Muradjan, A. Frank, I. Calixto, A. Gatt, VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena, arXiv preprint arXiv:2112.07566 (2021).
[16] J. C. Peterson, R. M. Battleday, T. L. Griffiths, O. Russakovsky, Human uncertainty makes classification more robust, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9617–9626.
[17] T. Fornaciari, A. Uma, S. Paun, B. Plank, D. Hovy, M. Poesio, Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning, in: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[19] R. A. Krishna, K. Hata, S. Chen, J. Kravitz, D. A. Shamma, L. Fei-Fei, M. S. Bernstein, Embracing error to enable rapid crowdsourcing, in: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016, pp. 3167–3179.
[20] S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1951) 79–86.
[21] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. URL: https://doi.org/10.1109/CVPR.2016.90. doi:10.1109/CVPR.2016.90.
A. Rule-Based Systems

We designed a rule-based method for the automatic annotation of absolute spatial questions. We implemented three systems that simulate three different ways of grounding spatial questions: two of them based on the coordinates of the bounding boxes of the candidate objects, and one on the centroid of the bounding box of a given candidate. Each system defines rules for different types of absolute spatial questions, specifically those related to left/right, top/bottom, the middle, and quadrants of the image (e.g., top right). The systems differ in how they assign labels to areas of the image which showed larger human disagreement, so that, when combined, they reflect different human interpretations of the same question.

Rule-Based System 1 (R1) The first rule-based system is relatively simple and strict. It is based on the intuition that (nearly) the entire object should be in a region of space for it to be considered in that region. The rules and their implications for the candidate answers are illustrated in Figure 6.

Figure 6: Illustration of the R1 rules for left/right, top/bottom, and quadrant questions, respectively.

Left and right questions: A candidate object is considered to be on the left of the image when more than 80% of its bounding box is in the left 50% of the image. The inverse holds for the right.

Top and bottom questions: A candidate object is considered to be in the top (half) of the image when more than 80% of its bounding box is in the top 50% of the image. The inverse holds for the bottom half.

Middle and center questions: A candidate object is considered to be in the middle or center when, in both the horizontal and the vertical direction, at least 50% of its bounding box is inside the central 50% of the image.

Quadrants: All candidate objects whose entire bounding box is in the top left quadrant of the image are considered to be in the top left of the image. The same holds for the other quadrants.

Rule-Based System 2 (R2) The second rule-based system for assigning candidate answers is more fine-grained than the first one and again uses bounding boxes. It is simultaneously more lenient when it comes to large objects sticking out into other regions of space, and stricter when it comes to small objects near the middle of the image. This is because humans disagree more on objects near the middle for left/right/top/bottom questions. Hypothetical examples of the rules and their implications can be found in Figure 7 and Figure 8.

Left and right questions: Unlike in R1, we make a distinction between the 'left half' or 'left picture' and the more generic 'left'. These more specific questions make up 9.6% and 5.2% of the left and right questions, respectively, and there is some indication in the human disagreement that they are not always treated the same as generic questions. The intuition is that when the question specifically asks about the left half, small objects close to the middle may still count as 'left' as long as they are in the left 50%, while for the generic 'is it on the left?', the middle may already be considered 'not left'.
To determine whether an object is on the left or in the left half of an image, we consider the following two conditions:

1. Two-thirds of the bounding box must be in the left 50% of the image.
2. Some part of the bounding box must be in the leftmost 40% of the image.

A candidate object is in the left half of the image when the first condition is met. It is considered to be on the left when both conditions are met. The same rule applies in the inverse for questions about the right.

Top and bottom questions: A candidate object is considered to be in the top half if at least 75% of its bounding box is in the top half of the image. The inverse holds for bottom questions.

Figure 7: Illustration of the R2 rules for left/right and top/bottom questions.

Middle and center questions: A candidate object is considered to be in the middle of the image if the entire bounding box is in the middle 50% of the image either horizontally or vertically.

Figure 8: Illustration of the R2 rules for middle and center questions.

Quadrants: Lastly, the rule for quadrants such as "top left" is the same as in the first rule-based system: all candidate objects whose entire bounding box is in the top left quadrant of the image are considered to be in the top left of the image.

Rule-Based System 3 (R3) The third rule-based system for assigning candidate answers is based entirely on the centroid of the bounding box. The rules are visualized using hypothetical examples in Figures 9 and 10.

Left and right questions: A candidate object is considered to be on the left when the centroid of its bounding box is in the left 50% of the image. The inverse holds for right questions.

Figure 9: Illustration of the R3 rules for left/right and top/bottom questions, respectively.

Top and bottom questions: A candidate object is considered to be in the top part when the centroid of its bounding box is in the upper 50% of the image. The inverse holds for bottom questions.

Middle and center questions: A candidate object is considered to be in the middle of the image when the centroid of its bounding box is in the middle 50% of the image both horizontally and vertically.

Figure 10: Illustration of the R3 rules for quadrant and middle/center questions, respectively.

Quadrants: A candidate object is considered to be in the top left when the centroid of its bounding box is in the left 50% horizontally and in the top 50% vertically. The same holds for the other quadrants.

Accuracy of the rules In order to assess the accuracy of the annotations assigned by the rules on a larger dataset containing unseen data, we calculate their accuracy in assigning a candidate answer to the target object (available from the GuessWhat?! dataset). For rule-based system 1, the rules lead to correct candidate answers for the target object in 90.5% of the absolute spatial questions in the training set. The accuracy for R2 and R3 is 92.6% and 89.7%, respectively. As we designed these systems to simulate the variation in human interpretation of questions, we also investigate the union of the systems. The accuracy of the union of R1 and R2 in assigning the target object is 94.9%, that of R1 and R3 is 93.9%, and that of R2 and R3 is 92.2%.
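As a compact illustration of how the three systems feed the soft labels of Section 3.2, the sketch below implements the R3 left rule (centroid-based) and the normalization of the per-candidate rule counts; R1 and R2 would plug in as additional boolean predicates. The bounding-box format (x, y, width, height in pixels) is an assumption about the MS-COCO-style annotations, and the function names are ours.

```python
def r3_is_left(bbox, image_width):
    """R3: an object counts as 'left' if the centroid of its bounding box
    falls in the left 50% of the image. bbox = (x, y, w, h) in pixels."""
    x, _y, w, _h = bbox
    centroid_x = x + w / 2.0
    return centroid_x < image_width / 2.0

def combine_rules(rule_votes):
    """Turn per-candidate rule votes into soft labels.

    rule_votes: list of per-candidate counts in 0..3, i.e. how many of the
    three systems place the candidate in the reference set. Counts are
    normalized by their sum, so candidates excluded by all systems get 0.
    Assumes at least one system keeps at least one candidate.
    """
    total = float(sum(rule_votes))
    return [v / total for v in rule_votes]

# Figure 3 (right): votes [0, 3, 0, 3, 1] yield
# [0.0, 0.4286, 0.0, 0.4286, 0.1429] (up to rounding).
```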