<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akshar Tumu</string-name>
          <email>atumu@ucsd.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parisa Kordjamshidi</string-name>
          <email>kordjams@msu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Michigan State University</institution>
          ,
          <addr-line>East Lansing, MI</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California San Diego</institution>
          ,
          <addr-line>9500 Gilman Dr, La Jolla, CA, 92093</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. Current analysis works use image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.</p>
      </abstract>
      <kwd-group>
        <kwd>Spatial Reasoning</kwd>
        <kwd>Vision-language models (VLMs)</kwd>
        <kwd>Referring Expression Comprehension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Vision-language model (VLM) research has boomed in the recent past, owing to the enhanced user
interaction and accessibility they provide. Models such as GPT-4o, LLaVA [
        <xref ref-type="bibr" rid="ref1 ref41">1</xref>
        ], and Google Gemini [
        <xref ref-type="bibr" rid="ref2 ref42">2</xref>
        ]
have become adept at solving vision-language tasks such as Visual Question Answering (VQA), Image
Captioning, and more. However, VLMs still lack human-level ‘Spatial Reasoning’ capabilities [
        <xref ref-type="bibr" rid="ref3 ref4 ref43 ref5">3, 4,
5</xref>
        ]. Spatial reasoning involves comprehending relations that depict the absolute/relative position or
orientation of an object, such as ‘left’, ‘above’, or ‘near’. Inaccurate spatial reasoning by VLMs can lead
to serious consequences in embodied AI domains such as autonomous driving and surgical robotics.
A focused analysis of VLMs’ spatial reasoning capabilities can help identify and address potential
reasoning issues.
      </p>
      <p>Most of the previous works confine their analysis to testing which models work well for spatial
relations. We go further to analyze the comparative performance of these models for spatial categories
that represent different orientational and positional relations between objects. A novel aspect of our
work is the analysis of the effect of varying spatial composition (number of spatial relations) in the
expressions on the performance of the models.</p>
      <p>Previous works focused on spatial analysis with image captioning-related tasks, thus failing to locate
the source of error in the presence of visual and linguistic ambiguity. To avoid this, we adopt the
Referring Expression Comprehension (REC) task for our analysis. REC models output bounding
boxes around the target entity based on a natural language expression, the analysis of which could
reveal the parts of the input that the models fail to comprehend. Comprehension accuracy (or simply,
accuracy) is a common metric for this task; it captures how often a model correctly outputs the bounding
box around the target entity.</p>
      <p>Figure 1 examples: (a) the white napkin that is wrapped around the hot dog; (b) the white box that is around the mirror; (c) the brown table that is to the left of the black cell phone; (d) the sandy shore that is near the murky water; (e) the baseball player that is to the left of the black helmet and to the right of the home plate; (f) the large branch that is to the right of the log that is behind the large bear; (g) the black monitor that is to the left of the keyboard or on the desk; (h) the blanket that is not green and that is not on the bed; (i) the fence that is not black and that is not to the left of the man.</p>
      <p>
        For our analysis, we use the CopsRef dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is a complex dataset with visual ambiguity
and multiple spatial relations in expressions. We focus our analysis on 51 spatial relations, categorized
into 8 categories.
      </p>
      <p>
        We test two popular VLMs - LLaVA [
        <xref ref-type="bibr" rid="ref1 ref41">1</xref>
        ] and Grounding DINO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We also include ‘MGA-Net’ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a
model specifically designed for the REC task. The chosen models offer diversity in the evaluation as they
differ in their architectural elements, training strategies, and input formats. We further compare these
models with an object detector baseline to test if the images are truly complex and require elaborate
referring expressions to ground the correct object.
      </p>
      <p>
        Some of our important findings are as follows:
(1) Referring expressions that include spatial relations, in addition to object attributes, result in higher
accuracy on the REC task compared to expressions with only attributes. (2) Increasing the spatial
complexity (number of spatial relations) of an expression affects the performance of the VLMs, but models
with explicit compositional learning components maintain their performance. (3) Expressions involving
dynamic spatial relations yield low accuracy across all models, indicating the difficulty of modelling
these relations. (4) The task-specific trained models achieve higher accuracy for expressions with
geometric spatial relations (e.g., left of, right of), while the VLMs show relatively better accuracy for
expressions having ambiguous relations such as proximity. (5) The models fail to recognize negated
spatial relations in referring expressions in multiple instances, though the extent of this failure varies
across models.
      </p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Work</title>
      <p>
        Previous works have conducted a broad analysis of the ability of VLMs to perform multimodal perception
and reasoning tasks, such as spatial reasoning, multimodal conversation, etc. Many comprehensive
real-world benchmarks have been introduced to test multiple VLM capabilities [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref41">1, 10, 11, 12, 13, 14</xref>
        ].
      </p>
      <p>
        Some works [
        <xref ref-type="bibr" rid="ref15 ref4">4, 15</xref>
        ] focus solely on spatial analysis of VLMs. SpatialEval benchmark [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] goes a step
further to analyze the role of each modality in spatial reasoning. However, these works do not analyze
the factors that affect the spatial reasoning ability of the VLMs. Another class of works performs
a category-wise analysis of spatial relations, either based on their spatial properties [
        <xref ref-type="bibr" rid="ref17 ref3 ref43">3, 17</xref>
        ] or their
linguistic properties and complexity [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Differing from these works, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] analyzes the effects of spatial
biases in the datasets for the REC task performance. Spatial analysis has also been approached through
the use of text-only questions to probe pre-trained LLMs or VLMs [
        <xref ref-type="bibr" rid="ref19 ref20 ref21">19, 20, 21</xref>
        ]. However, these methods
do not evaluate spatial grounding within the visual modality, which is a crucial aspect of our work.
      </p>
      <p>
        Other closely aligned work includes Embodied Spatial Analysis, which focuses on the effects of
different perspectives and non-verbal cues on the spatial reasoning capabilities of VLMs [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
      </p>
      <p>
        Task Complexity and Interpretability. The works mentioned previously use image-caption
agreement as their evaluation task. Due to the inherent limitations of this task, these works simplified the
expressions to have only 2 objects and 1 spatial relation. To improve the interpretability of model
output, synthetic datasets have been used instead of real-world images [
        <xref ref-type="bibr" rid="ref18 ref24 ref4">4, 18, 24</xref>
        ]. However, this simplifies
the problem due to bounded expressivity (limited number of objects, attributes, and spatial relations).
In our case, REC models output bounding boxes around the target objects. Analyzing the position and
characteristics of the output object helps identify the parts of the input that the models fail to process.
This enables comparative analysis of expressions with 0, 1, or more spatial relations, a unique feature of
our work. The REC task also enables us to test the models over images of different visual complexities
(single or multiple instances of objects in an image).
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
    </sec>
    <sec id="sec-3">
      <title>4. Approach</title>
      <sec id="sec-3-1">
        <p>In our analysis, we seek to answer the following research questions:</p>
        <p>RQ1. Which spatial relation categories result in low accuracy for REC models?
RQ2. How do different model characteristics/architectures influence the REC task accuracy for certain spatial relation categories compared to the others?
RQ3. Does the inclusion of spatial relations increase or decrease the accuracy of REC models?
RQ4. How does the number of spatial relations in the expressions affect the accuracy across different types of models?
RQ5. Do the REC models accurately recognize negated spatial relations in expressions?</p>
        <p>To answer these questions, we explain our research methodology and the designed experiments in
this section.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Models Description</title>
          <p>We select three distinct models for our analysis such that they differ in key components like architecture,
pre-training tasks, and input formats.</p>
          <p>
            MGA-Net. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] It is an REC task-specific model whose compositional learning architecture was
designed to handle complex expressions. It decomposes a query using the soft attention mechanism
and processes visual and linguistic information using dedicated modules to construct a relational graph
among objects. Then, it uses a Gated Graph Neural Network to perform multi-step reasoning over the
referring expression. We first implement the Faster-RCNN model [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] to procure object proposals. Then,
we generate vector representations for these object proposals using a pre-trained ResNet-101 model.
Considering the available computing resources, we omit the fourth (topmost) layer of the ResNet-101
model to obtain a Partial CNN backbone. Finally, we train the model for ten epochs, a limit
imposed by our computational constraints.
          </p>
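          <p>As an illustration of this pipeline, the sketch below extracts proposal features with a partial ResNet-101 backbone (layer4 omitted). It assumes torchvision's off-the-shelf Faster R-CNN and omits the exact preprocessing and proposal settings used for MGA-Net.</p>
          <preformat>
# Sketch: features for externally detected object proposals using a partial
# ResNet-101 backbone (conv1..layer3, with layer4 dropped), as in Section 4.1.
# Detector choice and preprocessing are illustrative; normalization is omitted.
import torch
from torchvision.models import resnet101
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import roi_align

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()   # Faster R-CNN proposals
resnet = resnet101(weights="DEFAULT").eval()
# Partial CNN backbone: keep everything up to layer3, drop layer4 and the classifier.
partial_backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

@torch.no_grad()
def proposal_features(image):                      # image: float tensor [3, H, W] in [0, 1]
    boxes = detector([image])[0]["boxes"]          # detected boxes in (x1, y1, x2, y2)
    fmap = partial_backbone(image.unsqueeze(0))    # 1024-channel map at 1/16 resolution
    pooled = roi_align(fmap, [boxes], output_size=7, spatial_scale=1.0 / 16)
    return boxes, pooled.flatten(1)                # one feature vector per object proposal
          </preformat>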
          <p>
            Grounding DINO. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] It is an open-set object detector VLM with language support. It has a vision
and a language backbone whose outputs are fused at multiple levels. Its contrastive loss for grounded
pre-training makes it suitable for the REC task. We use the Swin-B vision backbone and the CLIP-text
encoder for the language backbone. We filter all bounding box detections for an expression using their
output labels to see which detections match the target entity. Then we select the detection with the
highest confidence score.
          </p>
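          <p>The post-processing we apply to the detections can be summarized by the following sketch; the detection tuples and the substring-based label matching are simplifying assumptions rather than Grounding DINO's actual output format.</p>
          <preformat>
# Sketch: choose one box per expression from Grounding DINO's detections.
# Each detection is assumed to be a (label_phrase, confidence, box) tuple.
def select_box(detections, target_entity):
    # Keep only the detections whose output label matches the target entity.
    matches = [d for d in detections if target_entity.lower() in d[0].lower()]
    if not matches:
        return None                      # nothing grounded the target entity
    # Among the matches, return the box with the highest confidence score.
    return max(matches, key=lambda d: d[1])[2]

# Example: "The brown table that is to the left of the black cell phone"
detections = [("black cell phone", 0.81, [410, 220, 470, 300]),
              ("brown table", 0.66, [30, 250, 380, 460]),
              ("brown table", 0.74, [20, 240, 390, 470])]
print(select_box(detections, "brown table"))       # [20, 240, 390, 470]
          </preformat>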
          <p>
            LLaVA. [
            <xref ref-type="bibr" rid="ref1 ref41">1</xref>
            ] It is a general-purpose VLM that connects an open-set vision encoder from CLIP [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] with a
language decoder. The model is trained end-to-end, which involves visual instruction tuning for aligning
the vision and language modalities. We test LLaVA with a Short prompt: (USER: &lt;image&gt;\n Give
the bounding box for: ”Referring Expression”\nASSISTANT:) and a Long prompt: (USER: &lt;image&gt;\n
Provide the bounding box coordinates for the object described by the referring expression: ”Referring
Expression”\n ASSISTANT:). Both prompts have a similar structure, but the second prompt is longer.</p>
          <p>
            OWL-ViT. [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ] It is an object detector baseline that only takes the target object’s label as the input
instead of the entire referring expression. It is an open-set object detector, which is required because
CopsRef expressions involve entities from the Visual Genome [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] Scene Graphs, which contain entities
absent from the common datasets used to train popular closed-set detectors such as YOLO [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ]. It also has a simple
architecture with a Vision transformer and CLIP for aligning images and labels in a zero-shot manner,
making it an ideal baseline.
          </p>
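          <p>For reference, the two LLaVA prompts described above can be expressed as simple templates; the sketch below uses the prompt strings from Section 4.1 verbatim.</p>
          <preformat>
# The two prompt templates used for LLaVA (Section 4.1), filled per expression.
SHORT_PROMPT = 'USER: &lt;image&gt;\n Give the bounding box for: "{expr}"\nASSISTANT:'
LONG_PROMPT = ('USER: &lt;image&gt;\n Provide the bounding box coordinates for the object '
               'described by the referring expression: "{expr}"\n ASSISTANT:')

expression = "The sandy shore that is near the murky water"
print(SHORT_PROMPT.format(expr=expression))
          </preformat>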
          <p>Model Differences. A key difference in the three main models can be seen in their input format.
While Grounding DINO and LLaVA take the entire image as the input and perform bounding box
regression to get object proposals, MGA-Net directly takes the externally detected bounding boxes
as the input. Grounding DINO and LLaVA also have similarities in their architectures, as they both
have vision and language backbones that are fed the entire image and text inputs. This is unlike
MGA-Net, which has dedicated transformer architecture modules for visual, linguistic, and relative
location components. However, Grounding DINO and MGA-Net show similarities in having grounded
training/pre-training tasks, while LLaVA only has general multimodal pre-training.</p>
          <p>
            In addition to these three models, we also experimented with InstructBLIP [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] and OpenFlamingo
[
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] models for the REC task. These models are general-purpose VLMs. While InstructBLIP works
in the zero-shot mode, OpenFlamingo functions in the few-shot mode. Neither model could
provide meaningful outputs for the task. The outputs of these two VLMs are discussed in more
detail in Appendix B.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.2. Experimental Setting and Evaluation</title>
          <p>We create the following dataset test splits for evaluating the models and answering the research
questions RQ1-RQ5 mentioned earlier.</p>
          <sec id="sec-3-1-2-1">
            <title>4.2.1. Fine-grained Spatial Relations Split</title>
            <p>In the test dataset, we split the expressions with one spatial relation using the categories shown in Table 2.
Using the categories from Table 3, we split the remaining expressions based on the number of spatial
relations they contain. Then, we rank the models based on their accuracy for each category.</p>
            <p>To compare the models’ performances across the categories, we employ a statistical test known as the
Kendall Tau Independence Test. It evaluates the degree of similarity between two sets of ranks given
to the same set of objects. We calculate the Kendall rank coefficient (τ), which yields the correlation
between two ranked lists. Given the τ value, we calculate the z statistic, which follows a standard normal
distribution, as:
z = 3τ√(n(n − 1)) / √(2(2n + 5)), (1)
where n is the number of ranked categories.</p>
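            <p>A minimal sketch of this test on two hypothetical rank lists, using scipy for τ and Equation (1) for the z statistic:</p>
            <preformat>
# Sketch: Kendall rank correlation between two category-wise rank lists and the
# z statistic of Eq. (1). The two rank lists below are hypothetical, not our results.
from math import sqrt
from scipy.stats import kendalltau, norm

ranks_model_a = [1, 2, 3, 4, 5, 6, 7, 8]      # ranks over the 8 spatial categories
ranks_model_b = [2, 1, 4, 3, 6, 5, 8, 7]

tau, _ = kendalltau(ranks_model_a, ranks_model_b)
n = len(ranks_model_a)
z = 3 * tau * sqrt(n * (n - 1)) / sqrt(2 * (2 * n + 5))    # Eq. (1)
p_value = 2 * (1 - norm.cdf(abs(z)))                       # two-tailed p-value
print(tau, z, p_value)          # reject the null hypothesis when p_value is below 0.05
            </preformat>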
            <p>Using a two-tailed test at the 0.05 level of significance, we test the following: Null hypothesis: there is
no correlation between the two ranked lists. Alternative hypothesis: there is a correlation between
the two ranked lists.</p>
          </sec>
          <sec id="sec-3-1-2-2">
            <title>4.2.2. Visual Complexity Split</title>
            <p>To observe the effect of visual complexity on model performance, we split the test dataset into two
parts. The first part has images that have multiple instances of one or more objects mentioned in the
associated referring expressions. The second part has images with at most one instance of every object
mentioned in the expression. We perform this splitting by first collecting the entities in each expression
using spaCy (https://spacy.io/) and then employing Grounding DINO to find the number of instances in the image for
each of the collected entities, as sketched below.</p>
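            <p>The sketch below illustrates this splitting procedure; extracting entities via spaCy noun chunks and the count_instances stand-in for Grounding DINO are simplifications of our actual implementation.</p>
            <preformat>
# Sketch: flag an image as multi-instance for a given referring expression.
# Entities are approximated by spaCy noun chunks; count_instances is a stand-in
# for running Grounding DINO on the image with one entity label at a time.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(expression):
    # Noun chunks without their determiners, e.g. "the black helmet" -> "black helmet".
    return [" ".join(tok.text for tok in chunk if tok.dep_ != "det").lower()
            for chunk in nlp(expression).noun_chunks]

def is_multi_instance(image, expression, count_instances):
    # count_instances(image, entity) -> number of detected instances of that entity
    return any(count_instances(image, entity) >= 2
               for entity in extract_entities(expression))
            </preformat>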
          </sec>
          <sec id="sec-3-1-2-3">
            <title>4.2.3. Negation Analysis Split</title>
            <p>In our analysis, we found that models have difficulties in grounding spatial expressions with negations.
Therefore, we created a test split for a more accurate evaluation and a deeper analysis of negated spatial
expressions. We collected expressions that include the keyword ‘not’ and divided them into two sets
according to the number of occurring negations (1 or 2). Then, we collected those expressions for which
all three models give an IoU of less than 0.5. For each expression, we perform a qualitative analysis to
verify whether the errors are due to misinterpreting the negations or a conflation of other errors. We
limit our analysis to the results from the first run of the three models to facilitate the instance-wise
analysis.</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>Hardware. For Grounding DINO, LLaVA, and the OWL-ViT baseline, we use the T4 GPU provided
by Google Colaboratory (https://colab.research.google.com/) for inference. For MGA-Net, we use the NVIDIA GeForce GTX 1650 GPU for
training. We run each model three times (both training and testing for MGA-Net, and inference for the
VLMs and the baseline) to account for variance in our results.</p>
      <p>
        Evaluation Metrics. We evaluate the models using the Intersection over Union (IoU) metric.
Following previous works [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ], we consider the output as a correct comprehension if the IoU is greater
than 0.5. We calculate the comprehension accuracy (referred to as accuracy) as the fraction of data points
that have an IoU &gt;0.5.
      </p>
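      <p>For clarity, a minimal sketch of the IoU computation and the resulting comprehension accuracy, assuming boxes in (x1, y1, x2, y2) pixel coordinates:</p>
      <preformat>
# Sketch: IoU between a predicted and a ground-truth box, and comprehension accuracy.
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); the intersection is empty when the boxes do not overlap.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def comprehension_accuracy(predictions, ground_truths, threshold=0.5):
    # Fraction of data points whose predicted box exceeds the IoU threshold.
    correct = sum(1 for p, g in zip(predictions, ground_truths) if iou(p, g) > threshold)
    return correct / len(ground_truths)
      </preformat>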
      <sec id="sec-4-1">
        <title>5.1. Evaluation on Referring Expressions</title>
        <p>From Table 4, we can observe that Grounding DINO and MGA-Net outperform the OWL-ViT baseline,
with the former achieving the highest accuracy in grounding the referring expressions. We also tried
training MGA-Net with the full ResNet-101 visual backbone (Full CNN) instead of the
partial backbone (Partial CNN). We could only train this model for four epochs due to computational
constraints. Nevertheless, the model crossed 60% test accuracy in just four epochs, with accuracy still
increasing monotonically. This suggests that MGA-Net could provide better performance given adequate
computational resources. To avoid unfair comparisons due to the training discrepancies, we focus our
results on the relative performance of each model across different spatial relation categories rather
than comparing absolute performances.</p>
        <p>For LLaVA, we used the prompts explained in Section 4.1. The shorter prompt gave a slightly better
accuracy than the longer prompt. Hence, we used the shorter prompt for further experiments. The
accuracy of LLaVA is lower than that of both the other models and the baseline. Possible reasons are the lack
of bounding box regression and of visual grounding instructions during pre-training.</p>
        <p>Since we trained/tested each model for three runs, we report the average accuracy of the three runs
and the standard deviation in the table. Because we re-train MGA-Net for each of these runs, there is a
noticeable difference in its predictions across runs, leading to a slightly higher standard deviation.
In contrast, we test the VLMs and the baseline zero-shot, leading to zero or near-zero standard deviation
in their accuracies. The same holds for the subsequent result tables.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Evaluation on Fine-grained Relations</title>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Impact of Multiple Spatial Relations</title>
        <p>Table 6 shows the Kendall Tau Independence test results for the three pairs of models. We can observe
that while the category-wise ranks of the two VLMs (Grounding DINO and LLaVA) are correlated,
MGA-Net’s ranks are not correlated with them. This motivates us to study the possible reasons behind the
difference in the category-wise performances of MGA-Net and the VLMs.</p>
        <p>Between MGA-Net and the VLMs, the major difference occurs with the Proximity
and Projective categories. To answer RQ2, we can observe that the ‘Proximity’ category ranks third for
both the VLMs but 8th for MGA-Net. On the other hand, ‘Projective’ has a higher rank for MGA-Net
than both VLMs. We can see that MGA-Net prefers geometric spatial relations like left of, on top of, etc.,
as it takes the relative locations of bounding boxes as input, which helps represent such relations. On
the other hand, the two VLMs have a better ranking than MGA-Net for ambiguous relation categories
that do not specify a clear distance or geometric direction (e.g., by, close to). This is because the vision
backbones of the VLMs utilize the entire image and help capture relations between a region in the
image and its surrounding regions, unlike MGA-Net, which only receives the detected bounding boxes
as input.</p>
        <p>To study further differences between MGA-Net and the VLMs, we present Table 7, which shows the
performance of the three models and the OWL-ViT baseline for expressions having different numbers
of spatial relations. We observe that VLMs perform considerably better for expressions with 0/1 spatial
relations compared to expressions with 2/3 spatial relations. This indicates that VLMs find it comparatively
difficult to ground multiple spatial relations. However, MGA-Net takes advantage of its compositional
learning architecture to handle multi-step reasoning, resulting in a similar performance for all categories.</p>
        <p>An interesting observation is that the performance of the baseline considerably drops for the ‘Two’
and ‘Three’ categories, even though the spatial relations aren’t being passed as input to the baseline.
The reason might be that 41.4% of these images have multiple instances of objects, the impact of which
is explained in the next section.</p>
        <p>From Table 7, we can also compare the performance of the models for expressions with none and
one spatial relation. We observe that LLaVA performs better for the former and MGA-Net for the latter.
Grounding DINO gives a similar performance for both.</p>
        <p>Now, to answer RQ3, we observe in Table 5 that among the seven categories of single spatial relations,
MGA-Net and Grounding DINO perform better for five of those compared to expressions with no spatial
relations. LLaVA also performs better for four such categories. Thus, we can conclude that in a setup
involving visual and linguistic ambiguity (such as ours), spatial relations along with visual attributes
often aid the models in grounding the expressions, compared to the attributes alone. This is also
reinforced by the results of the baseline. From Table 7, we can observe that while the baseline gives
the second-best performance for expressions with no spatial relations, it drops to the third place for
expressions with one spatial relation, with a 7.3% reduction in performance. This is because the baseline
doesn’t have access to the spatial relations.</p>
        <p>Finally, Table 7 helps us answer RQ4 as it shows the effect of increasing spatial relations on the
performance of MGA-Net versus the VLMs (as discussed before).</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.4. Impact of Visual Complexity</title>
        <p>Out of 12586 test data points, we found that in the images of 4730 data points, there are multiple
instances of objects mentioned in the referring expressions. Table 8 shows the accuracies of the three
models and the OWL-ViT baseline for images with a single instance (‘Accuracy Single’ column) and
multiple instances (‘Accuracy Multi’ column). The models perform better for the single instance images
by 5.4% on average compared to the multi-instance images. The 8.4% performance drop of the baseline
for multi-instance images proves that the images are indeed complex and require more than just the
label as the input for grounding the right object. However, the 7.3% performance drop of LLaVA, as
compared to MGA-Net and Grounding DINO, shows that grounded pre-training also plays a crucial
role in helping the models ground the right object instance in multi-instance images.</p>
      </sec>
      <sec id="sec-4-5">
        <title>5.5. Impact of Negation</title>
        <p>
          We obtained 36 expressions with 1 ‘not’ and 73 expressions with 2 ‘not’s for which all models
gave incorrect predictions. Table 9 shows the total number of expressions we obtained with 1 and 2
negations. The ‘Total failure’ row gives the number of instances for which models failed to recognize at
least 1 negation. We can observe that Grounding DINO has the highest number of failure instances.
LLaVA performs better than Grounding DINO, possibly due to the Vicuna [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] language backbone, as
it has a better language understanding (including negations) compared to Grounding DINO’s CLIP
text encoder. MGA-Net outperforms Grounding DINO since its training involves expressions with
negations, increasing its ability to comprehend negations during testing. Hence, to answer RQ5, we
observe that while all REC models face issues with recognizing negations, certain model characteristics
and training paradigms might reduce the failure cases when expressions contain negations.
        </p>
        <p>Another interesting observation concerns the outputs of MGA-Net and LLaVA when they are
close to the target object. From Table 10, we can see that while LLaVA has a better precision in such
cases, MGA-Net has a better recall.</p>
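        <p>One plausible way to read these box-level precision and recall values is as area-based overlap ratios; the sketch below shows this interpretation, which is our assumption rather than a definition stated with Table 10.</p>
        <preformat>
# Sketch: area-based precision and recall of a predicted box against the target box.
# precision = overlap / predicted area (low when the prediction covers excess area);
# recall    = overlap / target area    (low when the prediction covers the target only partially).
def box_precision_recall(pred, target):
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    return overlap / pred_area, overlap / target_area
        </preformat>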
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Qualitative Analysis</title>
      <p>Here, we provide a qualitative analysis of certain issues faced by the models in handling referring
expressions.</p>
      <sec id="sec-5-1">
        <title>6.1. Directional Relations</title>
        <p>The expressions pertaining to Figures 1a and 1b consist of the same spatial relation (‘around’). In the
first figure, the wrapping of the napkin around the hot dog makes the napkin only partially visible. But
in the second figure, the white box around the mirror is almost entirely visible. This shows how the
interpretation of ‘around’ is highly dependent on the configuration of the involved objects. For the first
image, LLaVA fails to precisely localize the object, while MGA-Net only returns a part of the napkin
that is visible. In the second image, both models fail to localize the object.</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Projective and Proximity Relations</title>
      </sec>
      <sec id="sec-5-3">
        <title>6.3. Multiple Spatial Relations</title>
        <p>For ‘Two-and’ category expressions, the models sometimes only satisfy one of the spatial clauses. This
often happens if multiple objects of the same class are in the image. For example, in Figure 1e, the
output baseball player is to the left of the black helmet but is not to the right of the home plate.</p>
        <p>Similarly, for ‘Two-chained’ category expressions, the models sometimes do not consider the entire
expression. For example, in Figure 1f, MGA-Net and LLaVA return the ‘log that is behind the large
bear’, and Grounding DINO returns the bear itself. None of the models consider the ‘large branch’ part
of the expression, which should have been the output.</p>
        <p>Finally, for ‘Two-or’ category expressions, the model might pay attention to only one spatial clause.
Consequently, it returns an object satisfying that clause but not the additional attributes mentioned in
the expression. For example, in Figure 1g, the model returns the monitor, which is to the ‘left of the
keyboard’, but it does not satisfy the color attribute.</p>
      </sec>
      <sec id="sec-5-4">
        <title>6.4. Negation</title>
        <p>Figures 1h and 1i show two cases where all models fail to recognize negation. In 1h, we can observe
that while MGA-Net is wrong, LLaVA is close to the ground truth but partially covers the target object
(high precision, low recall). In 1i, while LLaVA is wrong, MGA-Net is closest to the ground truth but
covers an excess area (low precision, high recall).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>Spatial reasoning is an integral aspect of cognitive reasoning and embodied AI tasks. However, recent
studies have shown that state-of-the-art VLMs often fail to accurately comprehend spatial relations. To
better understand the limitations of these models, we evaluate their spatial understanding using the
referring expression comprehension task because it requires explicit grounding of complex linguistic
expressions in the visual modality. We picked multiple models, including Vision-language models
(LLaVA, Grounding DINO) as well as task-specific models (MGA-Net). We observed that the VLMs
that are trained in the wild with visual and textual data perform worse in grounding. All models show
low accuracy in grounding Directional relations. However, the VLMs do better in vague relations such
as proximity, while the task-specific models are better in geometrically well-defined relations such as
left and right. While using spatial relations increases the grounding accuracy, using multiple relations
makes the reasoning more challenging for all models, with a higher impact on VLMs. However, unlike
VLMs, MGA-Net maintains its performance for complex spatial expressions due to its compositional
learning architecture. In the presence of visual complexity, the performance of all models drops, but
LLaVA’s performance is affected the most due to its lack of grounded pre-training. Finally, both VLMs
and task-specific models have failure cases when grounding expressions that include negation. These
findings shed light on the gaps for future work on Vision-language models.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Future Directions</title>
      <p>
        Although increasing the number of parameters of VLMs can improve their performance for expressions
with simple spatial relations, architectural changes are necessary if the VLMs are to maintain their
performance even for expressions with novel complex compositions of spatial relations. We observed
that MGA-Net maintains a consistent performance for expressions with varying spatial complexity
better than the VLMs due to its soft attention module, which decomposes the expression into its
semantic components for compositional reasoning. This highlights the decomposition of complex
spatial expressions as a potential path forward to help VLMs generalize. Alternative strategies [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]
could be using multi-modal transformer models [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] and techniques such as weight sharing
across transformer layers or ‘Pushdown layers’ with recursive language understanding [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. Another
promising direction is Neuro-symbolic processing [
        <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
        ], which involves generating symbolic programs
from expressions using LLMs and conducting explicit symbolic compositions before grounding into
visual modality. We plan to explore integrating such techniques with VLMs to improve their spatial
compositional reasoning capabilities.
      </p>
      <p>Another issue to address is the VLMs’ inability to comprehend negations. MGA-Net’s improved
performance over Grounding DINO due to the presence of negated expressions in the training data
motivates us to explore the augmentation of training/instruction tuning data of VLMs with synthetically
generated negated expressions. Additionally, we also plan to formulate contrastive learning objectives
to penalize the model when it fails to comprehend negations.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Description of spatial categories</title>
      <p>
        For our analysis, we utilize the spatial categories introduced by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and replace the ’Cardinal Direction’
category with ’Absolute’. The descriptions and examples for the chosen categories are as follows:
      </p>
      <p>1. Absolute: Consists of relations that describe the location of an object in an absolute manner and not in relation to another object. E.g.: man on the right that is standing and wearing gray pant</p>
      <p>2. Adjacency: Consists of relations that describe the close, side-by-side positioning of two objects. They may or may not imply a particular direction. E.g.: The large poster that is leaning against the wall</p>
      <p>3. Directional: Consists of dynamic action verbs / directional relations. They describe the movement or change in position of an object relative to other objects in the image. The interpretation of these relations heavily relies on the configuration of the involved objects and/or the dynamic spatial relationship between them. E.g.: The gray car that is driving down the road</p>
      <p>4. Orientation: Consists of relations that describe the orientation of an object w.r.t. another object. E.g.: The sitting dog that is facing the window that is to the right of the mirror</p>
      <p>5. Projective: Consists of relations that indicate the concrete spatial relationship between two objects, i.e., these relations can be quantified in terms of the coordinates of the two objects. E.g.: The black oven that is above the drawer</p>
      <p>6. Proximity: Consists of relations that indicate that two objects are near each other without giving a specific directional relationship. E.g.: The blue chair that is close to the white monitor</p>
      <p>7. Topological: Consists of relations that indicate the broader arrangement or the containment of an object w.r.t. another object. E.g.: The silver train that is at the colorful station</p>
      <p>8. Unallocated: Consists of relations that cannot be allocated to any of the above categories.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Experiments with other VLMs</title>
      <p>
        In our analysis, we also experimented with InstructBLIP [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and OpenFlamingo [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] models for the
REC task. These models are general-purpose VLMs, with InstructBLIP working in the zero-shot mode
and OpenFlamingo in the few-shot mode. In this section, we discuss the prompts that we used for these
two models and the outputs obtained for the prompts:
      </p>
      <sec id="sec-10-1">
        <title>B.1. InstructBLIP</title>
        <p>For InstructBLIP, we designed three prompts for the REC task. They are as follows:
In the prompts, the ‘bounding box list’ placeholder takes the coordinates of the detected bounding
boxes in the image being passed as the input, along with indices for each bounding box, starting from
‘1’. But for the third prompt, the model has no access to pre-detected candidate bounding boxes in the
image. While the expected output for the first prompt is the index of the correct bounding box, for the
other two prompts it is the bounding box coordinates.</p>
        <p>The bounding box format is [x1, y1, x2, y2], where (x1, y1) is the bottom left corner and (x2, y2) is
the top right corner of the box. The coordinate values are a fraction of the total length/width of the
image according to the position of the coordinate.</p>
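        <p>A small helper illustrating this coordinate convention, converting a normalized box to pixel coordinates (the rounding here is an illustrative assumption):</p>
        <preformat>
# Sketch: convert the normalized [x1, y1, x2, y2] box described above, where
# (x1, y1) is the bottom-left and (x2, y2) the top-right corner and values are
# fractions of the image width/height, into pixel coordinates.
def to_pixels(box, image_width, image_height):
    x1, y1, x2, y2 = box
    return [round(x1 * image_width), round(y1 * image_height),
            round(x2 * image_width), round(y2 * image_height)]

print(to_pixels([0.16, 0.55, 0.32, 0.80], 640, 480))    # [102, 264, 205, 384]
        </preformat>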
        <p>Unfortunately, none of the prompts gave consistently correct outputs. The outputs were as follows:
Prompt 1: The outputs were mostly incorrect. Sometimes, the model also gave ‘0’ as the output, even
though it is not a valid index.</p>
        <p>Prompt 2: The output did not return meaningful coordinates in most cases. But in the few instances
that it did, they were mostly incorrect. Example outputs when the model could not return meaningful
coordinates are:
• {1: [0.16, 0.55], 2: [0.32, 0.47], 3: [0.55, 0.6], 4: [0.21, 0.06]}
• [0.9, 0.53, 0.93, 0.57, 0.0, 0.39]</p>
        <p>Prompt 3: The model could not understand the task, and it just paraphrased parts of the prompt
instead of giving the coordinates as the output. Example prompts and outputs are:
• Prompt: Provide the bounding box coordinates for: ”The large poster that is leaning against the
wall”</p>
        <p>Output: what is the bounding box coordinates for the large poster that is leaning against the wall
• Prompt: Provide the bounding box coordinates for: ”The young man that is leaning against the
wall”
Output: is standing in an elevator. the young man that is leaning against the wall is standing in
an elevator</p>
      </sec>
      <sec id="sec-10-2">
        <title>B.2. OpenFlamingo</title>
          <p>We tested all the prompts designed for OpenFlamingo in both 2-shot and 3-shot settings.</p>
          <p>Prompt 1:
• Example output format: &lt;image&gt;Bounding Boxes:bounding box list; Expression: Refexp;</p>
        <p>Correct Bounding Box:”ID”&lt;|endofchunk|&gt;
• Query format: &lt;image&gt;Bounding Boxes:bounding box list; Expression: Refexp; Correct
Bounding Box:“
‘bounding box list’ placeholder takes the list of candidate bounding boxes in the image as input, in the
same format as InstructBLIP (discussed in the previous section). The expected output is the index of the
correct bounding box. However, we observed that irrespective of the query, the model gave the same
output index for the same set of prompting examples.</p>
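          <p>A sketch of how the Prompt 1 few-shot sequence can be assembled as a single string; the example tuples and IDs are placeholders:</p>
          <preformat>
# Sketch: build the Prompt 1 few-shot sequence (k in-context examples plus the query)
# following the example and query formats listed above. Boxes and IDs are placeholders.
def make_prompt(examples, query_boxes, query_expr):
    # examples: list of (bounding_box_list_string, referring_expression, correct_box_id)
    parts = []
    for boxes, expr, box_id in examples:
        parts.append(f'&lt;image&gt;Bounding Boxes:{boxes}; Expression: {expr}; '
                     f'Correct Bounding Box:"{box_id}"&lt;|endofchunk|&gt;')
    parts.append(f'&lt;image&gt;Bounding Boxes:{query_boxes}; Expression: {query_expr}; '
                 'Correct Bounding Box:"')
    return "".join(parts)
          </preformat>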
        <p>Prompt 2:
• Example output format: &lt;image&gt;Expression: Refexp; Correct Bounding Box:[Bounding box
coordinates]&lt;|endofchunk|&gt;
‘bounding box list’ placeholder takes the same input as explained for Prompt 1. But instead of expecting
the index, we expect the coordinates of the bounding box as the output. The format of the bounding
box is the same as explained for InstructBLIP in the previous section. However, the model failed to give
meaningful coordinates as output in most cases. When it did give meaningful coordinates, the outputs
were mostly incorrect.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual Instruction Tuning,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <source>Gemini: A Family of Highly Capable Multimodal Models, arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          , G. Emerson,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <source>Visual Spatial Reasoning, Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>635</fpage>
          -
          <lpage>651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Merrill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Rohrbach,</surname>
          </string-name>
          <article-title>ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension</article-title>
          ,
          <source>arXiv preprint arXiv:2204.05991</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>What's “up” with vision-language models? Investigating their struggle with spatial reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2310.19785</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          , L. Ma,
          <string-name>
            <surname>K.-Y. K. Wong</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Cops-Ref</surname>
          </string-name>
          :
          <article-title>A new Dataset and Task on Compositional Referring Expression Comprehension</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10086</fpage>
          -
          <lpage>10095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Grounding</surname>
            <given-names>DINO</given-names>
          </string-name>
          :
          <article-title>Marrying DINO with Grounded Pre-Training for Open-Set Object Detection</article-title>
          ,
          <source>arXiv preprint arXiv:2303.05499</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Modular Graph Attention Network for Complex Visual Relational Reasoning</article-title>
          ,
          <source>in: Proceedings of the Asian Conference on Computer Vision</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Marchi Fagundes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Delazari</surname>
          </string-name>
          ,
          <article-title>A cross-linguistic study of spatial location descriptions in New Zealand English and Brazilian Portuguese natural language</article-title>
          ,
          <source>Transactions in GIS 25</source>
          (
          <year>2021</year>
          )
          <fpage>3159</fpage>
          -
          <lpage>3187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          , T. Ma, L. Xie,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>ChatterBox: Multi-round Multimodal Referring and Grounding</article-title>
          , arXiv preprint arXiv:
          <volume>2401</volume>
          .13307 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>MMBench: Is Your Multi-modal Model an All-around Player?</article-title>
          ,
          <source>arXiv preprint arXiv:2307.06281</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shan</surname>
          </string-name>
          , SEED-Bench:
          <article-title>Benchmarking Multimodal LLMs with Generative Comprehension</article-title>
          ,
          <source>arXiv preprint arXiv:2307.16125</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>MM-Vet</surname>
          </string-name>
          :
          <article-title>Evaluating Large Multimodal Models for Integrated Capabilities</article-title>
          ,
          <source>arXiv preprint arXiv:2308.02490</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <article-title>BLINK: Multimodal Large Language Models Can See but Not Perceive</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rösch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Libovickỳ</surname>
          </string-name>
          ,
          <article-title>Probing the Role of Positional Information in Vision-Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2305.10046</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vineet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2406.14852</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vineet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Benchmarking Spatial Relationships in Text-to-Image Generation</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10015</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuhnle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Copestake</surname>
          </string-name>
          ,
          <article-title>How clever is the FiLM model, and how clever can it be?</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          (ECCV) Workshops,
          <year>2018</year>
          , pp.
          <fpage>0</fpage>
          -
          <lpage>0</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernandez-Orallo</surname>
          </string-name>
          ,
          <article-title>Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs</article-title>
          ,
          <source>arXiv preprint arXiv:2304.11164</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Things not Written in Text: Exploring Spatial Commonsense from Visual Signals</article-title>
          ,
          <source>arXiv preprint arXiv:2203.08075</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mirzaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Faghihi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kordjamshidi</surname>
          </string-name>
          ,
          <article-title>SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2104.05832</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mirzaiee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gladstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <article-title>CAESAR: An Embodied Simulator for Generating Multimodal Referring Expression Datasets</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>21001</fpage>
          -
          <lpage>21015</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gladstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <article-title>EQA-MX: Embodied Question Answering using Multimodal Expression</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Merullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <article-title>Does CLIP Bind Concepts? Probing Compositionality in Large Image Models</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10537</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>39</volume>
          (
          <year>2016</year>
          )
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahendran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>Simple Open-Vocabulary Object Detection with Vision Transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>728</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <article-title>You Only Look Once: Unified, Real-Time Object Detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06500</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Awadalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hanafy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marathe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bitton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gadre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sagawa</surname>
          </string-name>
          , et al.,
          <article-title>OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2308.01390</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>MAttNet: Modular Attention Network for Referring Expression Comprehension</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1307</fpage>
          -
          <lpage>1315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H. G.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Revisiting referring expression comprehension evaluation in the era of large multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2406.16866</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34] novita.ai,
          <article-title>Vicuna: an Open-Source Large Language Model for Chatbots</article-title>
          , https://blogs.novita.ai/vicuna-an-open-source-large-language-model-for-chatbots/,
          <year>2024</year>
          . Published: 2024-04-18. Accessed: 2024-07-26.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Premsri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kordjamshidi</surname>
          </string-name>
          ,
          <article-title>A Survey on Compositional Learning of AI Models: Theoretical and Experimental Practices</article-title>
          ,
          <source>arXiv preprint arXiv:2406.08787</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sikarwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <article-title>When Can Transformers Ground and Compose: Insights from Compositional Generalization Benchmarks</article-title>
          ,
          <source>arXiv preprint arXiv:2210.12786</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sha</surname>
          </string-name>
          ,
          <article-title>Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?</article-title>
          ,
          <source>arXiv preprint arXiv:2109.12243</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Murty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Pushdown Layers: Encoding Recursive Structure in Transformer Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.19089</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Barezi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kordjamshidi</surname>
          </string-name>
          ,
          <article-title>NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization</article-title>
          ,
          <source>arXiv preprint arXiv:2412.15588</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>What's Left? Concept Grounding with Logic-Enhanced Foundation Models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          1. Bounding Boxes: bounding box list; Referring Expression: Refexp; The index of the output bounding box is:
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          2. Bounding Boxes: bounding box list; Referring Expression: Refexp; The coordinates of the output bounding box are:
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          3. Provide the bounding box coordinates for: "Refexp"
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>