<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guillem Collell</string-name>
          <email>gcollell@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Francine Moens</string-name>
          <email>sien.moens@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, KU Leuven</institution>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Spatial understanding is crucial for any agent that navigates
in a physical world. Computational and cognitive frameworks
often model spatial representations as spatial templates or
regions of acceptability for two objects under an explicit
spatial preposition such as “left” or “below”
        <xref ref-type="bibr" rid="ref4">(Logan and Sadler
1996)</xref>
        . Contrary to previous work that defines spatial templates
for explicit spatial language only
        <xref ref-type="bibr" rid="ref5 ref6">(Malinowski and Fritz 2014;
Moratz and Tenbrink 2006)</xref>
        , we extend this concept to
implicit spatial language, i.e., relationships (usually
actions) that do not explicitly state the relative location of the
two objects, as explicit prepositions do (e.g., “dog under table”),
but only imply it (e.g., “girl riding horse”). Unlike explicit
relationships, predicting spatial arrangements from implicit
spatial language requires spatial common sense knowledge about the
objects and actions. Furthermore, prior work that leverages common sense
spatial knowledge to solve tasks such as visual paraphrasing
        <xref ref-type="bibr" rid="ref3">(Lin and Parikh 2015)</xref>
        or object labeling
        <xref ref-type="bibr" rid="ref8">(Shiang et al. 2017)</xref>
        does not aim to predict (unseen) spatial configurations.
      </p>
      <p>
        Here, we propose the task of predicting the relative spatial
locations of two objects given a textual input of the form
(Subject, Relationship, Object). We report on initial
experiments with a simple neural network model, trained with
distance-based supervision on annotated images, that obtains
promising performance. Crucially, we show that the model
can reliably predict templates for unseen combinations, e.g.,
predicting (man, riding, elephant) without having seen such a
scene before. Furthermore, by leveraging word embeddings
of objects and relationships, the model can correctly predict
spatial templates for unseen words. E.g., without ever having
seen “boots” before but only “sandals”, the model correctly
predicts the template of (person, wearing, boots) by
inferring that, since “boots” are similar to “sandals”, they must be
worn at the same position on the “person”’s body. Hence, the
model is able to leverage the learned common sense spatial
knowledge to generalize to unseen objects.
*The reader may refer to a full paper
        <xref ref-type="bibr" rid="ref1">(Collell, Van Gool, and Moens
2018)</xref>
        that resulted from the preliminary studies presented in this
abstract.
      </p>
      <sec id="sec-1-1">
        <title>Target (Obj. center &amp; size)</title>
        <p>Obj.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Predict</title>
        <p>g Compose
n
i
n
r</p>
      </sec>
      <sec id="sec-1-3">
        <title>Lea Concatenate</title>
      </sec>
      <sec id="sec-1-4">
        <title>Embeddings</title>
      </sec>
      <sec id="sec-1-5">
        <title>Input (text)</title>
        <p>man flying kite
Subj. Rel. Obj.
j)b j)
(zeu rSeub
S (
iS ten</p>
        <p>C
Obj.</p>
        <sec id="sec-1-5-1">
          <title>Subj.</title>
        </sec>
        <sec id="sec-1-5-2">
          <title>Subj.</title>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>1) Mirror (if needed)</title>
        <p>Obj.
g
n
i
s
s
e
c
o
r
P
e
r
P</p>
      </sec>
      <sec id="sec-1-7">
        <title>1) Re-scale eag</title>
        <p>coordinates Im</p>
        <p>Proposed task and model
2.1</p>
        <p>Proposed task
We propose the task of predicting the 2D relative spatial
arrangement of two objects under a relationship given a
structured text input of the form (Subject, Relationship, Object)—
abbreviated as (S, R, O). More precisely, the model predicts
the Object’s box center and box size (output) given the
structured text input (S, R, O) plus the center and size of the
Subject’s box (Fig. 1).
</p>
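        <p>For concreteness, one training instance can be represented as sketched below. This is a minimal illustration in Python; the field names are our own and not the Visual Genome schema.</p>
        <preformat>
# One (S, R, O) instance with boxes encoded as (center_x, center_y, width, height).
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Instance:
    subject: str        # e.g. "man"
    relationship: str   # e.g. "riding"
    obj: str            # e.g. "elephant"
    subject_box: Tuple[float, float, float, float]   # input: Subject box center and size
    object_box: Tuple[float, float, float, float]    # target: Object box center and size
</preformat>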
      </sec>
      <sec id="sec-1-7-2">
        <title>Proposed model</title>
        <p>We employ a feed-forward network with embeddings (Fig. 1).
The embedding layer maps the input words (S, R, O) to their
d-dimensional representations. The embeddings are then
concatenated with the Subject’s box center and size. This vector
is then fed into a fully connected layer that composes S, R, and
O into a joint representation. The model’s predictions (the Object’s
center and size) are evaluated against the ground truth with a
mean squared error (MSE) loss.</p>
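        <p>The following is a minimal sketch of this architecture, assuming a PyTorch implementation; the layer names, hidden size, and non-linearity are illustrative choices rather than the exact configuration used in our experiments.</p>
        <preformat>
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=100):
        super().__init__()
        # Embedding layer: maps each of S, R, O to a d-dimensional vector
        # (pre-trained vectors such as GloVe can be loaded into this layer).
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Composition layer: three word vectors plus the Subject's box (center and size).
        self.compose = nn.Linear(3 * emb_dim + 4, hidden_dim)
        # Output: the Object's box center (x, y) and size (width, height).
        self.out = nn.Linear(hidden_dim, 4)

    def forward(self, s, r, o, subj_box):
        # s, r, o: word indices of shape (batch,); subj_box: (batch, 4).
        words = torch.cat([self.emb(s), self.emb(r), self.emb(o)], dim=-1)
        joint = torch.relu(self.compose(torch.cat([words, subj_box], dim=-1)))
        return self.out(joint)

# Training minimizes the MSE between predictions and the ground-truth Object box:
# loss = nn.MSELoss()(model(s, r, o, subj_box), obj_box)
</preformat>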
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental setup</title>
      <p>
        Data. We use the Visual Genome
        <xref ref-type="bibr" rid="ref2">(Krishna et al. 2017)</xref>
dataset, which has 108K images containing 1.5M
human-annotated (S, R, O) instances with corresponding
object boxes. We filter out all instances with explicit spatial
prepositions, yielding 378K implicit (S, R, O) instances.
      </p>
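      <p>A minimal sketch of this filtering step (the preposition list and field names below are illustrative assumptions, not the exact list we use):</p>
      <preformat>
# Keep only implicit (S, R, O) instances by dropping explicit spatial prepositions.
EXPLICIT_PREPOSITIONS = {"on", "in", "under", "below", "above", "behind",
                         "in front of", "inside", "near", "left of", "right of"}

def filter_implicit(instances):
    """Return the instances whose relationship is not an explicit spatial preposition."""
    return [ins for ins in instances
            if ins["relationship"].lower() not in EXPLICIT_PREPOSITIONS]
</preformat>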
      <p>[Table 1: Results on the Raw set, for the implicit and explicit versions of the task, comparing the EMB, RND, 1H, and ctrl models.]</p>
      <p>
        Evaluation sets. We evaluate performance on the
following subsets of Visual Genome. (i) Raw set: Simply the
unfiltered instances. (ii) Unseen words: We randomly pick
25 objects (e.g., “woman”, “apple”, etc.) among the 100
most frequent ones and leave out from the training data
all the instances (∼130K) containing any of these words.
This set is used for testing. (iii) Unseen combinations: We
randomly pick 100 combinations (S, R, O) among the 1,000
most frequent implicit ones and leave them out of the training data.
We finally consider the explicit version of the Raw set.
Reported results are always on unseen instances—yet the
combinations (S, R, O) may have been seen during training
(e.g., in different images).
      </p>
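      <p>As an illustration, the unseen-words split can be constructed as sketched below (function and field names are assumptions; only the counts follow the description above):</p>
      <preformat>
import random
from collections import Counter

def unseen_words_split(instances, n_objects=25, top_k=100, seed=0):
    """Hold out every instance containing any of 25 objects sampled from the 100 most frequent."""
    counts = Counter()
    for ins in instances:
        counts[ins["subject"]] += 1
        counts[ins["obj"]] += 1
    top = [w for w, _ in counts.most_common(top_k)]
    held_out = set(random.Random(seed).sample(top, n_objects))
    test = [ins for ins in instances
            if ins["subject"] in held_out or ins["obj"] in held_out]
    train = [ins for ins in instances
             if ins["subject"] not in held_out and ins["obj"] not in held_out]
    return train, test
</preformat>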
      <p>Data pre-processing. Coordinates are normalized by
image width and height. Since right/left depends only on the
camera viewpoint, we get rid of this arbitrariness by
mirroring the image when the Object is on the left of the Subject.</p>
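      <p>A minimal sketch of this pre-processing, assuming boxes are given as (center_x, center_y, width, height) in pixels:</p>
      <preformat>
def preprocess(subj_box, obj_box, img_w, img_h):
    """Normalize coordinates by image size and mirror so the Object never lies left of the Subject."""
    # Re-scale to [0, 1] by image width and height.
    sx, sy = subj_box[0] / img_w, subj_box[1] / img_h
    sw, sh = subj_box[2] / img_w, subj_box[3] / img_h
    ox, oy = obj_box[0] / img_w, obj_box[1] / img_h
    ow, oh = obj_box[2] / img_w, obj_box[3] / img_h
    # Mirror horizontally when the Object is on the left of the Subject.
    if sx > ox:
        sx, ox = 1.0 - sx, 1.0 - ox
    return (sx, sy, sw, sh), (ox, oy, ow, oh)
</preformat>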
      <p>Evaluation metrics. We use standard regression
metrics: (i) Mean Squared Error (MSE) between predicted
and true Object center and size. (ii) Coefficient of
Determination (R2) of model predictions and ground truth.
(iii) Pearson Correlation (r) between predicted and
true x-component of the Object center, and similarly for
the y-component. We also consider the classification of
above/below relative locations of the Object w.r.t. the Subject.
We report (macro averaged) F1 (F1y) and accuracy (accy).
</p>
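      <p>These metrics can be computed with standard tooling; the sketch below assumes NumPy, SciPy, and scikit-learn, with predictions and ground truth given as arrays of Object centers and sizes (the above/below convention follows image coordinates, where smaller y means higher in the image):</p>
      <preformat>
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score, f1_score, accuracy_score

def evaluate(pred, true, subj_cy):
    """pred, true: (n, 4) arrays of Object (center_x, center_y, width, height); subj_cy: (n,) Subject center_y."""
    pred, true = np.asarray(pred), np.asarray(true)
    mse = mean_squared_error(true, pred)
    r2 = r2_score(true, pred)
    r_x = pearsonr(pred[:, 0], true[:, 0])[0]   # Pearson r on the x-component of the center
    r_y = pearsonr(pred[:, 1], true[:, 1])[0]   # Pearson r on the y-component of the center
    # Above/below classification of the Object w.r.t. the Subject (above = smaller y).
    pred_above = subj_cy > pred[:, 1]
    true_above = subj_cy > true[:, 1]
    f1_y = f1_score(true_above, pred_above, average="macro")
    acc_y = accuracy_score(true_above, pred_above)
    return mse, r2, r_x, r_y, f1_y, acc_y
</preformat>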
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We test the following model variations. EMB denotes a model
that uses pre-trained word embeddings, namely 300-d GloVe
        <xref ref-type="bibr" rid="ref7">(Pennington, Socher, and
Manning 2014)</xref>
        (http://nlp.stanford.edu/projects/glove); RND a model with
random normal embeddings; 1H a model with one-hot
embeddings; and ctrl a baseline that outputs random normal predictions.
Overall, the preliminary results outlined below look promising.</p>
      <sec id="sec-3-1">
        <title>Quantitative results</title>
        <p>
Evaluation with raw data. Table 1 shows that all methods
perform well on the Raw set. Remarkably, we see that
relative locations can be predicted from implicit spatial language
at least as accurately as from explicit spatial language.
Unseen combinations. All models perform well on unseen
combinations (table not shown), remarkably close to their
performance with seen combinations.
      </p>
        <p>Unseen words. In contrast, large differences in performance
are observed with unseen words (table not shown), where the
model that uses pre-trained embeddings (EMB) performs significantly
better than the rest.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Qualitative results</title>
        <p>[Figure 2: Predicted spatial templates (heat maps) for (person, holding, cat), (man, following, elephant), (person, riding, elephant), (man, flying, kite), (man, holding, kite), and (man, walking, dog).]</p>
        <p>
Heat maps in Fig. 2 show regions of predicted high (red) and
low (blue) probability. The “heat” of the objects is assumed
to be normally distributed with µ equal to the object’s center
and σ equal to the object’s size. The EMB model is able to infer
both relative locations and sizes, e.g., correctly predicting the
size of a “cat” relative to a “person” even though the model
has never seen a “cat” before. Notably, the model learns to
compose the triplet as a whole, distinguishing, e.g., (man,
flying, kite) from (man, holding, kite).</p>
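        <p>Such heat maps can be rendered by placing an axis-aligned Gaussian at each predicted box; a sketch assuming NumPy, with boxes as normalized (center_x, center_y, width, height):</p>
        <preformat>
import numpy as np

def heat_map(box, grid=200):
    """2D Gaussian 'heat' for a box: mean at the box center, std given by the box size."""
    cx, cy, w, h = box
    xs = np.linspace(0.0, 1.0, grid)
    ys = np.linspace(0.0, 1.0, grid)
    x, y = np.meshgrid(xs, ys)
    # Independent Gaussians along x and y.
    return np.exp(-0.5 * (((x - cx) / w) ** 2 + ((y - cy) / h) ** 2))
</preformat>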
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has been supported by the CHIST-ERA EU project
MUSTER (http://www.chistera.eu/projects/muster).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Collell</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Van Gool</surname>, <given-names>L.</given-names></string-name>; and
          <string-name><surname>Moens</surname>, <given-names>M.-F.</given-names></string-name>.
          <year>2018</year>.
          <article-title>Acquiring common sense spatial knowledge through implicit spatial templates</article-title>.
          <source>AAAI</source>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Krishna</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Groth</surname>, <given-names>O.</given-names></string-name>;
          <string-name><surname>Johnson</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Hata</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Kravitz</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Kalantidis</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>L.-J.</given-names></string-name>;
          <string-name><surname>Shamma</surname>, <given-names>D. A.</given-names></string-name>; et al.
          <year>2017</year>.
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>.
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>(<issue>1</issue>):<fpage>32</fpage>-<lpage>73</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks</article-title>
          .
          <source>In CVPR</source>
          ,
          <fpage>2984</fpage>
          -
          <lpage>2993</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Logan</surname>
            ,
            <given-names>G. D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sadler</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>A computational analysis of the apprehension of spatial relations</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A pooling approach to modelling spatial relations for image retrieval and annotation</article-title>
          .
          <source>arXiv preprint arXiv:1411.5190</source>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Moratz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tenbrink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations</article-title>
          .
          <source>Spatial Cognition and Computation</source>
          <volume>6</volume>(<issue>1</issue>):
          <fpage>63</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; <string-name><surname>Socher</surname>, <given-names>R.</given-names></string-name>; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>14</volume>
          ,
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Shiang</surname>, <given-names>S.-R.</given-names></string-name>;
          <string-name><surname>Rosenthal</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Gershman</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Carbonell</surname>, <given-names>J.</given-names></string-name>; and
          <string-name><surname>Oh</surname>, <given-names>J.</given-names></string-name>.
          <year>2017</year>.
          <article-title>Vision-language fusion for object recognition</article-title>.
          <source>AAAI</source>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>