Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*

Guillem Collell and Marie-Francine Moens
Computer Science Department, KU Leuven
gcollell@kuleuven.be; sien.moens@cs.kuleuven.be

* The reader may refer to a full paper (Collell, Van Gool, and Moens 2018) that resulted from the preliminary studies presented in this abstract.

1 Motivation

Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates or regions of acceptability for two objects under an explicit spatial preposition such as "left" or "below" (Logan and Sadler 1996). Contrary to previous work that defines spatial templates for explicit spatial language only (Malinowski and Fritz 2014; Moratz and Tenbrink 2006), we extend this concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (e.g., "dog under table") but only implicitly (e.g., "girl riding horse"). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverages common sense spatial knowledge to solve tasks such as visual paraphrasing (Lin and Parikh 2015) or object labeling (Shiang et al. 2017) does not aim to predict (unseen) spatial configurations.

Here, we propose the task of predicting the relative spatial locations of two objects given a textual input of the form (Subject, Relationship, Object). We report on initial experiments with a simple neural network model with distance-based supervision, learned from annotated images, that obtains promising performance. Crucially, we show that the model can reliably predict templates of unseen combinations, e.g., predicting (man, riding, elephant) without having seen such a scene before. Furthermore, by leveraging word embeddings of objects and relationships, the model can correctly predict spatial templates for unseen words. E.g., without having ever seen "boots" before but only "sandals", the model correctly predicts the template of (person, wearing, boots) by inferring that, since "boots" are similar to "sandals", they must be worn at the same position on the "person"'s body. Hence, the model is able to leverage the learned common sense spatial knowledge to generalize to unseen objects.

Figure 1: Overview of our model and setting. (Pipeline: pre-processing re-scales coordinates and mirrors the image if needed; the input text (Subj., Rel., Obj.), e.g., "man flying kite", is mapped to embeddings, concatenated with the Subject's center and size, composed, and used to predict the target Object's center and size.)

2 Proposed task and model

2.1 Proposed task

We propose the task of predicting the 2D relative spatial arrangement of two objects under a relationship, given a structured text input of the form (Subject, Relationship, Object), abbreviated as (S, R, O). More precisely, the model predicts the Object's box center and box size (output) given the structured text input (S, R, O) plus the center and size of the Subject's box (Fig. 1).

2.2 Proposed model

We employ a feed-forward network with embeddings (Fig. 1). The embedding layer maps the input words (S, R, O) to their d-dimensional representations. The embeddings are then concatenated with the Subject's box center and size. This vector is fed into a fully connected layer that composes S, R, and O into a joint representation. Model predictions (Object's center and size) are evaluated against the ground truth with a mean squared error (MSE) loss.
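For concreteness, the sketch below illustrates one way the described architecture could be implemented in PyTorch; the hidden-layer size, the ReLU nonlinearity, the vocabulary size, and all variable names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the described model. Only the input/output structure
# (word embeddings for S, R, O concatenated with the Subject's box center
# and size, predicting the Object's center and size) follows the text above;
# hidden size, nonlinearity and names are assumptions.
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=100):
        super().__init__()
        # Embedding layer shared by Subject, Relationship and Object words.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Input: 3 word embeddings + Subject box center (2) + size (2).
        self.compose = nn.Linear(3 * emb_dim + 4, hidden_dim)
        # Output: Object box center (x, y) and size (w, h).
        self.out = nn.Linear(hidden_dim, 4)

    def forward(self, s, r, o, subj_box):
        # s, r, o: word indices of shape (batch,); subj_box: (batch, 4).
        x = torch.cat([self.emb(s), self.emb(r), self.emb(o), subj_box], dim=-1)
        return self.out(torch.relu(self.compose(x)))

model = SpatialTemplateNet(vocab_size=10000)
loss_fn = nn.MSELoss()  # trained against the ground-truth Object box
```

The only components fixed by the description above are the embedding lookup, the concatenation with the Subject's normalized box, the fully connected composition layer, and the MSE objective; everything else in the sketch is a placeholder choice.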
3 Experimental setup

Data. We use the Visual Genome dataset (Krishna et al. 2017), which has ∼108K images containing ∼1.5M human-annotated (S, R, O) instances with corresponding object boxes. We filter out all instances with explicit spatial prepositions, yielding ∼378K implicit (S, R, O) instances.

Evaluation sets. We evaluate performance on the following subsets of Visual Genome. (i) Raw set: simply the unfiltered instances. (ii) Unseen words: we randomly pick 25 objects (e.g., "woman", "apple", etc.) among the 100 most frequent ones and leave out from the training data all instances (∼130K) containing any of these words; this set is used for testing. (iii) Unseen combinations: we randomly pick 100 combinations (S, R, O) among the 1,000 most frequent implicit ones and leave them out of training. We finally consider the explicit version of the Raw set. Reported results are always on unseen instances, although the combinations (S, R, O) may have been seen during training (e.g., in different images).

Data pre-processing. Coordinates are normalized by image width and height. Since right/left depends only on the camera viewpoint, we remove this arbitrariness by mirroring the image whenever the Object is on the left of the Subject.

Evaluation metrics. We use standard regression metrics: (i) mean squared error (MSE) between the predicted and true Object center and size; (ii) coefficient of determination (R²) of model predictions and ground truth; (iii) Pearson correlation (r) between the predicted and true x-component of the Object center, and similarly for the y-component. We also consider the classification of above/below relative locations of the Object w.r.t. the Subject, reporting (macro-averaged) F1 (F1_y) and accuracy (acc_y).
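A minimal sketch of how these metrics could be computed with standard library routines follows (NumPy, SciPy and scikit-learn); the array layout, the variable names, and the above/below rule based on the y-coordinate are illustrative assumptions.

```python
# Sketch of the evaluation metrics. Assumes pred and true are NumPy arrays of
# shape (n, 4) holding Object center x, center y, width, height (normalized),
# and subj_center_y holds the Subject's center y per instance. "Above" is
# taken as a smaller y in image coordinates; this convention is an assumption.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score

def evaluate(pred, true, subj_center_y):
    mse = mean_squared_error(true, pred)
    r2 = r2_score(true, pred)
    r_x = pearsonr(pred[:, 0], true[:, 0])[0]  # x-component of Object center
    r_y = pearsonr(pred[:, 1], true[:, 1])[0]  # y-component of Object center
    # Above/below classification of the Object w.r.t. the Subject's center.
    pred_above = pred[:, 1] < subj_center_y
    true_above = true[:, 1] < subj_center_y
    acc_y = accuracy_score(true_above, pred_above)
    f1_y = f1_score(true_above, pred_above, average="macro")
    return mse, r2, r_x, r_y, acc_y, f1_y
```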
4 Results

We test the following model variations: EMB denotes a model that uses pre-trained 300-d GloVe word embeddings (Pennington, Socher, and Manning 2014; http://nlp.stanford.edu/projects/glove), RND a model with random normal embeddings, 1H a model with one-hot embeddings, and ctrl a control that outputs random normal predictions. Overall, the preliminary results outlined below look promising.

4.1 Quantitative results

Evaluation with raw data. Table 1 shows that all methods perform well on the Raw data. Remarkably, we see that relative locations can be predicted from implicit spatial language at least as accurately as from explicit spatial language.

Unseen combinations. All models perform well on unseen combinations (table not shown), remarkably close to their performance with seen combinations.

Unseen words. Contrarily, large differences in performance are observed with unseen words (table not shown), where the model that uses embeddings (EMB) performs significantly better than the rest.

Table 1: Results on implicit and explicit relations.

                 MSE     R²      acc_y   F1_y    r_x     r_y
Implicit  EMB    0.008   0.705   0.756   0.755   0.894   0.834
          RND    0.008   0.691   0.750   0.750   0.891   0.826
          1H     0.008   0.717   0.762   0.762   0.896   0.842
          ctrl   0.054   -1.000  0.522   0.521   0.000   -0.001
Explicit  EMB    0.013   0.586   0.768   0.770   0.811   0.823
          RND    0.013   0.580   0.767   0.769   0.808   0.815
          1H     0.012   0.604   0.778   0.780   0.815   0.828
          ctrl   0.060   -1.000  0.633   0.630   0.000   0.000

4.2 Qualitative evaluation (spatial templates)

Heat maps in Fig. 2 show regions of predicted high (red) and low (blue) probability. The "heat" of the objects is assumed to be normally distributed, with µ equal to the object's center and σ to the object's size. The EMB model is able to infer both relative locations and sizes, e.g., correctly predicting the size of a "cat" relative to a "person" even though the model has never seen a "cat" before. Notably, the model learns to compose the triplet as a whole, distinguishing, e.g., (man, flying, kite) from (man, holding, kite).

Figure 2: Predictions by the model that leverages word embeddings (EMB). Top: predictions for unseen words (underlined in the figure): (person, holding, cat), (man, following, elephant), (person, riding, elephant). Bottom: predictions for unseen triplets: (man, flying, kite), (man, holding, kite), (man, walking, dog).

Acknowledgments

This work has been supported by the CHIST-ERA EU project MUSTER (http://www.chistera.eu/projects/muster).

References

Collell, G.; Van Gool, L.; and Moens, M.-F. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. In AAAI.

Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73.

Lin, X., and Parikh, D. 2015. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2984–2993.

Logan, G. D., and Sadler, D. D. 1996. A computational analysis of the apprehension of spatial relations.

Malinowski, M., and Fritz, M. 2014. A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv preprint arXiv:1411.5190.

Moratz, R., and Tenbrink, T. 2006. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition and Computation 6(1):63–107.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.

Shiang, S.-R.; Rosenthal, S.; Gershman, A.; Carbonell, J.; and Oh, J. 2017. Vision-language fusion for object recognition. In AAAI.