=Paper=
{{Paper
|id=Vol-2052/paper6
|storemode=property
|title=Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language
|pdfUrl=https://ceur-ws.org/Vol-2052/paper6.pdf
|volume=Vol-2052
|authors=Guillem Collell,Marie-Francine Moens
|dblpUrl=https://dblp.org/rec/conf/commonsense/CollellM17
}}
==Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language==
Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*
Guillem Collell and Marie-Francine Moens
Computer Science Department
KU Leuven
gcollell@kuleuven.be; sien.moens@cs.kuleuven.be
1 Motivation

Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates, or regions of acceptability for two objects under an explicit spatial preposition such as "left" or "below" (Logan and Sadler 1996). Contrary to previous work that defines spatial templates for explicit spatial language only (Malinowski and Fritz 2014; Moratz and Tenbrink 2006), we extend this concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (e.g., "dog under table") but only implicitly (e.g., "girl riding horse"). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverages common sense spatial knowledge to solve tasks such as visual paraphrasing (Lin and Parikh 2015) or object labeling (Shiang et al. 2017) does not aim to predict (unseen) spatial configurations.

Here, we propose the task of predicting the relative spatial locations of two objects given a textual input of the form (Subject, Relationship, Object). We report on initial experiments with a simple neural network model with distance-based supervision learned from annotated images that obtains promising performance. Crucially, we show that the model can reliably predict templates of unseen combinations, e.g., predicting (man, riding, elephant) without having seen such a scene before. Furthermore, by leveraging word embeddings of objects and relationships, the model can correctly predict spatial templates for unseen words. E.g., without having ever seen "boots" before but only "sandals", the model correctly predicts the template of (person, wearing, boots) by inferring that, since "boots" are similar to "sandals", they must be worn at the same position on the "person"'s body. Hence, the model is able to leverage the learned common sense spatial knowledge to generalize to unseen objects.

*The reader may refer to the full paper (Collell, Van Gool, and Moens 2018) that resulted from the preliminary studies presented in this abstract.

[Figure 1: Overview of our model and setting. Pipeline: input text (Subj., Rel., Obj.), e.g., "man flying kite"; pre-processing (1) re-scale coordinates, mirror image if needed; embeddings of (S, R, O) are concatenated with the Subject's box center and size and composed to predict the target Object's box center and size.]

2 Proposed task and model

2.1 Proposed task

We propose the task of predicting the 2D relative spatial arrangement of two objects under a relationship, given a structured text input of the form (Subject, Relationship, Object), abbreviated as (S, R, O). More precisely, the model predicts the Object's box center and box size (output) given the structured text input (S, R, O) plus the center and size of the Subject's box (Fig. 1).

2.2 Proposed model

We employ a feed-forward network with embeddings (Fig. 1). The embedding layer maps the input words (S, R, O) to their d-dimensional representations. The embeddings are then concatenated with the Subject's box center and size. This vector is then fed into a fully connected layer to compose S, R, O into a joint representation. Model predictions (the Object's center and size) are evaluated against the ground truth with a mean squared error (MSE) loss.

3 Experimental setup

Data. We use the Visual Genome (Krishna et al. 2017) dataset, which has ∼108K images containing ∼1.5M human-annotated (S, R, O) instances with corresponding object boxes. We filter out all instances with explicit spatial prepositions, yielding ∼378K implicit (S, R, O) instances.
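For concreteness, the feed-forward model of Sec. 2.2 can be sketched as below. The vocabulary, embedding dimension, hidden size, and random weight initialization are illustrative assumptions only; the actual model uses 300-d GloVe embeddings and is trained, not initialized at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (illustrative assumption; the paper
# uses 300-d GloVe embeddings over Visual Genome vocabulary).
vocab = {"man": 0, "flying": 1, "kite": 2}
d = 8                                   # embedding dimension (assumption)
E = rng.normal(size=(len(vocab), d))    # embedding layer

h = 16                                       # hidden size (assumption)
W1 = 0.1 * rng.normal(size=(3 * d + 4, h))   # composes (S, R, O) + Subject box
b1 = np.zeros(h)
W2 = 0.1 * rng.normal(size=(h, 4))           # output: Object center (x, y) + size (w, h)
b2 = np.zeros(4)

def predict(s, r, o, subj_center, subj_size):
    """Predict the Object's box center and size from (S, R, O) + Subject's box."""
    emb = np.concatenate([E[vocab[s]], E[vocab[r]], E[vocab[o]]])
    x = np.concatenate([emb, subj_center, subj_size])  # concat with Subject box
    z = np.maximum(0.0, x @ W1 + b1)                   # fully connected layer (ReLU)
    return z @ W2 + b2                                 # linear output layer

def mse_loss(pred, target):
    """Distance-based supervision: mean squared error against ground truth."""
    return float(np.mean((pred - target) ** 2))

# Coordinates are normalized by image width and height (Sec. 3).
pred = predict("man", "flying", "kite",
               subj_center=np.array([0.5, 0.8]), subj_size=np.array([0.3, 0.5]))
target = np.array([0.6, 0.2, 0.2, 0.2])  # ground-truth Object center and size
loss = mse_loss(pred, target)
```

Training would minimize this MSE over the annotated (S, R, O) instances by gradient descent, updating the composition weights (and, depending on the variant, the embeddings).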
           MSE     R²      acc_y   F1_y    r_x     r_y
Implicit
  EMB      0.008   0.705   0.756   0.755   0.894   0.834
  RND      0.008   0.691   0.750   0.750   0.891   0.826
  1H       0.008   0.717   0.762   0.762   0.896   0.842
  ctrl     0.054  -1.000   0.522   0.521   0.000  -0.001
Explicit
  EMB      0.013   0.586   0.768   0.770   0.811   0.823
  RND      0.013   0.580   0.767   0.769   0.808   0.815
  1H       0.012   0.604   0.778   0.780   0.815   0.828
  ctrl     0.060  -1.000   0.633   0.630   0.000   0.000

Table 1: Results on implicit and explicit relations.

Evaluation sets. We evaluate performance on the following subsets of Visual Genome. (i) Raw set: simply the unfiltered instances. (ii) Unseen words: we randomly pick 25 objects (e.g., "woman", "apple", etc.) among the 100 most frequent ones and leave out from the training data all the instances (∼130K) containing any of these words; this set is used for testing. (iii) Unseen combinations: we randomly pick 100 combinations (S, R, O) among the 1,000 most frequent implicit ones and leave them out of training. We finally consider the explicit version of the Raw set. Reported results are always on unseen instances, yet the combinations (S, R, O) may have been seen during training (e.g., in different images).

Data pre-processing. Coordinates are normalized by image width and height. Since right/left depends only on the camera viewpoint, we remove this arbitrariness by mirroring the image whenever the Object is on the left of the Subject.

Evaluation metrics. We use standard regression metrics: (i) mean squared error (MSE) between the predicted and true Object center and size; (ii) coefficient of determination (R²) of model predictions and ground truth; (iii) Pearson correlation (r) between the predicted and true x-component of the Object center (r_x), and similarly for the y-component (r_y). We also consider the classification of above/below relative locations of the Object w.r.t. the Subject, for which we report (macro-averaged) F1 (F1_y) and accuracy (acc_y).

4 Results

We test the following model variations: EMB denotes a model that uses pre-trained word embeddings¹, RND a model with random normal embeddings, 1H employs one-hot embeddings, and ctrl outputs random normal predictions. Overall, the preliminary results outlined below look promising.

4.1 Quantitative results

Evaluation with raw data. Table 1 shows that all methods perform well on the Raw data. Remarkably, we see that relative locations can be predicted from implicit spatial language at least as accurately as from explicit spatial language.

Unseen combinations. All models perform well on unseen combinations (table not shown), remarkably close to their performance on seen combinations.

Unseen words. Contrarily, large differences in performance are observed with unseen words (table not shown), where the model that uses embeddings (EMB) performs significantly better than the rest.

[Figure 2: Predictions by the model that leverages word embeddings (EMB). Top, predictions with unseen words (underlined): (person, holding, cat), (man, following, elephant), (person, riding, elephant). Bottom, predictions with unseen triplets: (man, flying, kite), (man, holding, kite), (man, walking, dog).]

4.2 Qualitative evaluation (spatial templates)

The heat maps in Fig. 2 show regions of predicted high (red) and low (blue) probability. The "heat" of the objects is assumed to be normally distributed, with µ equal to the object's center and σ to the object's size. The EMB model is able to infer both relative locations and sizes, e.g., correctly predicting the size of a "cat" relative to a "person" even though the model has never seen a "cat" before. Notably, the model learns to compose the triplet as a whole, distinguishing, e.g., (man, flying, kite) from (man, holding, kite).

Acknowledgments

This work has been supported by the CHIST-ERA EU project MUSTER.²

¹ We use 300-d GloVe embeddings (Pennington, Socher, and Manning 2014): http://nlp.stanford.edu/projects/glove.
² http://www.chistera.eu/projects/muster

References

Collell, G.; Van Gool, L.; and Moens, M.-F. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. AAAI.

Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73.

Lin, X., and Parikh, D. 2015. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2984–2993.
Logan, G. D., and Sadler, D. D. 1996. A computational
analysis of the apprehension of spatial relations.
Malinowski, M., and Fritz, M. 2014. A pooling approach to
modelling spatial relations for image retrieval and annotation.
arXiv preprint arXiv:1411.5190.
Moratz, R., and Tenbrink, T. 2006. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition and Computation 6(1):63–107.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532–1543.
Shiang, S.-R.; Rosenthal, S.; Gershman, A.; Carbonell, J.; and
Oh, J. 2017. Vision-language fusion for object recognition.
AAAI.