<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guillem Collell</string-name>
          <email>gcollell@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Francine Moens</string-name>
          <email>sien.moens@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, KU Leuven</institution>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Spatial understanding is crucial for any agent that navigates
in a physical world. Computational and cognitive frameworks
often model spatial representations as spatial templates or
regions of acceptability for two objects under an explicit
spatial preposition such as “left” or “below”
        <xref ref-type="bibr" rid="ref4">(Logan and Sadler
1996)</xref>
        . Contrary to previous work that defines spatial templates
for explicit spatial language only
        <xref ref-type="bibr" rid="ref5 ref6">(Malinowski and Fritz 2014;
Moratz and Tenbrink 2006)</xref>
        , we extend this concept to
implicit spatial language, i.e., relationships (usually
actions) that do not explicitly state the relative location of the
two objects, as explicit prepositions do (e.g., “dog under table”),
but only imply it (e.g., “girl riding horse”). Unlike explicit
relationships, predicting spatial arrangements from implicit
spatial language requires spatial common sense knowledge about the
objects and actions. Furthermore, prior work that leverages common sense
spatial knowledge to solve tasks such as visual paraphrasing
        <xref ref-type="bibr" rid="ref3">(Lin and Parikh 2015)</xref>
        or object labeling
        <xref ref-type="bibr" rid="ref8">(Shiang et al. 2017)</xref>
        does not aim to predict (unseen) spatial configurations.
      </p>
      <p>
        Here, we propose the task of predicting the relative spatial
locations of two objects given a textual input of the form
(Subject, Relationship, Object). We report on initial
experiments with a simple neural network model, trained with
distance-based supervision on annotated images, that obtains
promising performance. Crucially, we show that the model
can reliably predict templates for unseen combinations, e.g.,
predicting (man, riding, elephant) without having seen such a
scene before. Furthermore, by leveraging word embeddings
of objects and relationships, the model can correctly predict
spatial templates for unseen words. E.g., without ever having
seen “boots” before but only “sandals”, the model correctly
predicts the template of (person, wearing, boots) by
inferring that, since “boots” are similar to “sandals”, they must be
worn at the same position on the “person”’s body. Hence, the
model is able to leverage the learned common sense spatial
knowledge to generalize to unseen objects.
*The reader may refer to a full paper
        <xref ref-type="bibr" rid="ref1">(Collell, Van Gool, and Moens
2018)</xref>
        that resulted from the preliminary studies presented in this
abstract.
      </p>
      <sec id="sec-1-1">
        <title>Target (Obj. center &amp; size)</title>
        <p>Obj.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Predict</title>
        <p>g Compose
n
i
n
r</p>
      </sec>
      <sec id="sec-1-3">
        <title>Lea Concatenate</title>
      </sec>
      <sec id="sec-1-4">
        <title>Embeddings</title>
      </sec>
      <sec id="sec-1-5">
        <title>Input (text)</title>
        <p>man flying kite
Subj. Rel. Obj.
j)b j)
(zeu rSeub
S (
iS ten</p>
        <p>C
Obj.</p>
        <sec id="sec-1-5-1">
          <title>Subj.</title>
        </sec>
        <sec id="sec-1-5-2">
          <title>Subj.</title>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>1) Mirror (if needed)</title>
        <p>Obj.
g
n
i
s
s
e
c
o
r
P
e
r
P</p>
      </sec>
      <sec id="sec-1-7">
        <title>1) Re-scale eag</title>
        <p>coordinates Im</p>
        <p>Proposed task and model
2.1</p>
        <p>Proposed task
We propose the task of predicting the 2D relative spatial
arrangement of two objects under a relationship given a
structured text input of the form (Subject, Relationship, Object)—
abbreviated as (S, R, O). More precisely, the model predicts
the Object’s box center and box size (output) given the
structured text input (S, R, O) plus the center and size of the
Subject’s box (Fig. 1).
</p>
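        <p>For concreteness, one training instance can be represented as sketched below. This is a minimal illustration in Python; the field names are our own and not the Visual Genome schema.</p>
        <preformat>
# One (S, R, O) instance with boxes encoded as (center_x, center_y, width, height).
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Instance:
    subject: str        # e.g. "man"
    relationship: str   # e.g. "riding"
    obj: str            # e.g. "elephant"
    subject_box: Tuple[float, float, float, float]   # input: Subject box center and size
    object_box: Tuple[float, float, float, float]    # target: Object box center and size
</preformat>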
      </sec>
      <sec id="sec-1-7-2">
        <title>Proposed model</title>
        <p>We employ a feed-forward network with embeddings (Fig. 1).
The embedding layer maps the input words (S, R, O) to their
d-dimensional representations. The embeddings are then
concatenated with the Subject’s box center and size. This vector
is then fed into a fully connected layer that composes S, R, and
O into a joint representation. The model’s predictions (the Object’s
center and size) are evaluated against the ground truth with a
mean squared error (MSE) loss.</p>
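        <p>The following is a minimal sketch of this architecture, assuming a PyTorch implementation; the layer names, hidden size, and non-linearity are illustrative choices rather than the exact configuration used in our experiments.</p>
        <preformat>
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=100):
        super().__init__()
        # Embedding layer: maps each of S, R, O to a d-dimensional vector
        # (pre-trained vectors such as GloVe can be loaded into this layer).
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Composition layer: three word vectors plus the Subject's box (center and size).
        self.compose = nn.Linear(3 * emb_dim + 4, hidden_dim)
        # Output: the Object's box center (x, y) and size (width, height).
        self.out = nn.Linear(hidden_dim, 4)

    def forward(self, s, r, o, subj_box):
        # s, r, o: word indices of shape (batch,); subj_box: (batch, 4).
        words = torch.cat([self.emb(s), self.emb(r), self.emb(o)], dim=-1)
        joint = torch.relu(self.compose(torch.cat([words, subj_box], dim=-1)))
        return self.out(joint)

# Training minimizes the MSE between predictions and the ground-truth Object box:
# loss = nn.MSELoss()(model(s, r, o, subj_box), obj_box)
</preformat>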
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental setup</title>
      <p>
        Data. We use the Visual Genome
        <xref ref-type="bibr" rid="ref2">(Krishna et al. 2017)</xref>
dataset, which has 108K images containing 1.5M
human-annotated (S, R, O) instances with corresponding
object boxes. We filter out all instances with explicit spatial
prepositions, yielding 378K implicit (S, R, O) instances.
      </p>
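      <p>A minimal sketch of this filtering step (the preposition list and field names below are illustrative assumptions, not the exact list we use):</p>
      <preformat>
# Keep only implicit (S, R, O) instances by dropping explicit spatial prepositions.
EXPLICIT_PREPOSITIONS = {"on", "in", "under", "below", "above", "behind",
                         "in front of", "inside", "near", "left of", "right of"}

def filter_implicit(instances):
    """Return the instances whose relationship is not an explicit spatial preposition."""
    return [ins for ins in instances
            if ins["relationship"].lower() not in EXPLICIT_PREPOSITIONS]
</preformat>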
      <p>[Table 1: Results on the Raw set, for the implicit and explicit versions of the task, comparing the EMB, RND, 1H, and ctrl models.]</p>
      <p>
        Evaluation sets. We evaluate performance on the
following subsets of Visual Genome. (i) Raw set: Simply the
unfiltered instances. (ii) Unseen words: We randomly pick
25 objects (e.g., “woman”, “apple”, etc.) among the 100
most frequent ones and leave out from the training data
all the instances (∼130K) containing any of these words.
This set is used for testing. (iii) Unseen combinations: We
randomly pick 100 combinations (S, R, O) among the 1,000
most frequent implicit ones and leave them out of the training data.
We finally consider the explicit version of the Raw set.
Reported results are always on unseen instances—yet the
combinations (S, R, O) may have been seen during training
(e.g., in different images).
      </p>
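      <p>As an illustration, the unseen-words split can be constructed as sketched below (function and field names are assumptions; only the counts follow the description above):</p>
      <preformat>
import random
from collections import Counter

def unseen_words_split(instances, n_objects=25, top_k=100, seed=0):
    """Hold out every instance containing any of 25 objects sampled from the 100 most frequent."""
    counts = Counter()
    for ins in instances:
        counts[ins["subject"]] += 1
        counts[ins["obj"]] += 1
    top = [w for w, _ in counts.most_common(top_k)]
    held_out = set(random.Random(seed).sample(top, n_objects))
    test = [ins for ins in instances
            if ins["subject"] in held_out or ins["obj"] in held_out]
    train = [ins for ins in instances
             if ins["subject"] not in held_out and ins["obj"] not in held_out]
    return train, test
</preformat>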
      <p>Data pre-processing. Coordinates are normalized by
image width and height. Since right/left depends only on the
camera viewpoint, we get rid of this arbitrariness by
mirroring the image when the Object is on the left of the Subject.</p>
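      <p>A minimal sketch of this pre-processing, assuming boxes are given as (center_x, center_y, width, height) in pixels:</p>
      <preformat>
def preprocess(subj_box, obj_box, img_w, img_h):
    """Normalize coordinates by image size and mirror so the Object never lies left of the Subject."""
    # Re-scale to [0, 1] by image width and height.
    sx, sy = subj_box[0] / img_w, subj_box[1] / img_h
    sw, sh = subj_box[2] / img_w, subj_box[3] / img_h
    ox, oy = obj_box[0] / img_w, obj_box[1] / img_h
    ow, oh = obj_box[2] / img_w, obj_box[3] / img_h
    # Mirror horizontally when the Object is on the left of the Subject.
    if sx > ox:
        sx, ox = 1.0 - sx, 1.0 - ox
    return (sx, sy, sw, sh), (ox, oy, ow, oh)
</preformat>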
      <p>Evaluation metrics. We use standard regression
metrics: (i) Mean Squared Error (MSE) between predicted
and true Object center and size. (ii) Coefficient of
Determination (R2) of model predictions and ground truth.
(iii) Pearson Correlation (r) between predicted and
true x-component of the Object center, and similarly for
the y-component. We also consider the classification of
above/below relative locations of the Object w.r.t. the Subject.
We report (macro averaged) F1 (F1y) and accuracy (accy).
</p>
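      <p>These metrics can be computed with standard tooling; the sketch below assumes NumPy, SciPy, and scikit-learn, with predictions and ground truth given as arrays of Object centers and sizes (the above/below convention follows image coordinates, where smaller y means higher in the image):</p>
      <preformat>
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score, f1_score, accuracy_score

def evaluate(pred, true, subj_cy):
    """pred, true: (n, 4) arrays of Object (center_x, center_y, width, height); subj_cy: (n,) Subject center_y."""
    pred, true = np.asarray(pred), np.asarray(true)
    mse = mean_squared_error(true, pred)
    r2 = r2_score(true, pred)
    r_x = pearsonr(pred[:, 0], true[:, 0])[0]   # Pearson r on the x-component of the center
    r_y = pearsonr(pred[:, 1], true[:, 1])[0]   # Pearson r on the y-component of the center
    # Above/below classification of the Object w.r.t. the Subject (above = smaller y).
    pred_above = subj_cy > pred[:, 1]
    true_above = subj_cy > true[:, 1]
    f1_y = f1_score(true_above, pred_above, average="macro")
    acc_y = accuracy_score(true_above, pred_above)
    return mse, r2, r_x, r_y, f1_y, acc_y
</preformat>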
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We test the following model variations. EMB denotes a model
that uses pre-trained word embeddings, namely 300-d GloVe
        <xref ref-type="bibr" rid="ref7">(Pennington, Socher, and
Manning 2014)</xref>
        (http://nlp.stanford.edu/projects/glove); RND a model with
random normal embeddings; 1H a model with one-hot
embeddings; and ctrl a baseline that outputs random normal predictions.
Overall, the preliminary results outlined below look promising.</p>
      <sec id="sec-3-1">
        <title>Quantitative results</title>
        <p>
Evaluation with raw data. Table 1 shows that all methods
perform well on the Raw set. Remarkably, we see that
relative locations can be predicted from implicit spatial language
at least as accurately as from explicit spatial language.
Unseen combinations. All models perform well on unseen
combinations (table not shown), remarkably close to their
performance with seen combinations.
      </p>
        <p>Unseen words. In contrast, large differences in performance
are observed with unseen words (table not shown), where the
model that uses pre-trained embeddings (EMB) performs significantly
better than the rest.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Qualitative results</title>
        <p>[Figure 2: Predicted spatial templates (heat maps) for (person, holding, cat), (man, following, elephant), (person, riding, elephant), (man, flying, kite), (man, holding, kite), and (man, walking, dog).]</p>
        <p>
Heat maps in Fig. 2 show regions of predicted high (red) and
low (blue) probability. The “heat” of the objects is assumed
to be normally distributed with µ equal to the object’s center
and σ equal to the object’s size. The EMB model is able to infer
both relative locations and sizes, e.g., correctly predicting the
size of a “cat” relative to a “person” even though the model
has never seen a “cat” before. Notably, the model learns to
compose the triplet as a whole, distinguishing, e.g., (man,
flying, kite) from (man, holding, kite).</p>
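        <p>Such heat maps can be rendered by placing an axis-aligned Gaussian at each predicted box; a sketch assuming NumPy, with boxes as normalized (center_x, center_y, width, height):</p>
        <preformat>
import numpy as np

def heat_map(box, grid=200):
    """2D Gaussian 'heat' for a box: mean at the box center, std given by the box size."""
    cx, cy, w, h = box
    xs = np.linspace(0.0, 1.0, grid)
    ys = np.linspace(0.0, 1.0, grid)
    x, y = np.meshgrid(xs, ys)
    # Independent Gaussians along x and y.
    return np.exp(-0.5 * (((x - cx) / w) ** 2 + ((y - cy) / h) ** 2))
</preformat>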
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has been supported by the CHIST-ERA EU project
MUSTER (http://www.chistera.eu/projects/muster).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Collell</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Van Gool</surname>, <given-names>L.</given-names></string-name>; and
          <string-name><surname>Moens</surname>, <given-names>M.-F.</given-names></string-name>.
          <year>2018</year>.
          <article-title>Acquiring common sense spatial knowledge through implicit spatial templates</article-title>.
          <source>AAAI</source>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Krishna</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Groth</surname>, <given-names>O.</given-names></string-name>;
          <string-name><surname>Johnson</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Hata</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Kravitz</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Kalantidis</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>L.-J.</given-names></string-name>;
          <string-name><surname>Shamma</surname>, <given-names>D. A.</given-names></string-name>; et al.
          <year>2017</year>.
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>.
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>(<issue>1</issue>):<fpage>32</fpage>-<lpage>73</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks</article-title>
          .
          <source>In CVPR</source>
          ,
          <fpage>2984</fpage>
          -
          <lpage>2993</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Logan</surname>
            ,
            <given-names>G. D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sadler</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>A computational analysis of the apprehension of spatial relations</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A pooling approach to modelling spatial relations for image retrieval and annotation</article-title>
          .
          <source>arXiv preprint arXiv:1411.5190</source>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Moratz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tenbrink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations</article-title>
          .
          <source>Spatial Cognition and Computation</source>
          <volume>6</volume>(<issue>1</issue>):
          <fpage>63</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; <string-name><surname>Socher</surname>, <given-names>R.</given-names></string-name>; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>14</volume>
          ,
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Shiang</surname>, <given-names>S.-R.</given-names></string-name>;
          <string-name><surname>Rosenthal</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Gershman</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Carbonell</surname>, <given-names>J.</given-names></string-name>; and
          <string-name><surname>Oh</surname>, <given-names>J.</given-names></string-name>.
          <year>2017</year>.
          <article-title>Vision-language fusion for object recognition</article-title>.
          <source>AAAI</source>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>