Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*

Guillem Collell and Marie-Francine Moens
Computer Science Department, KU Leuven
gcollell@kuleuven.be; sien.moens@cs.kuleuven.be

* The reader may refer to a full paper (Collell, Van Gool, and Moens 2018) that resulted from the preliminary studies presented in this abstract.

1 Motivation

Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates or regions of acceptability for two objects under an explicit spatial preposition such as "left" or "below" (Logan and Sadler 1996). Contrary to previous work that defines spatial templates for explicit spatial language only (Malinowski and Fritz 2014; Moratz and Tenbrink 2006), we extend this concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (e.g., "dog under table") but only implicitly (e.g., "girl riding horse"). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverages common sense spatial knowledge to solve tasks such as visual paraphrasing (Lin and Parikh 2015) or object labeling (Shiang et al. 2017) does not aim to predict (unseen) spatial configurations.

Here, we propose the task of predicting the relative spatial locations of two objects given a textual input of the form (Subject, Relationship, Object). We report on initial experiments with a simple neural network model with distance-based supervision, learned from annotated images, that obtains promising performance. Crucially, we show that the model can reliably predict templates of unseen combinations, e.g., predicting (man, riding, elephant) without having seen such a scene before. Furthermore, by leveraging word embeddings of objects and relationships, the model can correctly predict spatial templates for unseen words. E.g., without having ever seen "boots" before but only "sandals", the model correctly predicts the template of (person, wearing, boots) by inferring that, since "boots" are similar to "sandals", they must be worn at the same position on the "person"'s body. Hence, the model is able to leverage the learned common sense spatial knowledge to generalize to unseen objects.

Figure 1: Overview of our model and setting. (Pipeline: pre-processing re-scales coordinates and mirrors the image if needed; the input text (Subj., Rel., Obj.), e.g., "man flying kite", is mapped to embeddings, concatenated with the Subject's center and size, composed, and used to predict the target Object's center and size.)

2 Proposed task and model

2.1 Proposed task

We propose the task of predicting the 2D relative spatial arrangement of two objects under a relationship, given a structured text input of the form (Subject, Relationship, Object), abbreviated as (S, R, O). More precisely, the model predicts the Object's box center and box size (output) given the structured text input (S, R, O) plus the center and size of the Subject's box (Fig. 1).

2.2 Proposed model

We employ a feed-forward network with embeddings (Fig. 1). The embedding layer maps the input words (S, R, O) to their d-dimensional representations. The embeddings are then concatenated with the Subject's box center and size. This vector is fed into a fully connected layer that composes S, R, and O into a joint representation. Model predictions (Object's center and size) are evaluated against the ground truth with a mean squared error (MSE) loss.
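For concreteness, the sketch below illustrates one way the described architecture could be implemented in PyTorch; the hidden-layer size, the ReLU nonlinearity, the vocabulary size, and all variable names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the described model. Only the input/output structure
# (word embeddings for S, R, O concatenated with the Subject's box center
# and size, predicting the Object's center and size) follows the text above;
# hidden size, nonlinearity and names are assumptions.
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=100):
        super().__init__()
        # Embedding layer shared by Subject, Relationship and Object words.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Input: 3 word embeddings + Subject box center (2) + size (2).
        self.compose = nn.Linear(3 * emb_dim + 4, hidden_dim)
        # Output: Object box center (x, y) and size (w, h).
        self.out = nn.Linear(hidden_dim, 4)

    def forward(self, s, r, o, subj_box):
        # s, r, o: word indices of shape (batch,); subj_box: (batch, 4).
        x = torch.cat([self.emb(s), self.emb(r), self.emb(o), subj_box], dim=-1)
        return self.out(torch.relu(self.compose(x)))

model = SpatialTemplateNet(vocab_size=10000)
loss_fn = nn.MSELoss()  # trained against the ground-truth Object box
```

The only components fixed by the description above are the embedding lookup, the concatenation with the Subject's normalized box, the fully connected composition layer, and the MSE objective; everything else in the sketch is a placeholder choice.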
3 Experimental setup

Data. We use the Visual Genome dataset (Krishna et al. 2017), which has ∼108K images containing ∼1.5M human-annotated (S, R, O) instances with corresponding object boxes. We filter out all instances with explicit spatial prepositions, yielding ∼378K implicit (S, R, O) instances.

Evaluation sets. We evaluate performance on the following subsets of Visual Genome. (i) Raw set: simply the unfiltered instances. (ii) Unseen words: we randomly pick 25 objects (e.g., "woman", "apple", etc.) among the 100 most frequent ones and leave out from the training data all instances (∼130K) containing any of these words; this set is used for testing. (iii) Unseen combinations: we randomly pick 100 combinations (S, R, O) among the 1,000 most frequent implicit ones and leave them out of training. We finally consider the explicit version of the Raw set. Reported results are always on unseen instances, although the combinations (S, R, O) may have been seen during training (e.g., in different images).

Data pre-processing. Coordinates are normalized by image width and height. Since right/left depends only on the camera viewpoint, we remove this arbitrariness by mirroring the image whenever the Object is on the left of the Subject.

Evaluation metrics. We use standard regression metrics: (i) mean squared error (MSE) between the predicted and true Object center and size; (ii) coefficient of determination (R²) of model predictions and ground truth; (iii) Pearson correlation (r) between the predicted and true x-component of the Object center, and similarly for the y-component. We also consider the classification of above/below relative locations of the Object w.r.t. the Subject, reporting (macro-averaged) F1 (F1_y) and accuracy (acc_y).
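A minimal sketch of how these metrics could be computed with standard library routines follows (NumPy, SciPy and scikit-learn); the array layout, the variable names, and the above/below rule based on the y-coordinate are illustrative assumptions.

```python
# Sketch of the evaluation metrics. Assumes pred and true are NumPy arrays of
# shape (n, 4) holding Object center x, center y, width, height (normalized),
# and subj_center_y holds the Subject's center y per instance. "Above" is
# taken as a smaller y in image coordinates; this convention is an assumption.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score

def evaluate(pred, true, subj_center_y):
    mse = mean_squared_error(true, pred)
    r2 = r2_score(true, pred)
    r_x = pearsonr(pred[:, 0], true[:, 0])[0]  # x-component of Object center
    r_y = pearsonr(pred[:, 1], true[:, 1])[0]  # y-component of Object center
    # Above/below classification of the Object w.r.t. the Subject's center.
    pred_above = pred[:, 1] < subj_center_y
    true_above = true[:, 1] < subj_center_y
    acc_y = accuracy_score(true_above, pred_above)
    f1_y = f1_score(true_above, pred_above, average="macro")
    return mse, r2, r_x, r_y, acc_y, f1_y
```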
4 Results

We test the following model variations: EMB denotes a model that uses pre-trained 300-d GloVe word embeddings (Pennington, Socher, and Manning 2014; http://nlp.stanford.edu/projects/glove), RND a model with random normal embeddings, 1H a model with one-hot embeddings, and ctrl a control that outputs random normal predictions. Overall, the preliminary results outlined below look promising.

4.1 Quantitative results

Evaluation with raw data. Table 1 shows that all methods perform well on the Raw data. Remarkably, we see that relative locations can be predicted from implicit spatial language at least as accurately as from explicit spatial language.

Unseen combinations. All models perform well on unseen combinations (table not shown), remarkably close to their performance with seen combinations.

Unseen words. Contrarily, large differences in performance are observed with unseen words (table not shown), where the model that uses embeddings (EMB) performs significantly better than the rest.

Table 1: Results on implicit and explicit relations.

                 MSE     R²      acc_y   F1_y    r_x     r_y
Implicit  EMB    0.008   0.705   0.756   0.755   0.894   0.834
          RND    0.008   0.691   0.750   0.750   0.891   0.826
          1H     0.008   0.717   0.762   0.762   0.896   0.842
          ctrl   0.054   -1.000  0.522   0.521   0.000   -0.001
Explicit  EMB    0.013   0.586   0.768   0.770   0.811   0.823
          RND    0.013   0.580   0.767   0.769   0.808   0.815
          1H     0.012   0.604   0.778   0.780   0.815   0.828
          ctrl   0.060   -1.000  0.633   0.630   0.000   0.000

4.2 Qualitative evaluation (spatial templates)

Heat maps in Fig. 2 show regions of predicted high (red) and low (blue) probability. The "heat" of the objects is assumed to be normally distributed, with µ equal to the object's center and σ to the object's size. The EMB model is able to infer both relative locations and sizes, e.g., correctly predicting the size of a "cat" relative to a "person" even though the model has never seen a "cat" before. Notably, the model learns to compose the triplet as a whole, distinguishing, e.g., (man, flying, kite) from (man, holding, kite).

Figure 2: Predictions by the model that leverages word embeddings (EMB). Top: predictions for unseen words (underlined in the figure): (person, holding, cat), (man, following, elephant), (person, riding, elephant). Bottom: predictions for unseen triplets: (man, flying, kite), (man, holding, kite), (man, walking, dog).

Acknowledgments

This work has been supported by the CHIST-ERA EU project MUSTER (http://www.chistera.eu/projects/muster).

References

Collell, G.; Van Gool, L.; and Moens, M.-F. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. In AAAI.

Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73.

Lin, X., and Parikh, D. 2015. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2984–2993.

Logan, G. D., and Sadler, D. D. 1996. A computational analysis of the apprehension of spatial relations.

Malinowski, M., and Fritz, M. 2014. A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv preprint arXiv:1411.5190.

Moratz, R., and Tenbrink, T. 2006. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition and Computation 6(1):63–107.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.

Shiang, S.-R.; Rosenthal, S.; Gershman, A.; Carbonell, J.; and Oh, J. 2017. Vision-language fusion for object recognition. In AAAI.