<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Guillem</forename><surname>Collell</surname></persName>
							<email>gcollell@kuleuven.be</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department KU Leuven</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marie-Francine</forename><surname>Moens</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department KU Leuven</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Learning Visually Grounded Common Sense Spatial Knowledge for Implicit Spatial Language*</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F7A495C20A2C7DE1BD9B5A002E6B383A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Motivation</head><p>Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates or regions of acceptability for two objects under an explicit spatial preposition such as "left" or "below" <ref type="bibr" target="#b3">(Logan and Sadler 1996)</ref>. Contrary to previous work that define spatial templates for explicit spatial language only (Malinowski and Fritz 2014; <ref type="bibr" target="#b4">Moratz and Tenbrink 2006)</ref>, we extend such concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (e.g., "dog under table") but only implicitly (e.g., "girl riding horse"). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverage common sense spatial knowledge to solve tasks such as visual paraphrasing <ref type="bibr" target="#b2">(Lin and Parikh 2015)</ref> or object labeling <ref type="bibr" target="#b6">(Shiang et al. 2017)</ref> do not aim to predict (unseen) spatial configurations.</p><p>Here, we propose the task of predicting the relative spatial locations of two objects given a textual input of the form (Subject, Relationship, Object). We report on initial experiments with a simple neural network model with distancebased supervision learned in annotated images that obtains promising performance. Crucially, we show that the model can reliably predict templates of unseen combinations, e.g., predicting (man, riding, elephant) without having seen such scene before. Furthermore, by leveraging word embeddings of objects and relationships, the model can correctly predict spatial templates for unseen words. E.g., without having ever seen "boots" before but only "sandals", the model predicts correctly the template of (person, wearing, boots) by inferring that, since "boots" are similar to "sandals", they must be worn at the same position of the "person"'s body. Hence, the model is able to leverage the learned common sense spatial knowledge to generalize to unseen objects.</p><p>*The reader may refer to a full paper <ref type="bibr" target="#b0">(Collell, Van Gool, and Moens 2018</ref>) that resulted from the preliminary studies presented in this abstract. 2 Proposed task and model</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Proposed task</head><p>We propose the task of predicting the 2D relative spatial arrangement of two objects under a relationship given a structured text input of the form (Subject, Relationship, Object)abbreviated as (S, R, O). More precisely, the model predicts the Object's box center and box size (output) given the structured text input (S, R, O) plus the center and size of the Subject's box (Fig. <ref type="figure" target="#fig_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Proposed model</head><p>We employ a feed forward network with embeddings (Fig. <ref type="figure" target="#fig_0">1</ref>). The embedding layer maps the input words (S,R,O) to their d-dimensional representations. The embeddings are then concatenated with the Subject's box center and size. This vector is then fed into a fully connected layer to compose S, R, O into a joint representation. model predictions (Object's center and size) are evaluated against ground truth with a mean squared error (MSE) loss.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental setup</head><p>Data. We use the Visual Genome <ref type="bibr" target="#b1">(Krishna et al. 2017)</ref> dataset, which has ∼108K images containing ∼1,5M human-annotated (S, R, O) instances with corresponding object boxes. We filter out all instances with explicit spatial prepositions, yielding ∼378K implicit (S, R, O) instances.  (iii) Pearson Correlation (r) between predicted and true x-component of the Object center, and similarly for the y-component. We also consider the classification of above/below relative locations of the Object w.r.t. the Subject. We report (macro averaged) F1 (F1 y ) and accuracy (acc y ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>We test the following model variations. EMB denotes a model that uses pre-trained word embeddings 1 , RND a model with random normal embeddings, 1H employs one-hot embeddings and ctrl outputs random normal predictions. Overall, the preliminary results outlined below look promising.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Quantitative results</head><p>Evaluation with raw data. Table <ref type="table" target="#tab_0">1</ref> shows that all methods perform well in the Raw data. Remarkably, we see that relative locations can be predicted from implicit spatial language at least as accurately as from explicit spatial language. Unseen combinations. All models perform well on unseen combinations (table not shown), remarkably closely to their  Qualitative evaluation (spatial templates)</p><p>Heat maps in Fig. <ref type="figure" target="#fig_2">2</ref> show regions of predicted high (red) and low (blue) probability. The "heat" of the objects is assumed to be normally distributed with µ equal to the object's center and σ to the object's size. The EMB model is able to infer both, relative locations and sizes, e.g., predicting correctly the size of a "cat" relative to a "person" even though the model has never seen a "cat" before. Notably, the model learns to compose the triplet as a whole, distinguishing, e.g., (man, flying, kite) from (man, holding, kite).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of our model and setting.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Evaluation sets. We evaluate performance in the following subsets of Visual Genome. (i) Raw set: Simply the unfiltered instances. (ii) Unseen words: We randomly pick 25 objects (e.g., "woman", "apple", etc.) among the 100 most frequent ones and leave out from the training data all the instances (∼130K) containing any of these words. This set is used for testing. (iii) Unseen combinations: We randomly pick 100 combinations (S, R, O) among the 1,000 most frequent implicit ones and leave them out for training. We finally consider the explicit version of the Raw set. Reported results are always on unseen instances-yet the combinations (S, R, O) may have been seen during training (e.g., in different images). Data pre-processing. Coordinates are normalized by image width and height. Since right/left depends only on the camera viewpoint, we get rid of this arbitrariness by mirroring the image when the Object is on the left of the Subject. Evaluation metrics. We use standard regression metrics: (i) Mean Squared Error (MSE) between predicted and true Object center and size. (ii) Coefficient of Determination (R 2 ) of model predictions and ground truth.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Predictions by the model that leverages word embeddings (EMB). Top: Predictions in unseen words (underlined). Bottom: Predictions in unseen triplets.</figDesc><graphic coords="2,405.25,209.21,66.67,66.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results on implicit and explicit relations.</figDesc><table><row><cell></cell><cell>MSE</cell><cell>R 2</cell><cell>acc y</cell><cell>F1 y</cell><cell>r x</cell><cell>r y</cell></row><row><cell>Implicit</cell><cell cols="6">EMB 0.008 0.705 0.756 0.755 0.894 0.834 RND 0.008 0.691 0.750 0.750 0.891 0.826 1H 0.008 0.717 0.762 0.762 0.896 0.842 ctrl 0.054 -1.000 0.522 0.521 0.000 -0.001</cell></row><row><cell>Explicit</cell><cell cols="6">EMB 0.013 0.586 RND 0.013 0.580 1H 0.012 0.604 ctrl 0.060 -1.000 0.633 0.630 0.000 0.000 0.768 0.770 0.811 0.823 0.767 0.769 0.808 0.815 0.778 0.780 0.815 0.828</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Unseen Words. Contrarily, large differences in performance are observed with unseen words (table not shown) where the model that uses embeddings (EMB) performs significantly better than the rest.</figDesc><table><row><cell>person, holding,</cell><cell>cat</cell><cell cols="2">man, following, elephant person, riding, elephant</cell></row><row><cell cols="2">man, flying, kite</cell><cell>man, holding, kite</cell><cell>man, walking, dog</cell></row></table><note>1 We use 300-d GloVe embeddings<ref type="bibr" target="#b5">(Pennington, Socher, and Manning 2014)</ref> http://nlp.stanford.edu/projects/glove. performance with seen combinations.</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">http://www.chistera.eu/projects/muster</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by the CHIST-ERA EU project MUSTER. 2   </p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Acquiring common sense spatial knowledge through implicit spatial templates</title>
		<author>
			<persName><forename type="first">G</forename><surname>Collell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Gool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AAAI</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Visual genome: Connecting language and vision using crowdsourced dense image annotations</title>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kravitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kalantidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Shamma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">123</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="32" to="73" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Don&apos;t just listen, use your imagination: Leveraging visual common sense for non-visual tasks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2984" to="2993" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A pooling approach to modelling spatial relations for image retrieval and annotation</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">D</forename><surname>Logan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Sadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Malinowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fritz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.5190</idno>
		<imprint>
			<date type="published" when="1996">1996. 2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>A computational analysis of the apprehension of spatial relations</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations</title>
		<author>
			<persName><forename type="first">R</forename><surname>Moratz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tenbrink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Spatial Cognition and computation</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="63" to="107" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Glove: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EMNLP</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Vision-language fusion for object recognition</title>
		<author>
			<persName><forename type="first">S.-R</forename><surname>Shiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gershman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Oh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AAAI</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
