<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kyra Ahrens</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Kerzel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jae Hee Lee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cornelius Weber</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Wermter</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Reasoning to solve complex spatial tasks like grounding directional relations in an intrinsic frame of reference can be decomposed into a set of subtasks that are hierarchically organized. Consider two objects o1 and o2 in an image, where each of the objects has a clear front side and orientation. Learning to answer whether the triple (o1, r, o2) holds for a given directional relation r in a frame of reference that is intrinsic to o2 spans the following stages (see Fig. 1 for an example; a code sketch of the underlying geometry follows the list):
1. Both the target object and the reference object have to be recognized in the image (existence prediction). In other words, an agent must initially be capable of answering questions such as “Is o1 in the image?” or “Is o2 in the image?”.
2. Next, the object’s pose that defines the relative relation has to be discerned, enabling an agent to successfully respond to questions such as “What is the cardinal direction of o2?” (orientation prediction).
3. Predicting the directional relation using the intrinsic frame of reference is learned by combining the two preceding competencies, allowing an agent to answer a question similar to “What is the relation between o1 and o2 from the perspective of o2?” (relation prediction). Likewise, predicting which target object is in a specific relation to some reference object (link prediction) can be answered, e.g., “Taking o2’s perspective, which object is in relation r to it?”.
4. Based on all previous stages, an agent can determine whether a specific directional relationship exists between the two objects (triple classification), thus successfully providing an answer to a question like “From o2’s perspective, is o1 left of o2?”.</p>
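      <p>To make the geometry behind stages 3 and 4 concrete, the following minimal Python sketch decides a triple symbolically from ground-truth positions and orientations. It is our illustration only: the function names and the four-sector binning are assumptions, not code from the GRiD-A-3D release.</p>
      <preformat>
# Illustrative sketch: classify the direction of a target object in the
# intrinsic frame of a reference object, then decide triple classification.
import math

DIRECTIONS = ["front", "left", "behind", "right"]  # counter-clockwise bins

def relative_direction(target_xy, reference_xy, reference_heading):
    """Direction of the target in the reference object's intrinsic frame;
    reference_heading is the angle (in radians) the reference faces."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Angle to the target, rotated into the reference object's frame.
    angle = math.atan2(dy, dx) - reference_heading
    # Shift by 45 degrees so each bin is centered on its axis, then cut
    # the full circle into four 90-degree sectors.
    sector = int(((angle + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
    return DIRECTIONS[sector]

def triple_holds(target_xy, reference_xy, reference_heading, relation):
    # Triple classification: does (o1, r, o2) hold for relation r?
    return relative_direction(target_xy, reference_xy, reference_heading) == relation
      </preformat>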
      <sec id="sec-1-3">
        <title>In previous work [1], we showed that enabling a VQA</title>
        <p>tial bias that confounds the analysis of reasoning about
relative directions.</p>
      <p>
        In the present work, we introduce GRiD-A-3D, a novel and simplified diagnostic VQA dataset, which allows for a more efficient and targeted analysis of the corresponding reasoning process by removing possible biases from using real-world objects. Subsequently, we report the performance of the two established end-to-end VQA models MAC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and FiLM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on this dataset. With our experiments, we show that, when trained on GRiD-A-3D, both models exhibit a qualitative learning behavior similar to that of their replicas trained on the more complex, non-abstract GRiD-3D [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] dataset. At the same time, training converges up to three times faster, thus allowing more efficient neural experiments.
      </p>
      <p>We summarize the contributions made in this paper as follows:
• We complement our GRiD-3D benchmark suite (https://github.com/knowledgetechnologyuhh/grid-3d) with a novel GRiD-A-3D (Grounding Relative Directions with Abstract objects in 3D) dataset that enables a faster and less biased evaluation of spatial reasoning behavior in VQA compared with the original GRiD-3D dataset.
• We verify our previous research findings with the new dataset, thus underpinning our hypothesis that multi-task learning enables neural models to learn to ground relative directions in VQA.
• Furthermore, we add evidence to our hypothesis that during multi-task learning, the spatial reasoning abilities of a neural model develop along the intuitive order of the corresponding subtasks, thus forming an implicit curriculum.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Aiming to provide a suitable setup to assess the reasoning capabilities of neural models on vision-language tasks, diagnostic datasets have been introduced [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. One of the major advantages of such datasets is that they provide structured and tightly controlled scenes to prevent models from circumventing reasoning by exploiting conditional biases that commonly arise with real-world images. A particular advantage of diagnostic datasets based on synthetic images is that their generation process is scalable, customizable, and therefore allows for a more fine-grained performance analysis.
      </p>
      <p>
        The vast majority of diagnostic VQA datasets is limited to spatial reasoning tasks based on the absolute frame of reference, i.e., object positions are relative to the viewer of the image. Yet, taking into account more realistic scenarios such as multi-agent dialogue in a situated environment, understanding relative directions is a prerequisite for meaningful communication. As a consequence, early models to learn symbolic reasoning with relative directions have been proposed [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. However, they inherently assume the availability of scene annotations in terms of object labels and spatial relations instead of requiring a model to infer such information implicitly.
      </p>
      <p>
        An early synthetic dataset providing a test bed for grounding relative directions is Rel3D [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Since Rel3D is restricted to two objects per scene and one single task, i.e., binary prediction of (object1, relation, object2) triples, GRiD-3D [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was introduced, which combines the advantage of a rich number of tasks and questions as found in traditional synthetic VQA datasets with the challenge of grounding relative directions.
      </p>
      <p>
        GRiD-3D is the first dataset of its kind to target multi-task learning of relative directions in a controlled setting. With this dataset, it was shown that, before learning how to answer the question whether a triple (object1, relation, object2) holds, neural end-to-end VQA models rely on an implicit curriculum of related subtasks such as object detection, orientation estimation, and relation prediction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Objects in GRiD-3D cover a variety of categories, ranging from humanoids and animals to furniture and vehicles. Naturally, such objects differ in terms of proportions, complexity, and, most importantly, symmetry, which can be a crucial determinant of how easily a neural network can infer their orientation (and perform associated tasks).
      </p>
      <p>In this work, we aim to provide a variation of the original dataset that ensures the elimination of such potential distortions (see Fig. 3 for examples), enabling a model to more quickly learn how to ground relative directions, which may be of particular value for few-shot, transfer, and curriculum learning scenarios. Accordingly, we extend the GRiD-3D benchmark suite towards another diagnostic VQA dataset with abstract objects.</p>
    </sec>
    <sec id="sec-3">
      <title>3. GRiD-A-3D Abstract VQA</title>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        With the introduction of the GRiD-3D dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we
could show that neural VQA models are capable of
grounding relative directions by implicitly deriving a
curriculum of subtasks. In order to further generalize the
previous findings, we extend our GRiD-3D suite towards
a diagnostic dataset based on abstract objects whose
cardinal direction is indicated by colored arrows.
      </p>
      <p>Overview and statistics. With our new GRiD-A-3D dataset, we address the following six tasks: Existence Prediction, Orientation Prediction, Link Prediction, Relation Prediction, Counting, and Triple Classification. All 8 000 rendered images are split without overlap into 6 400 for training, 800 for validation, and 800 for testing. The 432 948 corresponding input questions follow largely the same 80:10:10 ratio, yielding 346 984, 43 393, and 42 571 questions for each set, respectively. The GRiD-A-3D dataset has an order of magnitude comparable with the GRiD-3D dataset, both in terms of image and question counts.</p>
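      <p>As a purely illustrative sanity check of the quoted statistics, the split sizes and ratios can be verified with a few lines of Python:</p>
      <preformat>
# Dataset statistics quoted above: images and questions per split.
images = {"train": 6400, "val": 800, "test": 800}
questions = {"train": 346984, "val": 43393, "test": 42571}

assert sum(images.values()) == 8000
assert sum(questions.values()) == 432948

# The question splits follow roughly the same 80:10:10 ratio as the images.
for split, n in questions.items():
    print(split, round(n / sum(questions.values()), 3))
# train 0.801 / val 0.1 / test 0.098
      </preformat>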
      <p>Image generation. For each image, we generate a scene by randomly placing three to five distinct objects onto a plane and render the corresponding image with 480x320 pixel resolution via Blender (https://www.blender.org/). We choose a consistent lighting setup across all images, add shadows to each object, and restrict the image generation to a fixed camera angle, thus obtaining one image per scene.</p>
      <p>Our object set comprises gray-coloured polygonal prisms approximating a cylinder shape, each marked with an arrow in one of six different colours: three primary colours (red, blue, and green) and three additive secondary colours (yellow, cyan, and magenta). The tip of each arrow depicts the object’s front side, allowing for distinct relative directions between objects in the image. An overview of all six objects can be found in Fig. 4. Note that the overall object count in the original GRiD-3D dataset is 28, whereas GRiD-A-3D is restricted to six different objects.</p>
      <p>Figure 4: The six abstract objects used in the GRiD-A-3D dataset.</p>
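      <p>A scene-generation step of this kind could be scripted against the Blender Python API (bpy) roughly as follows. This is a hedged sketch of the procedure described above, not the authors’ pipeline: mesh parameters, the arrow markers, and the output path are placeholders.</p>
      <preformat>
# Hypothetical sketch of the scene generation using the Blender Python API.
import random
import bpy

scene = bpy.context.scene
scene.render.resolution_x = 480      # 480x320 pixel resolution
scene.render.resolution_y = 320

for i in range(random.randint(3, 5)):    # three to five objects per scene
    # Approximate a cylinder with a low-vertex polygonal prism; collision
    # checks for distinct, non-overlapping placement are omitted here.
    bpy.ops.mesh.primitive_cylinder_add(
        vertices=12,
        radius=0.5,
        depth=1.0,
        location=(random.uniform(-3, 3), random.uniform(-3, 3), 0.5),
    )
    obj = bpy.context.active_object
    # Random intrinsic orientation; the colored arrow marking the front
    # side would be attached as a child mesh in the actual pipeline.
    obj.rotation_euler = (0.0, 0.0, random.uniform(0.0, 2 * 3.14159))

scene.render.filepath = "/tmp/scene.png"  # placeholder output path
bpy.ops.render.render(write_still=True)   # fixed camera and lighting setup
      </preformat>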
      <p>
        Question generation. In addition to rendering the images from our sampled scenes, we obtain scene graphs equipped with ground truth information such as the absolute position, orientation, and relative directions of objects, which we use to generate questions related to the six tasks contained in GRiD-A-3D. Our question generation builds upon the framework provided with CLEVR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], whose question templates, synonym, and metadata files we tailor to our dataset. Likewise, our question generation pipeline is expressed as a template-based functional program executed on each scene graph.
      </p>
      <p>We follow the depth-first search strategy to determine and instantiate question-answer pairs that comply with the scene information and can therefore be considered valid. We set additional constraints to make sure that answers are uniformly distributed for each task. To ensure a wide variety of natural language questions, we sample from a rich set of differently phrased question templates for each reasoning task and randomly omit utterances or replace words with suitable synonyms.</p>
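      <p>The following sketch illustrates template-based instantiation with a uniform answer distribution for the Existence Prediction task. Template strings, scene-graph fields, and the cap mechanism are hypothetical stand-ins for the tailored CLEVR machinery, not the released code.</p>
      <preformat>
# Illustrative template-based question generation with uniform answers.
import random
from collections import Counter

TEMPLATES = [
    "Is there a {color} object in the image?",
    "Is a {color} object visible?",   # differently phrased variant
]
COLORS = ["red", "blue", "green", "yellow", "cyan", "magenta"]

def existence_questions(scene_graphs, per_answer_cap=1000):
    counts = Counter()
    questions = []
    for scene in scene_graphs:
        present = {obj["color"] for obj in scene["objects"]}
        color = random.choice(COLORS)
        answer = "yes" if color in present else "no"
        if counts[answer] == per_answer_cap:   # keep answers uniform
            continue
        counts[answer] += 1
        template = random.choice(TEMPLATES)    # sample phrasing variants
        questions.append((template.format(color=color), answer))
    return questions
      </preformat>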
    </sec>
    <sec id="sec-4">
      <title>4. Evaluations</title>
      <p>
        For our experiments, we train MAC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and FiLM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], two state-of-the-art neural end-to-end VQA architectures, on our new GRiD-A-3D dataset (cf. Fig. 5). Both architectures take raw RGB images and plain text question-answer pairs as input for training. Image features are extracted by a pretrained ResNet101 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for both models, while questions are encoded by a GRU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (FiLM) or a bidirectional LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (MAC), respectively. Subsequently, image and question features are fed to special neural units called residual blocks (FiLM) or MAC cells (MAC). A chain of such units provides the core of the reasoning process.
      </p>
      <p>Figure 5: Neural end-to-end VQA models used for our experiments. The generic units FiLM and MAC (here colored in orange and blue, respectively) control how the question and image features are being processed. (a) FiLM, (b) MAC.</p>
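      <p>For orientation, a minimal FiLM-style residual block in PyTorch could look as follows. This is a sketch in the spirit of [<xref ref-type="bibr" rid="ref3">3</xref>]: the class name, layer sizes, and normalization choice are our assumptions, not the exact published architecture.</p>
      <preformat>
# Minimal sketch of feature-wise linear modulation (FiLM) in a residual block.
import torch.nn as nn
import torch.nn.functional as F

class FiLMBlock(nn.Module):
    def __init__(self, channels, question_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels, affine=False)
        # The question encoding predicts one (gamma, beta) pair per channel.
        self.film = nn.Linear(question_dim, 2 * channels)

    def forward(self, x, q):
        gamma, beta = self.film(q).chunk(2, dim=1)
        h = self.bn(self.conv(x))
        # Scale and shift each feature map conditioned on the question.
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return x + F.relu(h)   # residual connection around the modulated path
      </preformat>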
      <p>
        We use existing PyTorch (https://pytorch.org/) implementations of FiLM and MAC with their default hyperparameters for the published CLEVR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] dataset evaluations, except for the number of MAC cells, which we reduce to four to prevent overfitting. All experiments are run for 100 epochs and repeated three times with different seeds to reduce the impact of the random initialization of the models on the results. Fig. 6 shows the mean and the standard deviation of the evaluations.
      </p>
      <p>Figure 6: Accuracy per training epoch for FiLM and MAC trained on GRiD-3D and GRiD-A-3D, plotted per task (e.g., Counting).</p>
      <p>We interpret our results in the following way: Existence and Orientation Prediction are learned earlier than the other tasks. We explain this observation with the fact that these tasks only require a model to focus on one single object.</p>
      <p>For the most straightforward task of Existence Prediction, we observe similar behavior for the two datasets: Both converge to an accuracy of almost 100% at nearly the same time. For the Orientation Prediction task, we observe convergence to an accuracy of over 80% for both datasets.</p>
      <p>Noticeably, the learning happens faster for the abstract GRiD-A-3D dataset. The shorter learning time can be attributed to the more unequivocal identification of the front and back sides of the abstract objects due to the lack of symmetry-related noise, as shown in Fig. 3. The fact that the accuracy on Orientation Prediction is capped at about 85% can be explained by objects placed close to the border between two cardinal directions, as such cases are difficult for the models to learn and classify (see the border-case sketch following this discussion).</p>
      <p>A similar learning behavior can be observed for the more complex tasks of Relation Prediction, Triple Classification, and Link Prediction, where both models converge faster when trained on GRiD-A-3D and also reach slightly higher accuracy. Similarly to the results on the Orientation Prediction task, the main reason for these observations may lie in the facilitated learning conditions due to the lack of front-back symmetries or strong occlusions with the abstract objects. This effect is most pronounced for Link Prediction, i.e., predicting which target object is in a given relation to some reference object. We attribute this observation to the smaller set of objects in the GRiD-A-3D dataset.</p>
      <p>Finally, we observe a mixed result for the Counting task: While the learning of both VQA models converges faster for the GRiD-A-3D dataset, higher accuracy is reached for the GRiD-3D dataset. We hypothesize that this higher accuracy stems from the more diverse-looking objects in the GRiD-3D dataset, facilitating the models to distinguish and thus count multiple objects in close proximity.</p>
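      <p>The border effect on Orientation Prediction can be made tangible with the same four-sector binning as in the earlier sketch: a small perturbation of a heading that lies on a bin border flips the cardinal label, so such samples carry inherently ambiguous supervision. The labels and margin below are our illustration, not values from the paper.</p>
      <preformat>
# Why near-border orientations are hard: a tiny change flips the label.
import math

CARDINALS = ["east", "north", "west", "south"]   # counter-clockwise order

def cardinal_direction(heading):
    """Bin a heading angle (radians) into one of four cardinal directions."""
    sector = int(((heading + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
    return CARDINALS[sector]

border = math.pi / 4                        # exactly between east and north
print(cardinal_direction(border - 0.01))    # east
print(cardinal_direction(border + 0.01))    # north
      </preformat>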
      <p>In summary, our results suggest the following two findings: First, the abstract GRiD-A-3D dataset leads to faster learning and can thus enable more computationally efficient experimentation while achieving results comparable to the original GRiD-3D dataset. Second, the results support our assumption of a chronology of subtasks, as Existence Prediction and Orientation Prediction are learned before the models can reason about relative directions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The authors gratefully acknowledge support from the German Research Foundation DFG for the projects CML TRR169, LeCAREbot and IDEAS.</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <sec id="sec-6-1">
        <title>This work is an extension to previous work on grounding</title>
        <p>relative directions with end-to-end neural VQA
architectures. We provide a comprehensive, simplified
GRiDA-3D dataset with abstract objects that shows similar
behavior to the original GRiD-3D dataset when learned
by the two established VQA models FiLM and MAC. With
our experiments, we show that the learning of tasks that
focus on a single object like object recognition and
orientation prediction happens prior to learning to ground
relative directions and object counting.</p>
      <p>The abstract nature of the dataset eliminates approximate front-back object symmetries that can have a negative impact on object orientation prediction and all reasoning tasks about directional relations that build upon it. Furthermore, the simplification of the object set allows for conducting experiments with a more comprehensive dataset. In future work, this will allow us to conduct fast pilot studies on curriculum and transfer learning based on the intuitive dependency of the different spatial reasoning tasks on one another and the observed implicit curriculum.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kerzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahrens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wermter</surname>
          </string-name>
          ,
          <article-title>What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning</article-title>
          , in:
          <source>Thirty-First International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Compositional attention networks for machine reasoning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Strub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <article-title>FiLM: Visual Reasoning with a General Conditioning Layer</article-title>
          ,
          <source>in: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning</article-title>
          , in:
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Moratz</surname>
          </string-name>
          , T. Tenbrink,
          <article-title>Spatial Reference in Linguistic Human-Robot Interaction: Iterative, Empirically Supported Development of a Model of Projective Relations</article-title>
          ,
          <source>Spatial Cognition &amp; Computation</source>
          <volume>6</volume>
          (
          <year>2006</year>
          ). doi:10.1207/s15427633scc0601_3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Renz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <article-title>StarVars: Effective Reasoning About Relative Directions</article-title>
          , in:
          <source>Twenty-Third International Joint Conference on Artificial Intelligence</source>
          , AAAI Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Renz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Qualitative Representation and Reasoning over Direction Relations across Different Frames of Reference</article-title>
          ,
          <source>in: Sixteenth International Conference on Principles of Knowledge Representation and Reasoning</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          ,
          <source>in: NIPS 2014 Workshop on Deep Learning</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>