=Paper=
{{Paper
|id=Vol-3190/paper4
|storemode=property
|title=Knowing Earlier What Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-task Learning
|pdfUrl=https://ceur-ws.org/Vol-3190/paper4.pdf
|volume=Vol-3190
|authors=Kyra Ahrens,Matthias Kerzel,Jae Hee Lee,Cornelius Weber,Stefan Wermter
|dblpUrl=https://dblp.org/rec/conf/ijcai/AhrensK0WW22
}}
==Knowing Earlier What Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-task Learning==
Kyra Ahrens†, Matthias Kerzel†, Jae Hee Lee†, Cornelius Weber and Stefan Wermter
University of Hamburg
† These authors contributed equally.
kyra.ahrens@uni-hamburg.de (K. Ahrens); matthias.kerzel@uni-hamburg.de (M. Kerzel); jae.hee.lee@uni-hamburg.de (J. H. Lee); cornelius.weber@uni-hamburg.de (C. Weber); stefan.wermter@uni-hamburg.de (S. Wermter)
STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria

Abstract

Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.

1. Introduction

Reasoning to solve complex spatial tasks like grounding directional relations in an intrinsic frame of reference can be decomposed into a set of hierarchically organized subtasks. Consider two objects 𝑜1 and 𝑜2 in an image, where each object has a clear front side and orientation. Learning to answer whether the triple (𝑜1, 𝑟, 𝑜2) holds for a given directional relation 𝑟 in a frame of reference that is intrinsic to 𝑜2 spans the following stages (see Fig. 1 for an example):

1. Both the target object and the reference object have to be recognized in the image (existence prediction). In other words, an agent must initially be capable of answering questions such as "Is 𝑜1 in the image?" or "Is 𝑜2 in the image?".
2. Next, the object's pose that defines the relative relation has to be discerned, enabling an agent to successfully respond to questions such as "What is the cardinal direction of 𝑜2?" (orientation prediction).
3. Predicting the directional relation using the intrinsic frame of reference is learned by combining the two preceding competencies, allowing an agent to answer a question similar to "What is the relation between 𝑜1 and 𝑜2 from the perspective of 𝑜2?" (relation prediction). Likewise, predicting which target object is in a specific relation to some reference object (link prediction) can be answered, e.g., "Taking 𝑜2's perspective, which object is in relation 𝑟 to it?".
4. Based on all previous stages, an agent can determine whether a specific directional relationship exists between the two objects (triple classification), thus successfully providing an answer to a question like "From 𝑜2's perspective, is 𝑜1 left of 𝑜2?".

Figure 1: Example of grounding relative directions, e.g., considering the green arrow's perspective, the yellow arrow is on the left in front of it.
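To make the grounding problem concrete, the following minimal Python sketch derives a relative direction from 2D object positions and the reference object's intrinsic heading. The function name, the four-way discretization, and the angle thresholds are illustrative assumptions rather than the dataset's annotation code (Fig. 1, for instance, combines two directions in its description).

```python
import math

def relative_direction(target_xy, reference_xy, reference_heading):
    """Classify where `target_xy` lies relative to `reference_xy`,
    using the reference object's intrinsic heading (radians, 0 = facing +x).
    Illustrative sketch only; GRiD-A-3D's actual relation vocabulary may differ."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Rotate the displacement into the reference object's frame,
    # so that +x points where the reference object faces and +y is its left.
    cos_h, sin_h = math.cos(-reference_heading), math.sin(-reference_heading)
    fx = dx * cos_h - dy * sin_h   # forward component
    fy = dx * sin_h + dy * cos_h   # lateral component (+ = left)
    angle = math.atan2(fy, fx)     # angle of the target in the intrinsic frame
    if -math.pi / 4 <= angle < math.pi / 4:
        return "in front of"
    if math.pi / 4 <= angle < 3 * math.pi / 4:
        return "left of"
    if -3 * math.pi / 4 <= angle < -math.pi / 4:
        return "right of"
    return "behind"

# With the reference object facing the +y direction, a target at (-2.0, 0.5)
# lies to its left in the intrinsic frame.
print(relative_direction((-2.0, 0.5), (0.0, 0.0), math.pi / 2))  # -> "left of"
```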
In previous work [1], we showed that enabling a VQA architecture to reason about relative directions is viable, provided that all of the learning stages listed above are encapsulated in corresponding subtasks, as summarized in Fig. 2. Beyond that, the following two observations were made: First, the subtasks that are found earlier in the chronology of learning stages are also learned earlier by the models, and second, this behavior is consistent across different neural end-to-end models. However, these findings are based on experiments involving images with 3D models of real objects, which may introduce a potential bias that confounds the analysis of reasoning about relative directions.

In the present work, we introduce GRiD-A-3D, a novel and simplified diagnostic VQA dataset, which allows for a more efficient and targeted analysis of the corresponding reasoning process by removing possible biases that stem from using real-world objects. Subsequently, we report the performance of the two established end-to-end VQA models MAC [2] and FiLM [3] on this dataset. With our experiments, we show that, when trained on GRiD-A-3D, both models exhibit a qualitative learning behavior similar to that of their replicas trained on the more complex, non-abstract GRiD-3D [1] dataset. At the same time, training converges up to three times faster, thus allowing more efficient neural experiments.

We summarize the contributions made in this paper as follows:

• We complement our GRiD-3D benchmark suite (https://github.com/knowledgetechnologyuhh/grid-3d) with a novel GRiD-A-3D (Grounding Relative Directions with Abstract objects in 3D) dataset that enables a faster and less biased evaluation of spatial reasoning behavior in VQA compared with the original GRiD-3D dataset.
• We verify our previous research findings with the new dataset, thus underpinning our hypothesis that multi-task learning enables neural models to learn to ground relative directions in VQA.
• Furthermore, we add evidence to our hypothesis that during multi-task learning, the spatial reasoning abilities of a neural model develop along the intuitive order of the corresponding subtasks, thus forming an implicit curriculum.

Figure 2: Top: Image from the GRiD-A-3D dataset. Bottom: Assumed hierarchy of spatial reasoning tasks to answer different questions of the abstract GRiD-A-3D dataset. Arrows indicate a chronological dependency of tasks, e.g., in order to determine the orientation of an object, it first has to be recognized.

2. Related Work

Aiming to provide a suitable setup to assess the reasoning capabilities of neural models on vision-language tasks, diagnostic datasets have been introduced [4, 5]. One of the major advantages of such datasets is that they provide structured and tightly controlled scenes to prevent models from circumventing reasoning by exploiting conditional biases that commonly arise with real-world images. A particular advantage of diagnostic datasets based on synthetic images is that their generation process is scalable and customizable, and therefore allows for a more fine-grained performance analysis.
The vast majority of diagnostic VQA datasets is limited to spatial reasoning tasks based on the absolute frame of reference, i.e., object positions are relative to the viewer of the image. Yet, taking into account more realistic scenarios such as multi-agent dialogue in a situated environment, understanding relative directions is a prerequisite for meaningful communication. As a consequence, early models for symbolic reasoning with relative directions have been proposed [6, 7, 8]. However, they inherently assume the availability of scene annotations in terms of object labels and spatial relations instead of requiring a model to infer such information implicitly.

An early synthetic dataset providing a test bed for grounding relative directions is Rel3D [9]. Since Rel3D is restricted to two objects per scene and a single task, i.e., binary prediction of (object1, relation, object2) triples, GRiD-3D [1] was introduced, which combines the advantage of a rich number of tasks and questions, as found in traditional synthetic VQA datasets, with the challenge of grounding relative directions.

GRiD-3D is the first dataset of its kind to target multi-task learning of relative directions in a controlled setting. With this dataset, it was shown that, before learning how to answer whether a triple (object1, relation, object2) holds, neural end-to-end VQA models rely on an implicit curriculum of related subtasks such as object detection, orientation estimation, and relation prediction [1]. Objects in GRiD-3D cover a variety of categories, ranging from humanoids and animals to furniture and vehicles. Naturally, such objects differ in terms of proportions, complexity, and, most importantly, symmetry, which can be a crucial determinant of how easily a neural network can infer their orientation (and perform associated tasks).

In this work, we aim to provide a variation of the original dataset that eliminates such potential distortions (see Fig. 3 for examples), enabling a model to more quickly learn how to ground relative directions, which may be of particular value for few-shot, transfer, and curriculum learning scenarios. Accordingly, we extend the GRiD-3D benchmark suite with another diagnostic VQA dataset based on abstract objects.

Figure 3: Common challenges in grounding relative directions arising with real objects, exemplified by objects from the original GRiD-3D dataset. Top left: Occlusion due to variability in heights and shapes of objects. Bottom left: Symmetry of objects impairs the detection of their front sides. Top/bottom right: Replica of the images on the left using abstract objects from the GRiD-A-3D dataset.

3. GRiD-A-3D Abstract VQA Dataset

With the introduction of the GRiD-3D dataset [1], we could show that neural VQA models are capable of grounding relative directions by implicitly deriving a curriculum of subtasks. In order to further generalize the previous findings, we extend our GRiD-3D suite with a diagnostic dataset based on abstract objects whose cardinal direction is indicated by colored arrows.

Overview and statistics. With our new GRiD-A-3D dataset, we address the following six tasks: Existence Prediction, Orientation Prediction, Link Prediction, Relation Prediction, Counting, and Triple Classification. All 8 000 rendered images are split without overlap into 6 400 for training, 800 for validation, and 800 for testing. The 432 948 corresponding questions follow largely the same 80:10:10 ratio, yielding 346 984, 43 393, and 42 571 questions for each set, respectively. The GRiD-A-3D dataset is thus of an order of magnitude comparable with the GRiD-3D dataset, both in terms of image and question counts.
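The image-level split above can be reproduced in a few lines. The snippet below is a hedged sketch under the assumption that questions simply inherit the split of the image they refer to, which would also explain why the question-level counts only approximately follow the 80:10:10 ratio.

```python
import random

def split_images(image_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Disjoint train/val/test split over image ids (sketch, not the official script)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_images(range(8000))
print(len(train), len(val), len(test))  # 6400 800 800
# Questions are assumed to inherit the split of their image, so the question-level
# ratio is only approximately 80:10:10 (346 984 / 43 393 / 42 571 in GRiD-A-3D).
```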
Image generation. For each image, we generate a scene by randomly placing three to five distinct objects onto a plane and render the corresponding image at a resolution of 480x320 pixels via Blender (https://www.blender.org/). We choose a consistent lighting setup across all images, add shadows to each object, and restrict the image generation to a fixed camera angle, thus obtaining one image per scene.

Our object set comprises gray-coloured polygonal prisms approximating a cylinder shape, each marked with an arrow in one of six different colours: three primary colours (red, blue, and green) and three additive secondary colours (yellow, cyan, and magenta). The tip of each arrow indicates the object's front side, allowing for distinct relative directions between objects in the image. An overview of all six objects can be found in Fig. 4. Note that the overall object count in the original GRiD-3D dataset is 28, whereas GRiD-A-3D is restricted to six different objects.

Figure 4: The six abstract objects used in the GRiD-A-3D dataset.
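As a rough illustration of the scene sampling just described, the sketch below draws three to five distinct arrow objects and places them on a plane. The plane size, minimum distance between objects, and heading distribution are assumptions, and the actual Blender rendering step is only indicated by a comment.

```python
import random

OBJECT_COLOURS = ["red", "green", "blue", "yellow", "cyan", "magenta"]

def sample_scene(rng, min_objects=3, max_objects=5, plane_size=6.0, min_dist=1.5):
    """Sample one abstract scene: three to five distinct arrow objects on a plane.
    Sketch of the generation logic described above; concrete value ranges,
    the overlap check, and the Blender interface are assumptions."""
    n = rng.randint(min_objects, max_objects)
    colours = rng.sample(OBJECT_COLOURS, n)  # objects within a scene are distinct
    objects = []
    for colour in colours:
        while True:
            x = rng.uniform(-plane_size / 2, plane_size / 2)
            y = rng.uniform(-plane_size / 2, plane_size / 2)
            # simple rejection sampling to keep objects apart on the plane
            if all((x - o["x"]) ** 2 + (y - o["y"]) ** 2 >= min_dist ** 2 for o in objects):
                break
        objects.append({
            "colour": colour,
            "x": x,
            "y": y,
            # continuous heading in degrees; the cardinal direction queried by the
            # Orientation Prediction task would be derived from it
            "heading": rng.uniform(0.0, 360.0),
        })
    return objects

scene = sample_scene(random.Random(42))
# Each sampled scene would then be handed to a Blender script (fixed camera,
# consistent lighting, shadows) and rendered at 480x320 pixels.
print(scene)
```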
Question generation. In addition to rendering the images from our sampled scenes, we obtain scene graphs equipped with ground-truth information such as the absolute position, orientation, and relative directions of objects, which we use to generate questions related to the six tasks contained in GRiD-A-3D. Our question generation builds upon the framework provided with CLEVR [4], whose question templates, synonym, and metadata files we tailor to our dataset. Likewise, our question generation pipeline is expressed as a template-based functional program executed on each scene graph. We follow a depth-first search strategy to determine and instantiate question-answer pairs that comply with the scene information and can therefore be considered valid. We set additional constraints to make sure that answers are uniformly distributed for each task. To ensure a wide variety of natural language questions, we sample from a rich set of differently phrased question templates for each reasoning task and randomly omit utterances or replace words with suitable synonyms.
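The following minimal sketch illustrates only the template-filling and synonym-substitution step of such a pipeline. The templates, slot names, and synonym lists shown here are invented for illustration and are much simpler than the CLEVR-derived templates actually used.

```python
import random

# Illustrative templates and synonyms; the real GRiD-A-3D templates are adapted
# from the CLEVR question-generation framework and are considerably richer.
TEMPLATES = {
    "existence":   "Is there a <C> arrow in the image?",
    "orientation": "What is the cardinal direction of the <C> arrow?",
    "relation":    "From the <C2> arrow's perspective, what is the relation "
                   "between the <C1> arrow and the <C2> arrow?",
    "triple":      "From the <C2> arrow's perspective, is the <C1> arrow <R> it?",
}
SYNONYMS = {"arrow": ["arrow", "object", "thing"], "image": ["image", "picture", "scene"]}

def instantiate(task, bindings, rng):
    """Fill a template with scene-graph bindings and apply synonym substitution."""
    question = TEMPLATES[task]
    for slot, value in bindings.items():
        question = question.replace(slot, value)
    for word, options in SYNONYMS.items():
        question = question.replace(word, rng.choice(options))
    return question

rng = random.Random(0)
print(instantiate("triple", {"<C1>": "yellow", "<C2>": "green", "<R>": "left of"}, rng))
# prints one phrasing of a Triple Classification question; the exact synonym
# choice depends on the seed
```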
4. Evaluations

For our experiments, we train MAC [2] and FiLM [3], two state-of-the-art neural end-to-end VQA architectures, on our new GRiD-A-3D dataset (cf. Fig. 5). Both architectures take raw RGB images and plain-text question-answer pairs as input for training. Image features are extracted by a pretrained ResNet101 [10] for both models, while questions are encoded by a GRU [11] (FiLM) or a bidirectional LSTM [12] (MAC), respectively. Subsequently, image and question features are fed to special neural units called residual blocks (FiLM) or MAC cells (MAC). A chain of such units forms the core of the reasoning process. We use existing PyTorch (https://pytorch.org/) implementations of FiLM and MAC with their default hyperparameters from the published CLEVR [4] evaluations, except for the number of MAC cells, which we reduce to four to prevent overfitting. All experiments are run for 100 epochs and repeated three times with different seeds to reduce the impact of the random initialization of the models on the results. Fig. 6 shows the mean and the standard deviation of the evaluations.

Figure 5: Neural end-to-end VQA models FiLM and MAC used for our experiments. The generic units (here colored in orange and blue, respectively) control how the question and image features are being processed.

Figure 6: Multi-task learning results of FiLM (orange lines) and MAC (blue lines) on each of the six reasoning tasks of the GRiD-A-3D dataset (solid lines) vs. training the same models on the original GRiD-3D dataset (dotted lines). Each panel plots accuracy over the first 50 training epochs for one task (top row: Existence Prediction, Orientation Prediction, Link Prediction; bottom row: Relation Prediction, Counting, Triple Classification).

We interpret our results in the following way: Existence and Orientation Prediction are learned earlier than the other tasks. We explain this observation by the fact that these tasks only require a model to focus on a single object. For the most straightforward task of Existence Prediction, we observe similar behavior on the two datasets: both converge to an accuracy of almost 100% at nearly the same time. For the Orientation Prediction task, we observe convergence to an accuracy of over 80% for both datasets. Noticeably, learning happens faster on the abstract GRiD-A-3D dataset. The shorter learning time can be attributed to the more unequivocal identification of the front and back sides of the abstract objects due to the lack of symmetry-related noise, as shown in Fig. 3. The fact that the accuracy on Orientation Prediction is capped at about 85% can be explained by objects placed close to the border between two cardinal directions, as such cases are difficult for the models to learn and classify.

A similar learning behavior can be observed for the more complex tasks of Relation Prediction, Triple Classification, and Link Prediction, where both models converge faster when trained on GRiD-A-3D and also reach slightly higher accuracy. Similarly to the results on the Orientation Prediction task, the main reason for these observations may lie in the facilitated learning conditions due to the lack of front-back symmetries or strong occlusions with the abstract objects. This effect is most pronounced for Link Prediction, i.e., predicting which target object is in a given relation to some reference object. We attribute this observation to the smaller set of objects in the GRiD-A-3D dataset.

Finally, we observe a mixed result for the Counting task: While learning of both VQA models converges faster on the GRiD-A-3D dataset, higher accuracy is reached on the GRiD-3D dataset. We hypothesize that this higher accuracy stems from the more diverse-looking objects in the GRiD-3D dataset, which make it easier for the models to distinguish and thus count multiple objects in close proximity.

In summary, our results suggest two conclusions: First, the abstract GRiD-A-3D dataset leads to faster learning and can thus enable more computationally efficient experimentation while achieving results comparable to those on the original GRiD-3D dataset. Second, the results support our assumption of a chronology of subtasks, as Existence Prediction and Orientation Prediction are learned before the models can reason about relative directions.
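For orientation, the multi-seed protocol described above can be organized roughly as follows. This is a minimal sketch: `build_model`, `train_one_epoch`, and `evaluate_per_task` are hypothetical stand-ins for the existing PyTorch implementations of FiLM and MAC, not their actual APIs, and the `num_mac_cells` argument is only meaningful for the MAC variant.

```python
import statistics
import torch

def run_experiments(build_model, train_one_epoch, evaluate_per_task,
                    train_loader, val_loader, epochs=100, seeds=(0, 1, 2)):
    """Train a VQA model once per seed and aggregate per-task validation accuracies."""
    histories = []
    for seed in seeds:
        torch.manual_seed(seed)                 # control random initialization
        model = build_model(num_mac_cells=4)    # reduced from the CLEVR default
        per_epoch = []
        for _ in range(epochs):
            train_one_epoch(model, train_loader)
            per_epoch.append(evaluate_per_task(model, val_loader))  # dict: task -> accuracy
        histories.append(per_epoch)
    # Mean and standard deviation over seeds for each task and epoch (as plotted in Fig. 6).
    tasks = histories[0][0].keys()
    return {
        task: [(statistics.mean(h[e][task] for h in histories),
                statistics.stdev(h[e][task] for h in histories))
               for e in range(epochs)]
        for task in tasks
    }
```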
5. Conclusions

This work is an extension of previous work on grounding relative directions with end-to-end neural VQA architectures. We provide a comprehensive yet simplified GRiD-A-3D dataset with abstract objects that shows behavior similar to the original GRiD-3D dataset when learned by the two established VQA models FiLM and MAC. With our experiments, we show that tasks that focus on a single object, such as object recognition and orientation prediction, are learned prior to grounding relative directions and object counting.

The abstract nature of the dataset eliminates approximate front-back object symmetries that can have a negative impact on object orientation prediction and on all reasoning tasks about directional relations that build upon it. Furthermore, the simplification of the object set allows for conducting experiments with a more comprehensive dataset. In future work, this will allow us to conduct fast pilot studies on curriculum and transfer learning based on the intuitive dependency of the different spatial reasoning tasks on one another and the observed implicit curriculum.

Acknowledgments

The authors gratefully acknowledge support from the German Research Foundation (DFG) for the projects CML TRR 169, LeCAREbot, and IDEAS.

References

[1] J. H. Lee, M. Kerzel, K. Ahrens, C. Weber, S. Wermter, What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning, in: Thirty-First International Joint Conference on Artificial Intelligence, 2022.
[2] D. A. Hudson, C. D. Manning, Compositional attention networks for machine reasoning, in: International Conference on Learning Representations, 2018.
[3] E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville, FiLM: Visual Reasoning with a General Conditioning Layer, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[4] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] D. A. Hudson, C. D. Manning, GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[6] R. Moratz, T. Tenbrink, Spatial Reference in Linguistic Human-Robot Interaction: Iterative, Empirically Supported Development of a Model of Projective Relations, Spatial Cognition & Computation 6 (2006). doi:10.1207/s15427633scc0601_3.
[7] J. H. Lee, J. Renz, D. Wolter, StarVars: Effective Reasoning About Relative Directions, in: Twenty-Third International Joint Conference on Artificial Intelligence, AAAI Press, 2013.
[8] H. Hua, J. Renz, X. Ge, Qualitative Representation and Reasoning over Direction Relations across Different Frames of Reference, in: Sixteenth International Conference on Principles of Knowledge Representation and Reasoning, 2018.
[9] A. Goyal, K. Yang, D. Yang, J. Deng, Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D, Advances in Neural Information Processing Systems 33 (2020).
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, in: NIPS 2014 Workshop on Deep Learning, 2014.
[12] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.