=Paper=
{{Paper
|id=Vol-3190/paper4
|storemode=property
|title=Knowing Earlier What Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-task Learning
|pdfUrl=https://ceur-ws.org/Vol-3190/paper4.pdf
|volume=Vol-3190
|authors=Kyra Ahrens,Matthias Kerzel,Jae Hee Lee,Cornelius Weber,Stefan Wermter
|dblpUrl=https://dblp.org/rec/conf/ijcai/AhrensK0WW22
}}
==Knowing Earlier What Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-task Learning==
Kyra Ahrens†, Matthias Kerzel†, Jae Hee Lee†, Cornelius Weber and Stefan Wermter
University of Hamburg
† These authors contributed equally.
kyra.ahrens@uni-hamburg.de (K. Ahrens); matthias.kerzel@uni-hamburg.de (M. Kerzel); jae.hee.lee@uni-hamburg.de (J. H. Lee); cornelius.weber@uni-hamburg.de (C. Weber); stefan.wermter@uni-hamburg.de (S. Wermter)
STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria

Abstract

Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.

1. Introduction

Reasoning to solve complex spatial tasks like grounding directional relations in an intrinsic frame of reference can be decomposed into a set of hierarchically organized subtasks. Consider two objects 𝑜1 and 𝑜2 in an image, where each object has a clear front side and orientation. Learning to answer whether the triple (𝑜1, 𝑟, 𝑜2) holds for a given directional relation 𝑟 in a frame of reference that is intrinsic to 𝑜2 spans the following stages (see Fig. 1 for an example):

1. Both the target object and the reference object have to be recognized in the image (existence prediction). In other words, an agent must initially be capable of answering questions such as "Is 𝑜1 in the image?" or "Is 𝑜2 in the image?".
2. Next, the object's pose that defines the relative relation has to be discerned, enabling an agent to successfully respond to questions such as "What is the cardinal direction of 𝑜2?" (orientation prediction).
3. Predicting the directional relation using the intrinsic frame of reference is learned by combining the two preceding competencies, allowing an agent to answer a question similar to "What is the relation between 𝑜1 and 𝑜2 from the perspective of 𝑜2?" (relation prediction). Likewise, predicting which target object is in a specific relation to some reference object (link prediction) can be answered, e.g., "Taking 𝑜2's perspective, which object is in relation 𝑟 to it?".
4. Based on all previous stages, an agent can determine whether a specific directional relationship exists between the two objects (triple classification), thus successfully providing an answer to a question like "From 𝑜2's perspective, is 𝑜1 left of 𝑜2?".

Figure 1: Example of grounding relative directions, e.g., considering the green arrow's perspective, the yellow arrow is on the left in front of it.
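To make the grounding problem concrete, the following minimal Python sketch derives a relative direction from 2D object positions and the reference object's intrinsic heading. The function name, the four-way discretization, and the angle thresholds are illustrative assumptions rather than the dataset's annotation code (Fig. 1, for instance, combines two directions in its description).

```python
import math

def relative_direction(target_xy, reference_xy, reference_heading):
    """Classify where `target_xy` lies relative to `reference_xy`,
    using the reference object's intrinsic heading (radians, 0 = facing +x).
    Illustrative sketch only; GRiD-A-3D's actual relation vocabulary may differ."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Rotate the displacement into the reference object's frame,
    # so that +x points where the reference object faces and +y is its left.
    cos_h, sin_h = math.cos(-reference_heading), math.sin(-reference_heading)
    fx = dx * cos_h - dy * sin_h   # forward component
    fy = dx * sin_h + dy * cos_h   # lateral component (+ = left)
    angle = math.atan2(fy, fx)     # angle of the target in the intrinsic frame
    if -math.pi / 4 <= angle < math.pi / 4:
        return "in front of"
    if math.pi / 4 <= angle < 3 * math.pi / 4:
        return "left of"
    if -3 * math.pi / 4 <= angle < -math.pi / 4:
        return "right of"
    return "behind"

# With the reference object facing the +y direction, a target at (-2.0, 0.5)
# lies to its left in the intrinsic frame.
print(relative_direction((-2.0, 0.5), (0.0, 0.0), math.pi / 2))  # -> "left of"
```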
In previous work [1], we showed that enabling a VQA architecture to reason about relative directions is viable, provided that all of the learning stages listed above are encapsulated in corresponding subtasks, as summarized in Fig. 2. Beyond that, the following two observations were made: First, the subtasks that are found earlier in the chronology of learning stages are also learned earlier by the models, and second, this behavior is consistent across different neural end-to-end models. However, these findings are based on experiments involving images with 3D models of real objects, which may introduce a potential bias that confounds the analysis of reasoning about relative directions.

In the present work, we introduce GRiD-A-3D, a novel and simplified diagnostic VQA dataset, which allows for a more efficient and targeted analysis of the corresponding reasoning process by removing possible biases that stem from using real-world objects. Subsequently, we report the performance of the two established end-to-end VQA models MAC [2] and FiLM [3] on this dataset. With our experiments, we show that, when trained on GRiD-A-3D, both models exhibit a qualitative learning behavior similar to that of their replicas trained on the more complex, non-abstract GRiD-3D [1] dataset. At the same time, training converges up to three times faster, thus allowing more efficient neural experiments.

We summarize the contributions made in this paper as follows:

• We complement our GRiD-3D benchmark suite (https://github.com/knowledgetechnologyuhh/grid-3d) with a novel GRiD-A-3D (Grounding Relative Directions with Abstract objects in 3D) dataset that enables a faster and less biased evaluation of spatial reasoning behavior in VQA compared with the original GRiD-3D dataset.
• We verify our previous research findings with the new dataset, thus underpinning our hypothesis that multi-task learning enables neural models to learn to ground relative directions in VQA.
• Furthermore, we add evidence to our hypothesis that during multi-task learning, the spatial reasoning abilities of a neural model develop along the intuitive order of the corresponding subtasks, thus forming an implicit curriculum.

Figure 2: Top: Image from the GRiD-A-3D dataset. Bottom: Assumed hierarchy of spatial reasoning tasks to answer different questions of the abstract GRiD-A-3D dataset. Arrows indicate a chronological dependency of tasks, e.g., in order to determine the orientation of an object, it first has to be recognized.

2. Related Work

Aiming to provide a suitable setup to assess the reasoning capabilities of neural models on vision-language tasks, diagnostic datasets have been introduced [4, 5]. One of the major advantages of such datasets is that they provide structured and tightly controlled scenes to prevent models from circumventing reasoning by exploiting conditional biases that commonly arise with real-world images. A particular advantage of diagnostic datasets based on synthetic images is that their generation process is scalable and customizable, and therefore allows for a more fine-grained performance analysis.
The vast majority of diagnostic VQA datasets is limited to spatial reasoning tasks based on the absolute frame of reference, i.e., object positions are relative to the viewer of the image. Yet, taking into account more realistic scenarios such as multi-agent dialogue in a situated environment, understanding relative directions is a prerequisite for meaningful communication. As a consequence, early models for symbolic reasoning with relative directions have been proposed [6, 7, 8]. However, they inherently assume the availability of scene annotations in terms of object labels and spatial relations instead of requiring a model to infer such information implicitly.

An early synthetic dataset providing a test bed for grounding relative directions is Rel3D [9]. Since Rel3D is restricted to two objects per scene and a single task, i.e., binary prediction of (object1, relation, object2) triples, GRiD-3D [1] was introduced, which combines the advantage of a rich number of tasks and questions, as found in traditional synthetic VQA datasets, with the challenge of grounding relative directions.

GRiD-3D is the first dataset of its kind to target multi-task learning of relative directions in a controlled setting. With this dataset, it was shown that, before learning how to answer whether a triple (object1, relation, object2) holds, neural end-to-end VQA models rely on an implicit curriculum of related subtasks such as object detection, orientation estimation, and relation prediction [1]. Objects in GRiD-3D cover a variety of categories, ranging from humanoids and animals to furniture and vehicles. Naturally, such objects differ in terms of proportions, complexity, and, most importantly, symmetry, which can be a crucial determinant of how easily a neural network can infer their orientation (and perform associated tasks).

In this work, we aim to provide a variation of the original dataset that eliminates such potential distortions (see Fig. 3 for examples), enabling a model to more quickly learn how to ground relative directions, which may be of particular value for few-shot, transfer, and curriculum learning scenarios. Accordingly, we extend the GRiD-3D benchmark suite with another diagnostic VQA dataset based on abstract objects.

Figure 3: Common challenges in grounding relative directions arising with real objects, exemplified by objects from the original GRiD-3D dataset. Top left: Occlusion due to variability in heights and shapes of objects. Bottom left: Symmetry of objects impairs the detection of their front sides. Top/bottom right: Replica of the images on the left using abstract objects from the GRiD-A-3D dataset.

3. GRiD-A-3D Abstract VQA Dataset

With the introduction of the GRiD-3D dataset [1], we could show that neural VQA models are capable of grounding relative directions by implicitly deriving a curriculum of subtasks. In order to further generalize the previous findings, we extend our GRiD-3D suite with a diagnostic dataset based on abstract objects whose cardinal direction is indicated by colored arrows.

Overview and statistics. With our new GRiD-A-3D dataset, we address the following six tasks: Existence Prediction, Orientation Prediction, Link Prediction, Relation Prediction, Counting, and Triple Classification. All 8 000 rendered images are split without overlap into 6 400 for training, 800 for validation, and 800 for testing. The 432 948 corresponding questions follow largely the same 80:10:10 ratio, yielding 346 984, 43 393, and 42 571 questions for each set, respectively. The GRiD-A-3D dataset is thus of an order of magnitude comparable with the GRiD-3D dataset, both in terms of image and question counts.
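The image-level split above can be reproduced in a few lines. The snippet below is a hedged sketch under the assumption that questions simply inherit the split of the image they refer to, which would also explain why the question-level counts only approximately follow the 80:10:10 ratio.

```python
import random

def split_images(image_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Disjoint train/val/test split over image ids (sketch, not the official script)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_images(range(8000))
print(len(train), len(val), len(test))  # 6400 800 800
# Questions are assumed to inherit the split of their image, so the question-level
# ratio is only approximately 80:10:10 (346 984 / 43 393 / 42 571 in GRiD-A-3D).
```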
Image generation. For each image, we generate a scene by randomly placing three to five distinct objects onto a plane and render the corresponding image at a resolution of 480x320 pixels via Blender (https://www.blender.org/). We choose a consistent lighting setup across all images, add shadows to each object, and restrict the image generation to a fixed camera angle, thus obtaining one image per scene.

Our object set comprises gray-coloured polygonal prisms approximating a cylinder shape, each marked with an arrow in one of six different colours: three primary colours (red, blue, and green) and three additive secondary colours (yellow, cyan, and magenta). The tip of each arrow indicates the object's front side, allowing for distinct relative directions between objects in the image. An overview of all six objects can be found in Fig. 4. Note that the overall object count in the original GRiD-3D dataset is 28, whereas GRiD-A-3D is restricted to six different objects.

Figure 4: The six abstract objects used in the GRiD-A-3D dataset.
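As a rough illustration of the scene sampling just described, the sketch below draws three to five distinct arrow objects and places them on a plane. The plane size, minimum distance between objects, and heading distribution are assumptions, and the actual Blender rendering step is only indicated by a comment.

```python
import random

OBJECT_COLOURS = ["red", "green", "blue", "yellow", "cyan", "magenta"]

def sample_scene(rng, min_objects=3, max_objects=5, plane_size=6.0, min_dist=1.5):
    """Sample one abstract scene: three to five distinct arrow objects on a plane.
    Sketch of the generation logic described above; concrete value ranges,
    the overlap check, and the Blender interface are assumptions."""
    n = rng.randint(min_objects, max_objects)
    colours = rng.sample(OBJECT_COLOURS, n)  # objects within a scene are distinct
    objects = []
    for colour in colours:
        while True:
            x = rng.uniform(-plane_size / 2, plane_size / 2)
            y = rng.uniform(-plane_size / 2, plane_size / 2)
            # simple rejection sampling to keep objects apart on the plane
            if all((x - o["x"]) ** 2 + (y - o["y"]) ** 2 >= min_dist ** 2 for o in objects):
                break
        objects.append({
            "colour": colour,
            "x": x,
            "y": y,
            # continuous heading in degrees; the cardinal direction queried by the
            # Orientation Prediction task would be derived from it
            "heading": rng.uniform(0.0, 360.0),
        })
    return objects

scene = sample_scene(random.Random(42))
# Each sampled scene would then be handed to a Blender script (fixed camera,
# consistent lighting, shadows) and rendered at 480x320 pixels.
print(scene)
```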
Question generation. In addition to rendering the images from our sampled scenes, we obtain scene graphs equipped with ground-truth information such as the absolute position, orientation, and relative directions of objects, which we use to generate questions related to the six tasks contained in GRiD-A-3D. Our question generation builds upon the framework provided with CLEVR [4], whose question templates, synonym, and metadata files we tailor to our dataset. Likewise, our question generation pipeline is expressed as a template-based functional program executed on each scene graph. We follow a depth-first search strategy to determine and instantiate question-answer pairs that comply with the scene information and can therefore be considered valid. We set additional constraints to make sure that answers are uniformly distributed for each task. To ensure a wide variety of natural language questions, we sample from a rich set of differently phrased question templates for each reasoning task and randomly omit utterances or replace words with suitable synonyms.
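The following minimal sketch illustrates only the template-filling and synonym-substitution step of such a pipeline. The templates, slot names, and synonym lists shown here are invented for illustration and are much simpler than the CLEVR-derived templates actually used.

```python
import random

# Illustrative templates and synonyms; the real GRiD-A-3D templates are adapted
# from the CLEVR question-generation framework and are considerably richer.
TEMPLATES = {
    "existence":   "Is there a <C> arrow in the image?",
    "orientation": "What is the cardinal direction of the <C> arrow?",
    "relation":    "From the <C2> arrow's perspective, what is the relation "
                   "between the <C1> arrow and the <C2> arrow?",
    "triple":      "From the <C2> arrow's perspective, is the <C1> arrow <R> it?",
}
SYNONYMS = {"arrow": ["arrow", "object", "thing"], "image": ["image", "picture", "scene"]}

def instantiate(task, bindings, rng):
    """Fill a template with scene-graph bindings and apply synonym substitution."""
    question = TEMPLATES[task]
    for slot, value in bindings.items():
        question = question.replace(slot, value)
    for word, options in SYNONYMS.items():
        question = question.replace(word, rng.choice(options))
    return question

rng = random.Random(0)
print(instantiate("triple", {"<C1>": "yellow", "<C2>": "green", "<R>": "left of"}, rng))
# prints one phrasing of a Triple Classification question; the exact synonym
# choice depends on the seed
```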
4. Evaluations

For our experiments, we train MAC [2] and FiLM [3], two state-of-the-art neural end-to-end VQA architectures, on our new GRiD-A-3D dataset (cf. Fig. 5). Both architectures take raw RGB images and plain-text question-answer pairs as input for training. Image features are extracted by a pretrained ResNet101 [10] for both models, while questions are encoded by a GRU [11] (FiLM) or a bidirectional LSTM [12] (MAC), respectively. Subsequently, image and question features are fed to special neural units called residual blocks (FiLM) or MAC cells (MAC). A chain of such units forms the core of the reasoning process. We use existing PyTorch (https://pytorch.org/) implementations of FiLM and MAC with their default hyperparameters from the published CLEVR [4] evaluations, except for the number of MAC cells, which we reduce to four to prevent overfitting. All experiments are run for 100 epochs and repeated three times with different seeds to reduce the impact of the random initialization of the models on the results. Fig. 6 shows the mean and the standard deviation of the evaluations.

Figure 5: Neural end-to-end VQA models FiLM and MAC used for our experiments. The generic units (here colored in orange and blue, respectively) control how the question and image features are being processed.

Figure 6: Multi-task learning results of FiLM (orange lines) and MAC (blue lines) on each of the six reasoning tasks of the GRiD-A-3D dataset (solid lines) vs. training the same models on the original GRiD-3D dataset (dotted lines). Each panel plots accuracy over the first 50 training epochs for one task (top row: Existence Prediction, Orientation Prediction, Link Prediction; bottom row: Relation Prediction, Counting, Triple Classification).

We interpret our results in the following way: Existence and Orientation Prediction are learned earlier than the other tasks. We explain this observation by the fact that these tasks only require a model to focus on a single object. For the most straightforward task of Existence Prediction, we observe similar behavior on the two datasets: both converge to an accuracy of almost 100% at nearly the same time. For the Orientation Prediction task, we observe convergence to an accuracy of over 80% for both datasets. Noticeably, learning happens faster on the abstract GRiD-A-3D dataset. The shorter learning time can be attributed to the more unequivocal identification of the front and back sides of the abstract objects due to the lack of symmetry-related noise, as shown in Fig. 3. The fact that the accuracy on Orientation Prediction is capped at about 85% can be explained by objects placed close to the border between two cardinal directions, as such cases are difficult for the models to learn and classify.

A similar learning behavior can be observed for the more complex tasks of Relation Prediction, Triple Classification, and Link Prediction, where both models converge faster when trained on GRiD-A-3D and also reach slightly higher accuracy. Similarly to the results on the Orientation Prediction task, the main reason for these observations may lie in the facilitated learning conditions due to the lack of front-back symmetries or strong occlusions with the abstract objects. This effect is most pronounced for Link Prediction, i.e., predicting which target object is in a given relation to some reference object. We attribute this observation to the smaller set of objects in the GRiD-A-3D dataset.

Finally, we observe a mixed result for the Counting task: While learning of both VQA models converges faster on the GRiD-A-3D dataset, higher accuracy is reached on the GRiD-3D dataset. We hypothesize that this higher accuracy stems from the more diverse-looking objects in the GRiD-3D dataset, which make it easier for the models to distinguish and thus count multiple objects in close proximity.

In summary, our results suggest two conclusions: First, the abstract GRiD-A-3D dataset leads to faster learning and can thus enable more computationally efficient experimentation while achieving results comparable to those on the original GRiD-3D dataset. Second, the results support our assumption of a chronology of subtasks, as Existence Prediction and Orientation Prediction are learned before the models can reason about relative directions.
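For orientation, the multi-seed protocol described above can be organized roughly as follows. This is a minimal sketch: `build_model`, `train_one_epoch`, and `evaluate_per_task` are hypothetical stand-ins for the existing PyTorch implementations of FiLM and MAC, not their actual APIs, and the `num_mac_cells` argument is only meaningful for the MAC variant.

```python
import statistics
import torch

def run_experiments(build_model, train_one_epoch, evaluate_per_task,
                    train_loader, val_loader, epochs=100, seeds=(0, 1, 2)):
    """Train a VQA model once per seed and aggregate per-task validation accuracies."""
    histories = []
    for seed in seeds:
        torch.manual_seed(seed)                 # control random initialization
        model = build_model(num_mac_cells=4)    # reduced from the CLEVR default
        per_epoch = []
        for _ in range(epochs):
            train_one_epoch(model, train_loader)
            per_epoch.append(evaluate_per_task(model, val_loader))  # dict: task -> accuracy
        histories.append(per_epoch)
    # Mean and standard deviation over seeds for each task and epoch (as plotted in Fig. 6).
    tasks = histories[0][0].keys()
    return {
        task: [(statistics.mean(h[e][task] for h in histories),
                statistics.stdev(h[e][task] for h in histories))
               for e in range(epochs)]
        for task in tasks
    }
```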
5. Conclusions

This work is an extension of previous work on grounding relative directions with end-to-end neural VQA architectures. We provide a comprehensive yet simplified GRiD-A-3D dataset with abstract objects that shows behavior similar to the original GRiD-3D dataset when learned by the two established VQA models FiLM and MAC. With our experiments, we show that tasks that focus on a single object, such as object recognition and orientation prediction, are learned prior to grounding relative directions and object counting.

The abstract nature of the dataset eliminates approximate front-back object symmetries that can have a negative impact on object orientation prediction and on all reasoning tasks about directional relations that build upon it. Furthermore, the simplification of the object set allows for conducting experiments with a more comprehensive dataset. In future work, this will allow us to conduct fast pilot studies on curriculum and transfer learning based on the intuitive dependency of the different spatial reasoning tasks on one another and the observed implicit curriculum.

Acknowledgments

The authors gratefully acknowledge support from the German Research Foundation (DFG) for the projects CML TRR 169, LeCAREbot, and IDEAS.

References

[1] J. H. Lee, M. Kerzel, K. Ahrens, C. Weber, S. Wermter, What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning, in: Thirty-First International Joint Conference on Artificial Intelligence, 2022.
[2] D. A. Hudson, C. D. Manning, Compositional attention networks for machine reasoning, in: International Conference on Learning Representations, 2018.
[3] E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville, FiLM: Visual Reasoning with a General Conditioning Layer, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[4] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] D. A. Hudson, C. D. Manning, GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[6] R. Moratz, T. Tenbrink, Spatial Reference in Linguistic Human-Robot Interaction: Iterative, Empirically Supported Development of a Model of Projective Relations, Spatial Cognition & Computation 6 (2006). doi:10.1207/s15427633scc0601_3.
[7] J. H. Lee, J. Renz, D. Wolter, StarVars: Effective Reasoning About Relative Directions, in: Twenty-Third International Joint Conference on Artificial Intelligence, AAAI Press, 2013.
[8] H. Hua, J. Renz, X. Ge, Qualitative Representation and Reasoning over Direction Relations across Different Frames of Reference, in: Sixteenth International Conference on Principles of Knowledge Representation and Reasoning, 2018.
[9] A. Goyal, K. Yang, D. Yang, J. Deng, Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D, Advances in Neural Information Processing Systems 33 (2020).
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, in: NIPS 2014 Workshop on Deep Learning, 2014.
[12] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.