VinVL+L: Enriching Visual Representation with Location Context in VQA

Jiří Vyskočil 1,*, Lukáš Picek 1
1 Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Technická 8, Pilsen, Czech Republic
* Corresponding author: vyskocj@kky.zcu.cz (J. Vyskočil); ORCID 0000-0002-6443-2051 (J. Vyskočil), 0000-0002-6041-9722 (L. Picek)
26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023

Abstract
In this paper, we describe a novel method – VinVL+L – that enriches the visual representations (i.e., object tags and region features) of the State-of-the-Art Vision and Language (VL) method – VinVL – with Location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features, both of which are made public to allow reproducibility and further experiments, (ii) updated the architecture of the existing VinVL method to include the new feature sets, and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, the VinVL+L method provides an incremental improvement over the State-of-the-Art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, an improvement of +0.32%; the statistical significance of the new representations is verified via Approximate Randomization. The code and the newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.

Keywords
Vision and Language, Visual Question Answering, Location Recognition, Oscar, VinVL

1. Introduction

Multi-modal understanding systems can answer general questions from visual and textual data. These questions largely focus on objects and their relations, appearances, or behaviors; the rest ask about the overall scene, such as the location or the weather. Most multi-modal systems are split into visual and textual modules, followed by image-text alignment. Faster R-CNN [1] region features of the detected objects are commonly used for the visual representation and BERT [2] embeddings for the textual one. However, such a visual model only provides information about objects, from which the entire multi-modal system must decide even simple questions like "Are people inside or outside?".

[Figure 1 shows an input image with the question "Where is it?" and the top-3 answers of both models: VinVL (our reproduction) predicts shop (24.6%), store (22.5%), and porch (9.0%), whereas VinVL+L predicts bedroom (59.9%), living room (15.1%), and hotel room (2.3%).]
Figure 1: Example predictions of the proposed VinVL+L. We compare VinVL+L with the State-of-the-Art VinVL on a randomly selected input pair (i.e., image and question) from the GQA test set. VinVL+L aligns the answer to the question better thanks to the enriched visual features.

We intuitively feel that, in general, objects are related to the indoor/outdoor division of a scene even if they cannot be assigned to it directly; they carry a certain weight on the basis of which the correct answer can be decided. For example, cars, sky, and trees are more likely to belong to an outdoor scene; however, the scene may still be indoors, with these categories detected through a garage door.

In addition to [3, 4, 5], the mentioned paradigm of split image-text modules is followed by VinVL [6], which builds on Oscar [3] and additionally adds object tags, i.e., the textual output of an object detection network, to the region features. However, a clear cross-modal representation of the scene is still missing, which can harm the network, as shown in Figure 1.
Our method, based on VinVL, brings a new representation that includes information about the location into the system. This representation is obtained using a classification network trained on the Places365 dataset, which contains a total of 365 location categories, each of which falls directly into either the indoor or the outdoor supercategory. These labels are then passed as scene tags to our VinVL+L method to predict the answers. Besides, we utilize scene features that are generated in the same way as the region features of Oscar/VinVL. Finally, we evaluate the influence of these novelties on the answers. An example of the top-3 predictions of VinVL and our VinVL+L is visualized in Figure 1; more examples are shown in Section 5.3 and Appendix A. Our contributions are:

• We enrich the visual representations of VinVL using global information about the image – its location.
• We present the effectiveness of each new cross-modal representation by comparing the related models, including a reproduced version of VinVL.
• We improve VinVL in visual question answering (VQA), reaching an overall accuracy of 64.85% on the GQA dataset.
• We provide the location-context data that we generated for the GQA dataset.

2. Related Work

Many Vision and Language (VL) methods, like [3, 6, 7, 8], focus on pre-training generic models by combining multiple datasets from different tasks. The models are then fine-tuned to downstream tasks that include image captioning, visual reasoning, or visual question answering. In this section, we briefly review recent approaches to VL tasks and their commonly used Vision Encoders, which are the most relevant to our work.

Vision Encoders. Convolutional Neural Networks (CNNs) gained popularity in image classification when AlexNet [9] won the ImageNet 2012 competition. In the subsequent period, models with skip-connections [10, 11, 12, 13], with blocks having small feed-forward networks in parallel connections [12, 14, 15], or with a focus on optimization [16, 17, 18, 19, 20] were created. In recent years, Transformer-based methods, such as the Vision Transformer [21] or its modification with shifted windows [22], gained favor thanks to their computational efficiency and accuracy. These image classification models are often used as backbone architectures in object detection to predict bounding boxes together with a classification of each object in the image. The most popular detectors are the one-shot YOLO-based architectures [23, 24] and the two-shot Faster R-CNN-based architectures [1], which are generally slower but more accurate than the one-shot ones. The image classification or object detection models are further used as Visual Encoders in VL tasks.

BERT-based VL Methods. End-to-end methods such as MDETR [7] use a pre-trained image classification backbone to extract features and concatenate them with word embeddings taken from a BERT-based model [2, 25]. However, some existing VL methods [3, 4] reuse the features extracted by another approach, e.g., a bottom-up mechanism [5] that extracts object regions via Faster R-CNN, and fine-tune a novel method with an unchanged visual model. These methods include Oscar [3], which introduces object tags as a cross-modal representation to improve the alignment of image-text pairs. Based on Oscar, VinVL [6] improves the visual representation by pre-training a larger model on multiple object detection datasets. Since this method holds State-of-the-Art results on the GQA dataset [26] and represents the image as a set of regional features while suppressing global scene information, we decided to improve the alignment of the cross-modal representation by location recognition.

3. Datasets

Early datasets, such as VQA [27] and COCO-QA [28], contain only the core annotation needed for visual question answering: an image, a question, and the desired one-word answer. However, we are interested in datasets with richer annotations that allow recognizing the type of location in the input image. The GQA dataset does contain such annotations, but only for a part of the images. Therefore, two existing datasets, Places365 and GQA, are suitable for our task; both are thoroughly described below.

Places365 [29]. This dataset consists of 365 location categories that can be directly mapped to an indoor or outdoor category. The balanced training set varies from 3,068 to 5,000 images per location category, while the validation set consists of 50 images per category.
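For illustration, the 365 categories can be collapsed into the two supercategories with a simple lookup. The following is a minimal Python sketch, assuming the Places365 devkit files categories_places365.txt and IO_places365.txt, where (to our knowledge) 1 marks an indoor and 2 an outdoor category; the file names and format are assumptions to be checked against the devkit.

    def load_indoor_outdoor_map(categories_path="categories_places365.txt",
                                io_path="IO_places365.txt"):
        # categories_places365.txt: one category per line, e.g. "/a/airfield 0"
        categories = [line.split()[0] for line in open(categories_path)]
        # IO_places365.txt: category followed by 1 (indoor) or 2 (outdoor)
        io_flag = {line.split()[0]: int(line.split()[1]) for line in open(io_path)}
        return {c: ("indoors" if io_flag[c] == 1 else "outdoors") for c in categories}

    io_map = load_indoor_outdoor_map()
    # e.g. io_map["/b/bedroom"] is expected to be "indoors" (names keep the devkit prefix)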
GQA dataset [26]. This dataset consists of 22,669,678 questions (of which the test2019 split contains 4,237,524) over 113,018 images, with 1,878 possible answers to open and binary yes/no questions. In addition to the questions and answers, each image contains annotations of objects, the relations between them, and their attributes. Besides, an image can contain global information in the form of location and weather, the distribution of which is shown in Table 1. Regarding the evaluation of the results, the following metrics are used:

• Accuracy – overall accuracy, the primary metric,
• Binary – accuracy on yes/no questions,
• Open – accuracy on open questions,
• Consistency – overall accuracy including equivalent answers,
• Plausibility – relative number of answers making sense with respect to the dataset,
• Validity – relative number of answers that are in the question scope,
• Distribution – overall match between the distributions of true answers and model predictions.
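The first three metrics reduce to simple ratios once the per-question correctness is known. The sketch below is only illustrative: it assumes each question record holds its gold "answer" and a hypothetical is_binary flag (the official evaluation script derives the binary/open split from the question type and also computes the remaining metrics).

    def gqa_accuracies(predictions, questions):
        """predictions: {question_id: predicted answer};
        questions: {question_id: {"answer": ..., "is_binary": ...}} (assumed layout)."""
        overall, binary, open_q = [], [], []
        for qid, record in questions.items():
            hit = predictions.get(qid) == record["answer"]
            overall.append(hit)
            (binary if record["is_binary"] else open_q).append(hit)
        mean = lambda xs: 100.0 * sum(xs) / max(len(xs), 1)
        return {"Accuracy": mean(overall), "Binary": mean(binary), "Open": mean(open_q)}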
Table 1
GQA dataset. Distribution of the annotated global information about the scenes in the training and validation splits. The indoors/outdoors rows are given relative to the images with an annotated location.

    Metadata          Training          Validation
    # of images       74,942            10,696
    with weather       6,600  (8.8%)       952  (8.9%)
    with location     23,370 (31.2%)     3,265 (30.5%)
      – indoors        4,520 (19.3%)       638 (19.5%)
      – outdoors      18,850 (80.7%)     2,627 (80.5%)
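The location and weather counts in Table 1 can be recomputed directly from the GQA scene graphs. A minimal sketch follows, assuming the scene graphs are distributed as a JSON dictionary keyed by image id (e.g., a file named train_sceneGraphs.json) in which "location" and "weather" are optional per-image fields; both the file name and the keys are assumptions about the GQA release.

    import json
    from collections import Counter

    with open("train_sceneGraphs.json") as f:
        scene_graphs = json.load(f)

    counts = Counter()
    for graph in scene_graphs.values():
        counts["images"] += 1
        counts["with weather"] += "weather" in graph     # booleans count as 0/1
        counts["with location"] += "location" in graph

    print(counts)  # should roughly reproduce the first rows of Table 1

The indoors/outdoors split additionally requires a mapping from the free-form location values to the two supercategories.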
4. Methodology

Vision and Language (VL) approaches are commonly divided into two phases: pre-training and fine-tuning. In pre-training, multiple datasets from different tasks are combined to create generic models. In fine-tuning, these models are then trained on each of these datasets, called downstream tasks. In this study, we focus on improving the current State-of-the-Art VinVL [6] on the GQA dataset [26]. The improved version learns the image-text representation with respect to the global information about the entire image, such as indoors/outdoors, which is given by the novel scene tags and features.

4.1. Adding locations to VinVL

Based on VinVL [6], we present an extended architecture with scene tags and features. In our work, these representations are generated using a classification network fine-tuned on the Places365 dataset [29], reaching an accuracy of up to 96.1% in the case of binary indoor/outdoor classification (see Section 5.1 for more details). Scene tags are the predicted location categories. Scene features are built in the same style as their object counterparts, i.e., as a 2,048-dimensional feature vector (obtained via Global Average Pooling) concatenated with the top-left & bottom-right corners and the height & width. Besides, the novel scene representations are prepended before the object ones so that the scenes always have the same position in the embeddings for each image-text pair input, as outlined in Figure 2.

[Figure 2 depicts the model input for the masked question "Is [MASK] indoors or outdoors?": the word tokens, the scene & object tags (Living room, Indoors, Tree, Dog, Snowman), and the scene & region features are embedded and processed by multi-layer Transformers trained with the Contrastive Loss and the Masked Token Loss.]
Figure 2: Illustration of VinVL+L. We represent the image-text pair as a quintuple [word tokens, scene tags, object tags, scene features, region features], where the word tokens, object tags, and region features are taken from VinVL [6]. Scene tags and features are proposed to improve the alignment of cross-domain semantics. The example shows a case where, based on the detected objects alone, the scene could be classified as outdoors rather than indoors.
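A minimal sketch of how a scene tag and a scene feature can be produced with the Timm library is given below. The checkpoint is assumed to be already fine-tuned on Places365 (Section 5.1), and the padding of the pooled Swin-B output to the 2,048-dimensional VinVL size is our assumption, since the exact projection is not detailed here.

    import timm
    import torch
    import torch.nn.functional as F

    # "places_model" predicts the 365 categories; "pool_model" is the same backbone
    # with the classifier removed, so it returns pooled (globally averaged) features.
    places_model = timm.create_model("swin_base_patch4_window7_224_in22k",
                                     pretrained=True, num_classes=365)
    pool_model = timm.create_model("swin_base_patch4_window7_224_in22k",
                                   pretrained=True, num_classes=0)

    @torch.no_grad()
    def scene_representation(image, categories, top_k=1):
        """image: (1, 3, 224, 224) tensor; categories: list of 365 Places365 names."""
        probs = places_model(image).softmax(dim=-1)[0]
        scene_tags = [categories[i] for i in probs.topk(top_k).indices.tolist()]

        feat = pool_model(image)[0]                    # 1,024-d for Swin-B
        feat = F.pad(feat, (0, 2048 - feat.numel()))   # match VinVL's 2,048-d (assumption)
        # Whole-image "box": normalized top-left and bottom-right corners, height, width.
        geometry = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
        scene_feature = torch.cat([feat, geometry])
        return scene_tags, scene_feature

The resulting scene tags are inserted into the tag sequence, and the scene feature is prepended before the region features, as described above.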
Even though we do not perform pre-training on various tasks with the new representation, the established pre-training objective of Oscar/VinVL [6] can, in general, be followed. The only change is in the definition of the (w, q, v) triplet input, where w is the word embedding sequence of the text, q is the word embedding sequence of the scene and object tags detected from the image, and v is the visual embedding sequence of the entire image and all detected regions. This input can be viewed from two different perspectives [3, 6]:

    x ≜ [ w, q | v ]  (Dictionary View: Q&A and tag tokens | image features)
      or
    x ≜ [ w | q, v ]  (Modality View: Q&A text | tags and image features),    (1)

where the Dictionary View defines the Masked Token Loss ℒ_MTL, applied on the discrete token sequence h ≜ [w, q] to predict the masked tokens h_i based on their surrounding tokens h_∖i:

    ℒ_MTL = −E_{(h,v)∼𝒟} log p(h_i | h_∖i, v).    (2)

The Modality View defines the Contrastive Loss ℒ_CL for the image representation h′ ≜ [q, v], which is "polluted" by randomly replacing q with another sequence of tags from the dataset 𝒟. To distinguish the original pair (y = 1) from the polluted one (y = 0), a binary classifier f(·), implemented as a fully-connected layer, is applied on top of the [CLS] token. This loss function is defined as [3]:

    ℒ_CL = −E_{(w,h′;y)∼𝒟} log p(y | f(w, h′)).    (3)

Alternatively, VinVL [6] applies the 3-way Contrastive Loss ℒ_CL3 on h* ≜ [w, q, v], instead of the binary ℒ_CL used in Oscar [3], to predict whether the (w, q, v) triplet is the original one (c = 0), contains a polluted w (c = 1), or contains a polluted q (c = 2):

    ℒ_CL3 = −E_{(h*;c)∼𝒟} log p(c | f(w, q, v)).    (4)

By fusing Equations 2 and 4, or 2 and 3, the full pre-training objective is:

    ℒ_Pre-training = ℒ_MTL + ℒ_CL3 (or ℒ_CL).    (5)
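With the 3-way contrastive term, the objective of Equation 5 reduces to two cross-entropy terms. A minimal PyTorch sketch follows; the tensor shapes and the -100 padding convention are assumptions for illustration.

    import torch.nn.functional as F

    def pretraining_loss(mlm_logits, mlm_labels, cls_logits, contrast_labels):
        """mlm_logits: (B, T, vocab); mlm_labels: (B, T) with -100 at unmasked positions;
        cls_logits: (B, 3) from the classifier f on top of [CLS];
        contrast_labels: (B,) in {0: original, 1: polluted w, 2: polluted q}."""
        l_mtl = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                                ignore_index=-100)            # Eq. (2)
        l_cl3 = F.cross_entropy(cls_logits, contrast_labels)  # Eq. (4)
        return l_mtl + l_cl3                                  # Eq. (5)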
4.2. Implementation Details

We use the same feature-vector size as VinVL (i.e., 2,048); these features are then concatenated with the positions and sizes, as described in Section 4.1. The models used from the Timm library [30] are: resnext50d_32x4d [13], gluon_inception_v3 [15], mobilenetv3_small_100 [18], gc_efficientnetv2_rw_t [20], vit_large_patch16_224_in21k [21], and swin_base_patch4_window7_224_in22k [22]. All models are fine-tuned for 20 epochs with SGD and the Focal Loss. We use an initial learning rate of 0.01 and reduce it with a plateau scheduler. The batch size is 64 with 2 gradient accumulation steps. As augmentations, we use a horizontal flip (probability of 50%), a random resized crop (scale from 0.8 to 1.0), and a random brightness/contrast change (probability of 20%); a sketch of this setup is given below.
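A minimal sketch of the fine-tuning configuration with the Timm and torchvision libraries follows. The SGD momentum, the focal-loss gamma, and the use of ColorJitter as a stand-in for the brightness/contrast augmentation are assumptions; the data loading and the training loop are omitted.

    import timm
    import torch
    import torch.nn.functional as F
    from torchvision import transforms

    model = timm.create_model("swin_base_patch4_window7_224_in22k",
                              pretrained=True, num_classes=365)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    ACCUMULATION_STEPS = 2          # batch size 64 with 2 accumulation steps

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply(
            [transforms.ColorJitter(brightness=0.2, contrast=0.2)], p=0.2),
        transforms.ToTensor(),
    ])

    def focal_loss(logits, targets, gamma=2.0):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)         # probability assigned to the true class
        return ((1.0 - pt) ** gamma * ce).mean()

During training, optimizer.step() is called every ACCUMULATION_STEPS batches and scheduler.step(validation_accuracy) after each epoch.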
In the case of the VL model, we use the pre-trained Oscar+BASE with VinVL features and follow the presented procedure, which is the same as for the original Oscar, i.e., pre-training on the unbalanced "all-split" of the GQA dataset for 5 epochs and fine-tuning the model that is best with respect to overall accuracy on the "balanced-split" for 2 epochs. All the results are shown in Section 5.2, together with the reproduction of VinVL that we improve upon.

5. Experiments

Our approach is divided into two separate steps. First, we adapt several image classification models to the Places365 dataset and select the most accurate one to generate the visual representation for the VL model. Then, we fine-tune the VL model using its original and our new visual features.

5.1. Location Recognition

We selected several pre-trained image classification networks in order to cover a certain range of different approaches to location recognition. These include methods focused on high inference speed as well as methods containing skip-connections, parallel paths, or transformers. The results of the fine-tuned models on the Places365 dataset are shown in Table 2.

Table 2
Performance evaluation of the selected networks. Accuracy and Top3 are evaluated on all categories of the Places365-val dataset, AccuracyIO on their binary supercategories (Indoor/Outdoor).

    Backbone         Accuracy   Top3   AccuracyIO
    MobileNetV3      47.9       70.8   94.6
    InceptionV3      53.1       76.0   95.3
    ResNeXt-50-D     54.2       77.0   95.6
    EfficientNetV2   54.7       77.4   95.6
    ViT-Large        54.9       77.7   95.5
    Swin-Base        56.0       78.7   96.1

ResNeXt-50, EfficientNetV2, and ViT-Large have similar performance, while ViT-Large performs slightly worse in indoors/outdoors classification. This is because, when ViT-Large is wrong, it predicts the incorrect indoors/outdoors supercategory more often than the previous two models. The best results are achieved by Swin-Base in both the 365-location and the binary indoors/outdoors recognition. It obtains 56% top-1 accuracy in recognizing the 365 locations, which is 1.1% higher than that of the second-best ViT-Large. Therefore, this model is further used to extract the novel visual representations for our VinVL+L.

5.2. Visual Question Answering

Statistical significance of novel features. We show the advantages of the new visual representations by comparing our method with the reproduced VinVL trained with the same pipeline; see Table 5. The used scene tags, i.e., the 365 location categories (C) or indoors/outdoors (IO), are denoted in the subscript of the model name. Besides, we compute the statistical significance [33] between the two models to show that recognizing the location categories truly brings benefits and is not just a coincidence. For demonstration, we compare the reproduced VinVL with our VinVL+L_C on the validation dataset. Our goal is to reject the null hypothesis defined as "there is no difference between system A and B". To do this, we shuffle the predictions between systems A and B with a probability of 50% and compare the resulting performance with the initial one (all repeated 10,000 times). Consequently, we reject the null hypothesis at the 95% significance level, i.e., at a threshold of 0.05, with an obtained p-value of 0.03. The same conclusion is reached for VinVL+L_IO. In the case of VinVL+L_C+IO, the difference is not significant, so the null hypothesis cannot be rejected.

Table 5
Accuracy of answers on the validation dataset. We evaluate the reproduced VinVL and our improved versions on the balanced validation split of GQA.

    Backbone             Accuracy   Binary   Open
    VinVL (reproduced)   63.2       52.5     82.3
    VinVL+L_C+IO         63.4       52.7     82.3
    VinVL+L_C            63.8       53.0     83.0
    VinVL+L_IO           64.1       53.7     82.6
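The randomization test described above can be sketched as follows; this is a minimal implementation of the described shuffling procedure, assuming the per-question 0/1 correctness vectors of the two systems are aligned.

    import random

    def approximate_randomization(correct_a, correct_b, trials=10_000, seed=0):
        """Returns the approximate p-value for the accuracy difference of A and B."""
        rng = random.Random(seed)
        n = len(correct_a)
        observed = abs(sum(correct_a) - sum(correct_b)) / n
        at_least_as_extreme = 0
        for _ in range(trials):
            diff = 0
            for a, b in zip(correct_a, correct_b):
                if rng.random() < 0.5:      # swap the two systems' predictions
                    a, b = b, a
                diff += a - b
            if abs(diff) / n >= observed:
                at_least_as_extreme += 1
        return (at_least_as_extreme + 1) / (trials + 1)

A returned value below the 0.05 threshold rejects the null hypothesis, matching the p-value of 0.03 reported above.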
The significance may seem small from a general point of view. However, it should be considered that these results were achieved simply by adding locations to the system. To improve the significance, the scene features should be generated by the same model as the region features. In addition, other global information, such as the weather, may be included.

Comparison on the test set. Although we followed the original training pipeline, on which the results of our models are based, it should be noted that the reproduced VinVL works worse than the original version. Therefore, we decided to select the models after the 1st, 3rd, and 5th epoch of the pre-training on the unbalanced set. Then we fine-tuned these models for 2 epochs on the balanced set to slightly increase the final performance. We selected the best model with respect to overall accuracy on the validation set and submitted the results to the evaluation server. The performance of the models is listed in Table 4. The reproduced version of VinVL still performs worse than the original one, but the difference decreases with this modification of the training.

Table 3
Results of individual methods according to the official leaderboard. We show the performance of the prior State-of-the-Art methods on the GQA dataset, sorted by the primary metric, Accuracy. The meaning of the individual metrics is described in Section 3.

    Method          ↑Accuracy   ↑Binary   ↑Open   ↑Consist.   ↑Plausib.   ↑Valid.   ↓Distrib.
    Bottom-Up [5]   49.74       66.64     34.83   78.71       84.57       96.18     5.98
    MMN [4]         60.83       78.90     44.89   92.49       84.55       96.19     5.54
    Oscar [3]       61.62       -         -       -           -           -         -
    MDETR [31]      62.45       80.91     46.15   93.95       84.15       96.33     5.36
    LXR955 [8]      62.71       79.79     47.64   93.10       85.21       96.36     6.42
    NSM [32]        63.17       78.94     49.25   93.25       84.28       96.41     3.71
    VinVL [6]       64.65       82.63     48.77   94.35       84.98       96.62     4.72

Table 4
Performance evaluation of the individual scene tags. We compare the reproduced VinVL with additional scene tags: Indoors/Outdoors (IO) and/or the 365 location categories (C). The last row gives the improvement/deterioration as the difference between our best model and the reproduced VinVL.

    Method                  ↑Accuracy   ↑Binary   ↑Open   ↑Consist.   ↑Plausib.   ↑Valid.   ↓Distrib.
    VinVL (reproduced)      64.53       82.36     48.79   94.14       84.77       96.55     4.72
    VinVL+L_IO              64.65       82.43     48.94   94.17       84.81       96.61     4.73
    VinVL+L_C+IO            64.71       82.38     49.12   94.06       84.84       96.65     4.55
    VinVL+L_C               64.85       82.59     49.19   94.00       84.91       96.62     4.59
    Δ (VinVL+L_C − VinVL)   +0.32       +0.23     +0.40   -0.14       +0.14       +0.07     -0.13

According to the results, all of our models answer more accurately and outperform the reproduced model in all metrics, except in some cases of Consistency and Distribution. For example, even though VinVL+L_C answers 0.40% better on open questions and 0.23% better on yes/no questions, resulting in 0.32% higher overall accuracy, it performs 0.14% worse in the Consistency metric. This means that when our model fails, the prediction is truly meaningless with respect to the given question. However, this model shows the best performance compared with the other versions of VinVL+L: VinVL+L_IO only holds the highest Consistency (+0.03% compared with the reproduced VinVL and +0.17% compared with VinVL+L_C), and VinVL+L_C+IO outperforms all compared models in Validity and Distribution.

We show the results of the prior State-of-the-Art methods in Table 3. Even though the reproduced baseline is weaker, our VinVL+L method noticeably surpasses the original version in the primary metric: +0.20% of overall accuracy for VinVL+L_C.

5.3. Summary and Discussion

An improvement in visual question answering is achieved by taking the global information about the visual component into account; Tables 4 and 5 confirm this for all our VinVL+L models. In addition, we show the wrong predictions of our VinVL+L (along with the predictions of the reproduced VinVL) against the Ground Truth labels on image-question pairs randomly chosen from the validation set, see Figure 3. Even if our model's answers are wrong in the given examples, it is worth saying that some of them are not truly wrong, e.g., in the second example, the woman is indeed sitting and, in our opinion, additional information is missing to decide whether she is really resting rather than just sitting. Besides these examples, we show predictions from the test2019 set in Appendix A.

[Figure 3 shows four validation examples: "Who is wearing the dress?" (VinVL: Woman, VinVL+L: Woman, GT: Women), "What is the woman doing?" (VinVL: Walking, VinVL+L: Sitting, GT: Resting), "Inside what is the pizza?" (VinVL: Box, VinVL+L: Box, GT: Pizza box), and "What is inside the container next to the glass?" (VinVL: Straw, VinVL+L: Ice cream, GT: Packet).]
Figure 3: Wrong predictions w.r.t. the Ground Truth labels (GT). We show the wrong predictions of the two models (VinVL and VinVL+L) on randomly chosen image-question pairs from the validation set.

It is worth emphasizing that the listed models do not use scene features, only tags. A model using both scene tags and features did not achieve the expected results. This behavior was anticipated for two reasons. First, even if we follow the procedure used for generating the region features, the VL model obtains a vector with different semantics compared to the region features. To solve this issue, the scene features would have to be generated by the same model to avoid the subsequent confusion. Second, all image and text representations are passed to the modified BERT model, which is still a language model pre-trained on text corpora with additional visual features added. Therefore, the words still have a higher weight than the visual features.

Regarding the performance of the reproduced VinVL, we used the original code, including the pipeline presented in [6]. However, the network reproduced by us achieved worse performance in all metrics, e.g., by 0.12% in overall accuracy. Since the main goal is to improve this method, we decided to primarily compare our models with the reproduced version, on which the benefits are best observed. All the listed models were trained on the same device with the same hyperparameter settings and differ only in the used novel visual representations. Therefore, our article shows the effectiveness of incorporating global location information into a system that otherwise works only on the basis of objects.

6. Conclusion

This paper presents VinVL+L, an enriched version of VinVL with location context as a novel visual representation. We generate the new representations as scene tags and features and prepend them before the original embeddings of the architecture. Our version achieves higher overall accuracy than the original method on the GQA dataset, and we show that the global information about the entire image influences the answers and thus should not be ignored. The best result of 64.85% overall accuracy is achieved by the model using the 365 location categories as scene tags. Besides, we performed an Approximate Randomization test to verify that the achieved results are statistically significant. Similarly, weather recognition for outdoor scenes could be included in the concept to help the network with the alignment of image-text pairs with respect to global information. All generated data and code are publicly available on our GitHub.

Acknowledgments

The work has been supported by the grant of the University of West Bohemia, project No. SGS-2022-017. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.
References

[1] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28, Curran Associates, Inc., 2015, pp. 91–99.
[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[3] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[4] W. Chen, Z. Gan, L. Li, Y. Cheng, W. Wang, J. Liu, Meta module network for compositional visual reasoning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 655–664.
[5] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[6] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
[7] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion, MDETR – modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
[8] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[9] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[12] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[13] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, M. Li, Bag of tricks for image classification with convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 558–567.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
[17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for MobileNetV3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
[19] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
[20] M. Tan, Q. Le, EfficientNetV2: Smaller models and faster training, in: International Conference on Machine Learning, PMLR, 2021, pp. 10096–10106.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, Vienna, 2021.
[22] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[24] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[26] D. A. Hudson, C. D. Manning, GQA: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6700–6709.
[27] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] M. Ren, R. Kiros, R. Zemel, Exploring models and data for image question answering, Advances in Neural Information Processing Systems 28 (2015).
[29] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[30] R. Wightman, PyTorch Image Models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
[31] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion, MDETR – modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1780–1790.
[32] D. Hudson, C. D. Manning, Learning by abstraction: The neural state machine, Advances in Neural Information Processing Systems 32 (2019).
[33] S. Riezler, J. T. Maxwell III, On some pitfalls in automatic evaluation and significance testing for MT, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 57–64.
A. Additional prediction examples

[Figure 4 shows sixteen randomly selected image-question pairs from the GQA test2019 set together with the answers of VinVL and VinVL+L, e.g., "What place is the photo at?" (VinVL: Classroom, VinVL+L: Restaurant) and "Where is the young boy running?" (VinVL: Sand, VinVL+L: Beach).]
Figure 4: Randomly selected predictions; the VinVL+L and VinVL methods evaluated on the GQA test2019 set. The VinVL+L method impacts the decisions based on the newly included binary location (i.e., indoor and outdoor) metadata. In most cases where the VinVL+L prediction differs from VinVL, VinVL+L produced a subjectively more reasonable prediction.