VinVL+L: Enriching Visual Representation with Location Context in VQA

Jiří Vyskočil 1,*, Lukáš Picek 1
1 Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Technická 8, Pilsen, Czech Republic
* Corresponding author: vyskocj@kky.zcu.cz (J. Vyskočil); ORCID 0000-0002-6443-2051 (J. Vyskočil), 0000-0002-6041-9722 (L. Picek)
26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023

Abstract
In this paper, we describe a novel method – VinVL+L – that enriches the visual representations (i.e., object tags and region features) of the State-of-the-Art Vision and Language (VL) method – VinVL – with Location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features, both of which are made public to allow reproducibility and further experiments, (ii) updated the architecture of the existing VinVL method to include the new feature sets, and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, the VinVL+L method provides an incremental improvement over the State-of-the-Art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, an improvement of +0.32%; the statistical significance of the new representations is verified via Approximate Randomization. The code and the newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.

Keywords
Vision and Language, Visual Question Answering, Location Recognition, Oscar, VinVL

1. Introduction

Multi-modal understanding systems can answer general questions from visual and textual data. These questions largely focus on objects and their relations, appearances, or behaviors; the rest ask about the overall scene, such as the location or the weather. Most multi-modal systems are split into visual and textual modules, followed by image-text alignment. Faster R-CNN [1] region features of the detected objects are commonly used for the visual representation and BERT [2] embeddings for the textual one. However, such a visual model only provides information about objects, from which the entire multi-modal system must decide even simple questions like "Are people inside or outside?".

[Figure 1 shows an input image with the question "Where is it?" and the top-3 answers of both models: VinVL (our reproduction) predicts shop (24.6%), store (22.5%), and porch (9.0%), whereas VinVL+L predicts bedroom (59.9%), living room (15.1%), and hotel room (2.3%).]
Figure 1: Example predictions of the proposed VinVL+L. We compare VinVL+L with the State-of-the-Art VinVL on a randomly selected input pair (i.e., image and question) from the GQA test set. VinVL+L aligns the answer to the question better thanks to the enriched visual features.

We intuitively feel that, in general, objects are related to the indoor/outdoor division of a scene even if they cannot be assigned to it directly; they carry a certain weight on the basis of which the correct answer can be decided. For example, cars, sky, and trees are more likely to belong to an outdoor scene; however, the scene may still be indoors, with these categories detected through a garage door.

In addition to [3, 4, 5], the mentioned paradigm of split image-text modules is followed by VinVL [6], which builds on Oscar [3] and additionally adds object tags, i.e., the textual output of an object detection network, to the region features. However, a clear cross-modal representation of the scene is still missing, which can harm the network, as shown in Figure 1.
Our method, based on VinVL, brings a new representation that includes information about the location into the system. This representation is obtained using a classification network trained on the Places365 dataset, which contains a total of 365 location categories, each of which falls directly into either the indoor or the outdoor supercategory. These labels are then passed as scene tags to our VinVL+L method to predict the answers. Besides, we utilize scene features that are generated in the same way as the region features of Oscar/VinVL. Finally, we evaluate the influence of these novelties on the answers. An example of the top-3 predictions of VinVL and our VinVL+L is visualized in Figure 1; more examples are shown in Section 5.3 and Appendix A. Our contributions are:

• We enrich the visual representations of VinVL using global information about the image – its location.
• We present the effectiveness of each new cross-modal representation by comparing the related models, including a reproduced version of VinVL.
• We improve VinVL in visual question answering (VQA), reaching an overall accuracy of 64.85% on the GQA dataset.
• We provide the location-context data that we generated for the GQA dataset.

2. Related Work

Many Vision and Language (VL) methods, like [3, 6, 7, 8], focus on pre-training generic models by combining multiple datasets from different tasks. The models are then fine-tuned to downstream tasks that include image captioning, visual reasoning, or visual question answering. In this section, we briefly review recent approaches to VL tasks and their commonly used Vision Encoders, which are the most relevant to our work.

Vision Encoders. Convolutional Neural Networks (CNNs) gained popularity in image classification when AlexNet [9] won the ImageNet 2012 competition. In the subsequent period, models with skip-connections [10, 11, 12, 13], with blocks having small feed-forward networks in parallel connections [12, 14, 15], or with a focus on optimization [16, 17, 18, 19, 20] were created. In recent years, Transformer-based methods, such as the Vision Transformer [21] or its modification with shifted windows [22], gained favor thanks to their computational efficiency and accuracy. These image classification models are often used as backbone architectures in object detection to predict bounding boxes together with a classification of each object in the image. The most popular detectors are the one-shot YOLO-based architectures [23, 24] and the two-shot Faster R-CNN-based architectures [1], which are generally slower but more accurate than the one-shot ones. The image classification or object detection models are further used as Visual Encoders in VL tasks.

BERT-based VL Methods. End-to-end methods such as MDETR [7] use a pre-trained image classification backbone to extract features and concatenate them with word embeddings taken from a BERT-based model [2, 25]. However, some existing VL methods [3, 4] reuse the features extracted by another approach, e.g., a bottom-up mechanism [5] that extracts object regions via Faster R-CNN, and fine-tune a novel method with an unchanged visual model. These methods include Oscar [3], which introduces object tags as a cross-modal representation to improve the alignment of image-text pairs. Based on Oscar, VinVL [6] improves the visual representation by pre-training a larger model on multiple object detection datasets. Since this method holds State-of-the-Art results on the GQA dataset [26] and represents the image as a set of regional features while suppressing global scene information, we decided to improve the alignment of the cross-modal representation by location recognition.

3. Datasets

Early datasets, such as VQA [27] and COCO-QA [28], contain only the core annotation needed for visual question answering: an image, a question, and the desired one-word answer. However, we are interested in datasets with richer annotations that allow recognizing the type of location in the input image. The GQA dataset does contain such annotations, but only for a part of the images. Therefore, two existing datasets, Places365 and GQA, are suitable for our task; both are thoroughly described below.

Places365 [29]. This dataset consists of 365 location categories that can be directly mapped to an indoor or outdoor category. The balanced training set varies from 3,068 to 5,000 images per location category, while the validation set consists of 50 images per category.
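For illustration, the 365 categories can be collapsed into the two supercategories with a simple lookup. The following is a minimal Python sketch, assuming the Places365 devkit files categories_places365.txt and IO_places365.txt, where (to our knowledge) 1 marks an indoor and 2 an outdoor category; the file names and format are assumptions to be checked against the devkit.

    def load_indoor_outdoor_map(categories_path="categories_places365.txt",
                                io_path="IO_places365.txt"):
        # categories_places365.txt: one category per line, e.g. "/a/airfield 0"
        categories = [line.split()[0] for line in open(categories_path)]
        # IO_places365.txt: category followed by 1 (indoor) or 2 (outdoor)
        io_flag = {line.split()[0]: int(line.split()[1]) for line in open(io_path)}
        return {c: ("indoors" if io_flag[c] == 1 else "outdoors") for c in categories}

    io_map = load_indoor_outdoor_map()
    # e.g. io_map["/b/bedroom"] is expected to be "indoors" (names keep the devkit prefix)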
GQA dataset [26]. This dataset consists of 22,669,678 questions (of which the test2019 split contains 4,237,524) over 113,018 images, with 1,878 possible answers to open and binary yes/no questions. In addition to the questions and answers, each image contains annotations of objects, the relations between them, and their attributes. Besides, an image can contain global information in the form of location and weather, the distribution of which is shown in Table 1. Regarding the evaluation of the results, the following metrics are used:

• Accuracy – overall accuracy, the primary metric,
• Binary – accuracy on yes/no questions,
• Open – accuracy on open questions,
• Consistency – overall accuracy including equivalent answers,
• Plausibility – relative number of answers making sense with respect to the dataset,
• Validity – relative number of answers that are in the question scope,
• Distribution – overall match between the distributions of true answers and model predictions.
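The first three metrics reduce to simple ratios once the per-question correctness is known. The sketch below is only illustrative: it assumes each question record holds its gold "answer" and a hypothetical is_binary flag (the official evaluation script derives the binary/open split from the question type and also computes the remaining metrics).

    def gqa_accuracies(predictions, questions):
        """predictions: {question_id: predicted answer};
        questions: {question_id: {"answer": ..., "is_binary": ...}} (assumed layout)."""
        overall, binary, open_q = [], [], []
        for qid, record in questions.items():
            hit = predictions.get(qid) == record["answer"]
            overall.append(hit)
            (binary if record["is_binary"] else open_q).append(hit)
        mean = lambda xs: 100.0 * sum(xs) / max(len(xs), 1)
        return {"Accuracy": mean(overall), "Binary": mean(binary), "Open": mean(open_q)}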
Table 1
GQA dataset. Distribution of the annotated global information about the scenes in the training and validation splits. The indoors/outdoors rows are given relative to the images with an annotated location.

    Metadata          Training          Validation
    # of images       74,942            10,696
    with weather       6,600  (8.8%)       952  (8.9%)
    with location     23,370 (31.2%)     3,265 (30.5%)
      – indoors        4,520 (19.3%)       638 (19.5%)
      – outdoors      18,850 (80.7%)     2,627 (80.5%)
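The location and weather counts in Table 1 can be recomputed directly from the GQA scene graphs. A minimal sketch follows, assuming the scene graphs are distributed as a JSON dictionary keyed by image id (e.g., a file named train_sceneGraphs.json) in which "location" and "weather" are optional per-image fields; both the file name and the keys are assumptions about the GQA release.

    import json
    from collections import Counter

    with open("train_sceneGraphs.json") as f:
        scene_graphs = json.load(f)

    counts = Counter()
    for graph in scene_graphs.values():
        counts["images"] += 1
        counts["with weather"] += "weather" in graph     # booleans count as 0/1
        counts["with location"] += "location" in graph

    print(counts)  # should roughly reproduce the first rows of Table 1

The indoors/outdoors split additionally requires a mapping from the free-form location values to the two supercategories.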
4. Methodology

Vision and Language (VL) approaches are commonly divided into two phases: pre-training and fine-tuning. In pre-training, multiple datasets from different tasks are combined to create generic models. In fine-tuning, these models are then trained on each of these datasets, called downstream tasks. In this study, we focus on improving the current State-of-the-Art VinVL [6] on the GQA dataset [26]. The improved version learns the image-text representation with respect to the global information about the entire image, such as indoors/outdoors, which is given by the novel scene tags and features.

4.1. Adding locations to VinVL

Based on VinVL [6], we present an extended architecture with scene tags and features. In our work, these representations are generated using a classification network fine-tuned on the Places365 dataset [29], reaching an accuracy of up to 96.1% in the case of binary indoor/outdoor classification (see Section 5.1 for more details). Scene tags are the predicted location categories. Scene features are built in the same style as their object counterparts, i.e., as a 2,048-dimensional feature vector (obtained via Global Average Pooling) concatenated with the top-left & bottom-right corners and the height & width. Besides, the novel scene representations are prepended before the object ones so that the scenes always have the same position in the embeddings for each image-text pair input, as outlined in Figure 2.

[Figure 2 depicts the model input for the masked question "Is [MASK] indoors or outdoors?": the word tokens, the scene & object tags (Living room, Indoors, Tree, Dog, Snowman), and the scene & region features are embedded and processed by multi-layer Transformers trained with the Contrastive Loss and the Masked Token Loss.]
Figure 2: Illustration of VinVL+L. We represent the image-text pair as a quintuple [word tokens, scene tags, object tags, scene features, region features], where the word tokens, object tags, and region features are taken from VinVL [6]. Scene tags and features are proposed to improve the alignment of cross-domain semantics. The example shows a case where, based on the detected objects alone, the scene could be classified as outdoors rather than indoors.
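A minimal sketch of how a scene tag and a scene feature can be produced with the Timm library is given below. The checkpoint is assumed to be already fine-tuned on Places365 (Section 5.1), and the padding of the pooled Swin-B output to the 2,048-dimensional VinVL size is our assumption, since the exact projection is not detailed here.

    import timm
    import torch
    import torch.nn.functional as F

    # "places_model" predicts the 365 categories; "pool_model" is the same backbone
    # with the classifier removed, so it returns pooled (globally averaged) features.
    places_model = timm.create_model("swin_base_patch4_window7_224_in22k",
                                     pretrained=True, num_classes=365)
    pool_model = timm.create_model("swin_base_patch4_window7_224_in22k",
                                   pretrained=True, num_classes=0)

    @torch.no_grad()
    def scene_representation(image, categories, top_k=1):
        """image: (1, 3, 224, 224) tensor; categories: list of 365 Places365 names."""
        probs = places_model(image).softmax(dim=-1)[0]
        scene_tags = [categories[i] for i in probs.topk(top_k).indices.tolist()]

        feat = pool_model(image)[0]                    # 1,024-d for Swin-B
        feat = F.pad(feat, (0, 2048 - feat.numel()))   # match VinVL's 2,048-d (assumption)
        # Whole-image "box": normalized top-left and bottom-right corners, height, width.
        geometry = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
        scene_feature = torch.cat([feat, geometry])
        return scene_tags, scene_feature

The resulting scene tags are inserted into the tag sequence, and the scene feature is prepended before the region features, as described above.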
Even though we do not perform pre-training on various tasks with the new representation, the established pre-training objective of Oscar/VinVL [6] can, in general, be followed. The only change is in the definition of the (w, q, v) triplet input, where w is the word embedding sequence of the text, q is the word embedding sequence of the scene and object tags detected from the image, and v is the visual embedding sequence of the entire image and all detected regions. This input can be viewed from two different perspectives [3, 6]:

    x ≜ [ w, q | v ]  (Dictionary View: Q&A and tag tokens | image features)
      or
    x ≜ [ w | q, v ]  (Modality View: Q&A text | tags and image features),    (1)

where the Dictionary View defines the Masked Token Loss ℒ_MTL, applied on the discrete token sequence h ≜ [w, q] to predict the masked tokens h_i based on their surrounding tokens h_∖i:

    ℒ_MTL = −E_{(h,v)∼𝒟} log p(h_i | h_∖i, v).    (2)

The Modality View defines the Contrastive Loss ℒ_CL for the image representation h′ ≜ [q, v], which is "polluted" by randomly replacing q with another sequence of tags from the dataset 𝒟. To distinguish the original pair (y = 1) from the polluted one (y = 0), a binary classifier f(·), implemented as a fully-connected layer, is applied on top of the [CLS] token. This loss function is defined as [3]:

    ℒ_CL = −E_{(w,h′;y)∼𝒟} log p(y | f(w, h′)).    (3)

Alternatively, VinVL [6] applies the 3-way Contrastive Loss ℒ_CL3 on h* ≜ [w, q, v], instead of the binary ℒ_CL used in Oscar [3], to predict whether the (w, q, v) triplet is the original one (c = 0), contains a polluted w (c = 1), or contains a polluted q (c = 2):

    ℒ_CL3 = −E_{(h*;c)∼𝒟} log p(c | f(w, q, v)).    (4)

By fusing Equations 2 and 4, or 2 and 3, the full pre-training objective is:

    ℒ_Pre-training = ℒ_MTL + ℒ_CL3 (or ℒ_CL).    (5)
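With the 3-way contrastive term, the objective of Equation 5 reduces to two cross-entropy terms. A minimal PyTorch sketch follows; the tensor shapes and the -100 padding convention are assumptions for illustration.

    import torch.nn.functional as F

    def pretraining_loss(mlm_logits, mlm_labels, cls_logits, contrast_labels):
        """mlm_logits: (B, T, vocab); mlm_labels: (B, T) with -100 at unmasked positions;
        cls_logits: (B, 3) from the classifier f on top of [CLS];
        contrast_labels: (B,) in {0: original, 1: polluted w, 2: polluted q}."""
        l_mtl = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                                ignore_index=-100)            # Eq. (2)
        l_cl3 = F.cross_entropy(cls_logits, contrast_labels)  # Eq. (4)
        return l_mtl + l_cl3                                  # Eq. (5)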
4.2. Implementation Details

We use the same feature-vector size as VinVL (i.e., 2,048); these features are then concatenated with the positions and sizes, as described in Section 4.1. The models used from the Timm library [30] are: resnext50d_32x4d [13], gluon_inception_v3 [15], mobilenetv3_small_100 [18], gc_efficientnetv2_rw_t [20], vit_large_patch16_224_in21k [21], and swin_base_patch4_window7_224_in22k [22]. All models are fine-tuned for 20 epochs with SGD and the Focal Loss. We use an initial learning rate of 0.01 and reduce it with a plateau scheduler. The batch size is 64 with 2 gradient accumulation steps. As augmentations, we use a horizontal flip (probability of 50%), a random resized crop (scale from 0.8 to 1.0), and a random brightness/contrast change (probability of 20%); a sketch of this setup is given below.
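A minimal sketch of the fine-tuning configuration with the Timm and torchvision libraries follows. The SGD momentum, the focal-loss gamma, and the use of ColorJitter as a stand-in for the brightness/contrast augmentation are assumptions; the data loading and the training loop are omitted.

    import timm
    import torch
    import torch.nn.functional as F
    from torchvision import transforms

    model = timm.create_model("swin_base_patch4_window7_224_in22k",
                              pretrained=True, num_classes=365)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    ACCUMULATION_STEPS = 2          # batch size 64 with 2 accumulation steps

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply(
            [transforms.ColorJitter(brightness=0.2, contrast=0.2)], p=0.2),
        transforms.ToTensor(),
    ])

    def focal_loss(logits, targets, gamma=2.0):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)         # probability assigned to the true class
        return ((1.0 - pt) ** gamma * ce).mean()

During training, optimizer.step() is called every ACCUMULATION_STEPS batches and scheduler.step(validation_accuracy) after each epoch.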
In the case of the VL model, we use the pre-trained Oscar+BASE with VinVL features and follow the presented procedure, which is the same as for the original Oscar, i.e., pre-training on the unbalanced "all-split" of the GQA dataset for 5 epochs and fine-tuning the model that is best with respect to overall accuracy on the "balanced-split" for 2 epochs. All the results are shown in Section 5.2, together with the reproduction of VinVL that we improve upon.

5. Experiments

Our approach is divided into two separate steps. First, we adapt several image classification models to the Places365 dataset and select the most accurate one to generate the visual representation for the VL model. Then, we fine-tune the VL model using its original and our new visual features.

5.1. Location Recognition

We selected several pre-trained image classification networks in order to cover a certain range of different approaches to location recognition. These include methods focused on high inference speed as well as methods containing skip-connections, parallel paths, or transformers. The results of the fine-tuned models on the Places365 dataset are shown in Table 2.

Table 2
Performance evaluation of the selected networks. Accuracy and Top3 are evaluated on all categories of the Places365-val dataset, AccuracyIO on their binary supercategories (Indoor/Outdoor).

    Backbone         Accuracy   Top3   AccuracyIO
    MobileNetV3      47.9       70.8   94.6
    InceptionV3      53.1       76.0   95.3
    ResNeXt-50-D     54.2       77.0   95.6
    EfficientNetV2   54.7       77.4   95.6
    ViT-Large        54.9       77.7   95.5
    Swin-Base        56.0       78.7   96.1

ResNeXt-50, EfficientNetV2, and ViT-Large have similar performance, while ViT-Large performs slightly worse in indoors/outdoors classification. This is because, when ViT-Large is wrong, it predicts the incorrect indoors/outdoors supercategory more often than the previous two models. The best results are achieved by Swin-Base in both the 365-location and the binary indoors/outdoors recognition. It obtains 56% top-1 accuracy in recognizing the 365 locations, which is 1.1% higher than that of the second-best ViT-Large. Therefore, this model is further used to extract the novel visual representations for our VinVL+L.

5.2. Visual Question Answering

Statistical significance of novel features. We show the advantages of the new visual representations by comparing our method with the reproduced VinVL trained with the same pipeline; see Table 5. The used scene tags, i.e., the 365 location categories (C) or indoors/outdoors (IO), are denoted in the subscript of the model name. Besides, we compute the statistical significance [33] between the two models to show that recognizing the location categories truly brings benefits and is not just a coincidence. For demonstration, we compare the reproduced VinVL with our VinVL+L_C on the validation dataset. Our goal is to reject the null hypothesis defined as "there is no difference between system A and B". To do this, we shuffle the predictions between systems A and B with a probability of 50% and compare the resulting performance with the initial one (all repeated 10,000 times). Consequently, we reject the null hypothesis at the 95% significance level, i.e., at a threshold of 0.05, with an obtained p-value of 0.03. The same conclusion is reached for VinVL+L_IO. In the case of VinVL+L_C+IO, the difference is not significant, so the null hypothesis cannot be rejected.

Table 5
Accuracy of answers on the validation dataset. We evaluate the reproduced VinVL and our improved versions on the balanced validation split of GQA.

    Backbone             Accuracy   Binary   Open
    VinVL (reproduced)   63.2       52.5     82.3
    VinVL+L_C+IO         63.4       52.7     82.3
    VinVL+L_C            63.8       53.0     83.0
    VinVL+L_IO           64.1       53.7     82.6
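The randomization test described above can be sketched as follows; this is a minimal implementation of the described shuffling procedure, assuming the per-question 0/1 correctness vectors of the two systems are aligned.

    import random

    def approximate_randomization(correct_a, correct_b, trials=10_000, seed=0):
        """Returns the approximate p-value for the accuracy difference of A and B."""
        rng = random.Random(seed)
        n = len(correct_a)
        observed = abs(sum(correct_a) - sum(correct_b)) / n
        at_least_as_extreme = 0
        for _ in range(trials):
            diff = 0
            for a, b in zip(correct_a, correct_b):
                if rng.random() < 0.5:      # swap the two systems' predictions
                    a, b = b, a
                diff += a - b
            if abs(diff) / n >= observed:
                at_least_as_extreme += 1
        return (at_least_as_extreme + 1) / (trials + 1)

A returned value below the 0.05 threshold rejects the null hypothesis, matching the p-value of 0.03 reported above.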
The significance may seem small from a general point of view. However, it should be considered that these results were achieved simply by adding locations to the system. To improve the significance, the scene features should be generated by the same model as the region features. In addition, other global information, such as the weather, may be included.

Comparison on the test set. Although we followed the original training pipeline, on which the results of our models are based, it should be noted that the reproduced VinVL works worse than the original version. Therefore, we decided to select the models after the 1st, 3rd, and 5th epoch of the pre-training on the unbalanced set. Then we fine-tuned these models for 2 epochs on the balanced set to slightly increase the final performance. We selected the best model with respect to overall accuracy on the validation set and submitted the results to the evaluation server. The performance of the models is listed in Table 4. The reproduced version of VinVL still performs worse than the original one, but the difference decreases with this modification of the training.

Table 3
Results of individual methods according to the official leaderboard. We show the performance of the prior State-of-the-Art methods on the GQA dataset, sorted by the primary metric, Accuracy. The meaning of the individual metrics is described in Section 3.

    Method          ↑Accuracy   ↑Binary   ↑Open   ↑Consist.   ↑Plausib.   ↑Valid.   ↓Distrib.
    Bottom-Up [5]   49.74       66.64     34.83   78.71       84.57       96.18     5.98
    MMN [4]         60.83       78.90     44.89   92.49       84.55       96.19     5.54
    Oscar [3]       61.62       -         -       -           -           -         -
    MDETR [31]      62.45       80.91     46.15   93.95       84.15       96.33     5.36
    LXR955 [8]      62.71       79.79     47.64   93.10       85.21       96.36     6.42
    NSM [32]        63.17       78.94     49.25   93.25       84.28       96.41     3.71
    VinVL [6]       64.65       82.63     48.77   94.35       84.98       96.62     4.72

Table 4
Performance evaluation of the individual scene tags. We compare the reproduced VinVL with additional scene tags: Indoors/Outdoors (IO) and/or the 365 location categories (C). The last row gives the improvement/deterioration as the difference between our best model and the reproduced VinVL.

    Method                  ↑Accuracy   ↑Binary   ↑Open   ↑Consist.   ↑Plausib.   ↑Valid.   ↓Distrib.
    VinVL (reproduced)      64.53       82.36     48.79   94.14       84.77       96.55     4.72
    VinVL+L_IO              64.65       82.43     48.94   94.17       84.81       96.61     4.73
    VinVL+L_C+IO            64.71       82.38     49.12   94.06       84.84       96.65     4.55
    VinVL+L_C               64.85       82.59     49.19   94.00       84.91       96.62     4.59
    Δ (VinVL+L_C − VinVL)   +0.32       +0.23     +0.40   -0.14       +0.14       +0.07     -0.13

According to the results, all of our models answer more accurately and outperform the reproduced model in all metrics, except in some cases of Consistency and Distribution. For example, even though VinVL+L_C answers 0.40% better on open questions and 0.23% better on yes/no questions, resulting in 0.32% higher overall accuracy, it performs 0.14% worse in the Consistency metric. This means that when our model fails, the prediction is truly meaningless with respect to the given question. However, this model shows the best performance compared with the other versions of VinVL+L: VinVL+L_IO only holds the highest Consistency (+0.03% compared with the reproduced VinVL and +0.17% compared with VinVL+L_C), and VinVL+L_C+IO outperforms all compared models in Validity and Distribution.

We show the results of the prior State-of-the-Art methods in Table 3. Even though the reproduced baseline is weaker, our VinVL+L method noticeably surpasses the original version in the primary metric: +0.20% of overall accuracy for VinVL+L_C.

5.3. Summary and Discussion

An improvement in visual question answering is achieved by taking the global information about the visual component into account; Tables 4 and 5 confirm this for all our VinVL+L models. In addition, we show the wrong predictions of our VinVL+L (along with the predictions of the reproduced VinVL) against the Ground Truth labels on image-question pairs randomly chosen from the validation set, see Figure 3. Even if our model's answers are wrong in the given examples, it is worth saying that some of them are not truly wrong, e.g., in the second example, the woman is indeed sitting and, in our opinion, additional information is missing to decide whether she is really resting rather than just sitting. Besides these examples, we show predictions from the test2019 set in Appendix A.

[Figure 3 shows four validation examples: "Who is wearing the dress?" (VinVL: Woman, VinVL+L: Woman, GT: Women), "What is the woman doing?" (VinVL: Walking, VinVL+L: Sitting, GT: Resting), "Inside what is the pizza?" (VinVL: Box, VinVL+L: Box, GT: Pizza box), and "What is inside the container next to the glass?" (VinVL: Straw, VinVL+L: Ice cream, GT: Packet).]
Figure 3: Wrong predictions w.r.t. the Ground Truth labels (GT). We show the wrong predictions of the two models (VinVL and VinVL+L) on randomly chosen image-question pairs from the validation set.

It is worth emphasizing that the listed models do not use scene features, only tags. A model using both scene tags and features did not achieve the expected results. This behavior was anticipated for two reasons. First, even if we follow the procedure used for generating the region features, the VL model obtains a vector with different semantics compared to the region features. To solve this issue, the scene features would have to be generated by the same model to avoid the subsequent confusion. Second, all image and text representations are passed to the modified BERT model, which is still a language model pre-trained on text corpora with additional visual features added. Therefore, the words still have a higher weight than the visual features.

Regarding the performance of the reproduced VinVL, we used the original code, including the pipeline presented in [6]. However, the network reproduced by us achieved worse performance in all metrics, e.g., by 0.12% in overall accuracy. Since the main goal is to improve this method, we decided to primarily compare our models with the reproduced version, on which the benefits are best observed. All the listed models were trained on the same device with the same hyperparameter settings and differ only in the used novel visual representations. Therefore, our article shows the effectiveness of incorporating global location information into a system that otherwise works only on the basis of objects.

6. Conclusion

This paper presents VinVL+L, an enriched version of VinVL with location context as a novel visual representation. We generate the new representations as scene tags and features and prepend them before the original embeddings of the architecture. Our version achieves higher overall accuracy than the original method on the GQA dataset, and we show that the global information about the entire image influences the answers and thus should not be ignored. The best result of 64.85% overall accuracy is achieved by the model using the 365 location categories as scene tags. Besides, we performed an Approximate Randomization test to verify that the achieved results are statistically significant. Similarly, weather recognition for outdoor scenes could be included in the concept to help the network with the alignment of image-text pairs with respect to global information. All generated data and code are publicly available on our GitHub.

Acknowledgments

The work has been supported by the grant of the University of West Bohemia, project No. SGS-2022-017. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.
References

[1] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28, Curran Associates, Inc., 2015, pp. 91–99.
[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[3] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[4] W. Chen, Z. Gan, L. Li, Y. Cheng, W. Wang, J. Liu, Meta module network for compositional visual reasoning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 655–664.
[5] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[6] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
[7] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion, MDETR – modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
[8] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[9] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[12] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[13] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, M. Li, Bag of tricks for image classification with convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 558–567.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
[17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for MobileNetV3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
[19] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
[20] M. Tan, Q. Le, EfficientNetV2: Smaller models and faster training, in: International Conference on Machine Learning, PMLR, 2021, pp. 10096–10106.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, Vienna, 2021.
[22] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[24] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[26] D. A. Hudson, C. D. Manning, GQA: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6700–6709.
[27] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] M. Ren, R. Kiros, R. Zemel, Exploring models and data for image question answering, Advances in Neural Information Processing Systems 28 (2015).
[29] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[30] R. Wightman, PyTorch Image Models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
[31] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion, MDETR – modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1780–1790.
[32] D. Hudson, C. D. Manning, Learning by abstraction: The neural state machine, Advances in Neural Information Processing Systems 32 (2019).
[33] S. Riezler, J. T. Maxwell III, On some pitfalls in automatic evaluation and significance testing for MT, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 57–64.
A. Additional prediction examples

[Figure 4 shows sixteen randomly selected image-question pairs from the GQA test2019 set together with the answers of VinVL and VinVL+L, e.g., "What place is the photo at?" (VinVL: Classroom, VinVL+L: Restaurant) and "Where is the young boy running?" (VinVL: Sand, VinVL+L: Beach).]
Figure 4: Randomly selected predictions; the VinVL+L and VinVL methods evaluated on the GQA test2019 set. The VinVL+L method impacts the decisions based on the newly included binary location (i.e., indoor and outdoor) metadata. In most cases where the VinVL+L prediction differs from VinVL, VinVL+L produced a subjectively more reasonable prediction.