1. Introduction

VinVL+L: Enriching Visual Representation with Location Context in VQA

Jiří Vyskočil

Lukáš Picek

0 0 Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia , Technická 8, Pilsen , Czech Republic

In this paper, we describe a novel method - VinVL+L - that enriches the visual representations (i.e. object tags and region features) of the State-of-the-Art Vision and Language (VL) method - VinVL - with Location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features; both were made public to allow reproducibility and further experiments, (ii) did an architectural update to the existing VinVL method to include the new feature sets, and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, the VinVL+L method provides incremental improvement to the State-of-the-Art VinVL in Visual Question Answering (VQA). The VinVL+L achieved an accuracy of 64.85% and increased the performance by +0.32% in terms of accuracy on the GQA dataset; the statistical significance of the new representations is verified via Approximate Randomization. The code and newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.

eol>Vision and Language Visual Question Answering Location Recognition Oscar VinVL

1. Introduction

Multi-modal understanding systems can answer general questions from visual and textual data. These questions are largely focused on objects and their relations, appearances, or behaviors. Rest of them are asked about the overall scene, such as location or weather. Most of multimodal systems are split into visual and textual modules, followed by image-text alignment. Faster R-CNN [1] region features of detected objects are commonly used for + visual representation and BERT [2] embeddings for the Where is it? textual. However, such visual model only provides information about objects, from which the entire multi-modal VinVL (our) VinVL+L system must decide simple questions like "Are people shop (24.6%) bedroom (59.9%) inside or outside?". store (22.5%) living room (15.1%)

We intuitively feel that in general, the objects are re- porch (9.0%) hotel room (2.3%) lated to indoor/outdoor scene division even if they cannot be directly assigned. They have a certain weight on the Figure 1: Example predictions of the proposed VinVL+L. basis of which the correct answer can be decided. For We compare VinVL+L with the State-of-the-Art VinVL on the example, cars, sky, and trees are more likely to belong randomly selected input pair (i.e. image and question) from to an outdoor scene, however, the scene may be indoors, the GQA test set. The VinVL+L better aligns the answer to and these categories can be detected through the garage the question thanks to the enriched visual features. door. In addition to [3, 4, 5], the mentioned paradigm of splitting image-text modules follows VinVL [6] based on Oscar [3] that additionally adds object tags, i.e., tex- features. However, a clear cross-modal representation of tual output from an Object Detection network, to region the scene is still missing, which can harm the network, as shown in Figure 1. 26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Our method, based on VinVL, brings a new represen*KCleobrerre(sepdosn.)d,iKnrge maust,hLoorw.er Austria, Austria, Feb. 15-17, 2023 tation including information about the location into the $ vyskocj@kky.zcu.cz (J. Vyskočil) system. This representation is obtained using a classifi0000-0002-6443-2051 (J. Vyskočil); 0000-0002-6041-9722 cation network trained on the Places365 dataset having a (L. Picek) total of 365 location categories. All of these categories are © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License directly split into one of the indoor and outdoor supercatCPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) egories. All of these labels are then passed as scene tags BERT-based VL Methods End-to-end methods such to our VinVL+L method to predict the answers. Besides, as MDETR [7] use a pre-trained image classification we utilize scene features that are generated in the same backbone to extract features and concatenate them with way as the region features of Oscar/VinVL. Finally, we word embeddings taken from a BERT-based model [2, 25]. evaluate influences on answers while using these novel- However, some existing VL methods [3, 4] reuse the exties. An example of the top 3 predictions of the VinVL and tracted features from another approach, e.g., a bottom-up our VinVL+L is visualized in Figure 1. More examples are mechanism [5] that extracts object regions via Faster Rshown in Section 5.3 and Appendix A. Our contributions CNN, to fine-tune a novel method with an unchanged are: visual model. These methods include Oscar [3], which introduces object tags as cross-modal representation to • We enrich visual representations of the VinVL improve the alignment of the image-text pairs. Based using the global information about the image - on Oscar, VinVL [6] improves visual representation by location. pre-training larger model on multiple object detection • We present the efectiveness of each new cross- datasets. Since this method holds State-of-the-Art results modal representation as we compare their related on the GQA dataset [26] and represents the image as a models including a reproduced version of the set of regional features while suppressing global scene VinVL. information, we decided to improve the alignment of the • We improve the VinVL in visual question answer- cross-modal representation by location recognition. ing (VQA) with an overall accuracy of 64.85% on the GQA dataset. • We provide data with the location context that 3. Datasets we generated for the GQA dataset.

The early datasets, such as VQA [27] and COCO-QA [28],

contain only the core annotation needed for the vision 2. Related Work question answering: an image, a question, and a desired one-word answer. However, we are interested in dataset Many Vision and Language (VL) methods, like [3, 6, 7, 8], containing richer annotations to recognize types of locafocus on pre-training generic models by combining mul- tions in the image input. It does contain the GQA dataset, tiple datasets from diferent tasks. Then the models are but only for part of the images. Therefore, there are two ifne-tuned to downstream tasks that include: image cap- existing datasets Places365 and GQA suitable for our task. tioning, visual reasoning, or visual question answering. Both datasets are thoroughly described below. In this section, we briefly review recent approaches to VL tasks and their commonly used Vision Encoders, which are the most relevant for our work.

Vision Encoders Convolutional Neural Networks

(CNNs) gained popularity in image classification when AlexNet [9] won the ImageNet 2012 competition. In the subsequent period, models with skip-connections [10, GQA dataset [26] This dataset consists of 22,669,678 11, 12, 13] with blocks having small feed-forward net- quest ions (from which the test2019 split contains works in parallel connections [12, 14, 15], or with a fo- 4,237,524 questions) over 113,018 images with 1,878 poscus on optimization [16, 17, 18, 19, 20] were created. In sible answers to open and binary yes/no questions. In recent years, Transformer-based methods, such as Vi- addition to questions and answers, each image contains sion Transformer [21], or its modification with shifted annotations of objects, the relations between them, and windows [22], gained favor thanks to computational e-fi their attributes. Besides, each image contains global inciency and accuracy. These image classification models formation in the form of location and weather, the disare often used as backbone architectures in object detec- tribution of which is shown in Table 1. Regarding the tion to predict bounding boxes with a classification of evaluation of the results, the following metrics are used: each object in the image. The most popular detectors are the one-shot Yolo-based architectures [23, 24] and • Accuracy – overall accuracy, primary metric, two-shot Faster R-CNN-based architectures [1], which • Binary – accuracy of yes/no questions, are generally slower but more accurate than the one-shot • Open – accuracy of open questions, ones. The image classification or object detection models • Consistency – overall accuracy including equivaare further used as Visual Encoders in the VL tasks. lent answers,

Places365 [29] This dataset consists of 365 location

categories that we can directly map to indoors/outdoors category. The balanced training set varies from 3,068 to 5,000 images per location category, while the validation set consists of 50 images per category.

4. Methodology

The Vision and Language (VL) approaches are commonly divided into two phases: pre-training and fine-tuning. In pre-training, multiple datasets of diferent tasks are combined to create generic models. In fine-tuning, these models are then trained on each of these datasets, called downstream tasks. In this study, we focus on improving the current State-of-the-Art VinVL [6] in GQA dataset [26].

This improved version learns the image-text representation with respect to the global information of an entire image, such as indoors/outdoors, which is given by novel scene tags and features. 1–9 (1) (2) ≜ [, , ,, , , ,, , ,, ,] [ ,, , , ,, , , ]

⏟ &⏞ ⏟⏞ ⏟⏞ ⏟&⏞ ⏟ ⏞ ⏟ ⏞ where Dictionary View defines Masked Token Loss ℒ , applied on the discrete token sequence ℎ ≜ [, ], to predict the masked tokens ℎ based on their surrounding tokens ℎ∖:

ℒ = − E(ℎ,)∼ log (ℎ|ℎ∖, )

Modality View defines Contrastive Loss ℒ for the image representation ℎ′ ≜ [, ], which is "polluted" by randomly replacing with another sequence of tags from the dataset . To distinguish the original pair ( = 1) from the polluted one ( = 0), a binary classifier (.) as a fully-connected layer is applied on the top of the [CLS] token. This loss function is defined as [3]: ℒ = − E(,ℎ′;)∼ log (| (, ℎ))

(3)

Alternatively, VinVL [6] applies the 3-way Constrastive Loss ℒ3 on ℎ* ≜ [, , ], instead of the binary ℒ used in Oscar [3], to predict whether the (, , ) triplet is the original one ( = 0), contains a polluted ( = 1), or contains a polluted ( = 2): ℒ3 = − E(ℎ* ;)∼ log (| (, , )) (4) • Plausibility – relative number of answers making sequence of the text, is the word embedding sequence sense with respect to the dataset, of the scene and object tags detected from the image, and • Validity – relative number of answers that are in is the visual embedding sequence of the entire image the question scope, and all detected regions. This input can be viewed from • Distribution – overall match between the distri- two diferent perspectives as [3, 6]:

butions of true answers and model predictions.

4.1. Adding locations to VinVL

Based on VinVL [6], we present an extended architecture with scene tags and features. In our work, these representations are simply generated using a fine-tuned clas- ℒ − = ℒ + ℒ3 ( ) (5) sification network on the Places365 dataset [ 29] with an accuracy of up to 96.1% in case of binary indoor/outdoor 4.2. Implementation Details classification (see Section 5.1 for more details). Scene tags are the predicted location categories. Scene features We use the same feature-vector size (i.e. 2,048) in order are made in the same style as their object counterparts, to match the size of VinVL. These features are then i.e., as a 2,048 feature vector (obtained via Global Aver- concatenated with positions and sizes, as described in age Pooling) concatenated with top-left & bottom-right Section 4.1. The models used from the Timm library [30] corners, and height & width. Besides, the novel scene are: resnext50d_32x4d [13], gluon_inception_v3 [15], morepresentations are prepended before the object ones so bilenetv3_small_100 [18], gc_eficientnetv2_rw_t [ 20], that the scenes in the embeddings always have the same vit_large_patch16_224_in21k [21], and swin_base_ position for each image-text pair input, as outlined in patch4_window7_224_in22k [22]. All models are Figure 2. ifne-tuned for 20 epochs with SGD and Focal Loss. We

Even though we do not perform pre-training on var- use an initial learning rate of 0.01 and we reduce it with ious tasks with the new representation, in general, the a plateau scheduler. The batch size is 64 with 2 yet-established pre-training objective of Oscar/VinVL [6] accumulation steps. We use horizontal flip (probability can be followed. The change is only in the definition of of 50%), random resized crop (scale from 0.8 to 1.0), and the (, , ) triple input, where is the word embedding random brightness contrast (probability of 20%).

By fusing Equation 2 and 4, or 2 and 3, the full pretraining objective is:

Features

Network Embeddings

Data Modality Dictionary

Contrastive Loss Masked Token Loss

Multi-Layer Transformers [CLS]

Is [MASK] indoors or outdoors ? [SEP] Lriovoinmg Indoors Tree

Dog Snowman [SEP] Word Tokens

Scene & Object Tags

Scene & Region Features Language

Image Language

Image scene features , region features ], where word tokens, object tags and region features are taken from VinVL [6]. Scene tags and features are proposed to improve the alignment of cross-domain semantics. The example shows a case where only detected objects could be classified as outdoors rather than indoors.

In the case of the VL model, we use the pre-trained

Oscar+BASE with VinVL features and follow their presented procedure which is the same as the original Oscar, i.e., pre-training on the unbalanced "all-split" of the GQA dataset for 5 epochs, and fine-tune the best model with respect to overall accuracy on the "balanced-split" for 2 epochs. All the results are shown in Section 5.2 together with a reproduction of VinVL that we improve.

5. Experiments Our approach is divided into two separate steps. First, we

adapt several image classification models to the Places365 dataset and select the most accurate model to generate a visual representation for the VL model. Then, we finetune the VL model using its original and our new visual features. which is 1.1% higher than that of the second-best ViTLarge. Therefore, this model is further used to extract novel visual representations for our VinVL+L.

5.2. Visual Question Answering

5.1. Location Recognition Statistical significance of novel features We show the advantages of the new visual representations by comWe selected several pre-trained image classification net- paring our method with the reproduced VinVL using the works in order to cover a certain range of diferent ap- same training pipeline – see Table 5. The used scene tags proaches to location recognition. These methods include as 365 location categories (C), or indoors/outdoors (IO), those focused on high inference speed, methods con- are denoted in subscripts of the model name. Besides, we taining skip-connections, parallel paths, or transformers. compute the statistical significance [ 33] between the two Results of fine-tuned models on the Places365 dataset models to show that recognizing the location categories are shown in Table 2. ResNeXt-50, EficientNetV2, and truly brings benefits and it is not just a coincidence. For ViT-Large have similar performance, while ViT-Large demonstration, we compare the reproduced VinVL with performs slightly worse in indoors/outdoors classifica- our VinVL+LC on the validation dataset. Our goal is to tion. It is because when the ViT-Large is not right, it is reject the null hypothesis defined as "there is no difermore often the incorrect indoors/outdoors supercategory ence between system A and B". To do this, we shufle the than in the case of the previous two mentioned models. predictions between systems A and B with a probability The best results are achieved by Swin-Base in both 365 of 50%, and we compare the performance with the initial locations and binary indoors/outdoors recognition. It one (all repeated 10,000 times). Consequently, we reject obtains 56% top-1 accuracy in recognizing 365 locations, the null hypothesis at the 95% significance level, i.e., a Table 5 We selected the best model with respect to overall accuAccuracy of answers on the validation dataset. We evalu- racy on the validation set, and we pushed the results into ate the reproduced VinVL with our improved versions on the the evaluation server. The performances of the models balanced validation GQA dataset. are listed in Table 4. The reproduced version of VinVL Backbone Accuracy Binary Open still has worse performance than the original one, but VinVL (reproduced) 63.2 52.5 82.3 the diference is decreased with this modification of the VinVL+LC+IO 63.4 52.7 82.3 training.

VinVL+LC 63.8 53.0 83.0 According to the results, all of our models answer more VinVL+LIO 64.1 53.7 82.6 accurately and outperform the reproduced model in all metrics, except in some cases of Consistency and Distribution. For example, even the VinVL+LC answers 0.40% threshold is equal to 0.05, with obtained - = 0.03. better on open questions and 0.23% better on yes/no quesThe same conclusion is reached for VinVL+LIO. In the tions, resulting in 0.32% higher overall accuracy, it has case of the VinVL+LC+IO, the diference is not significant, 0.14% lower performance in Consistency metric. This so the null hypothesis cannot be rejected. means that when our model fails, the prediction is truly

The significance may seem small from a general point meaningless to the given question. However, this model of view. However, it should be considered that these shows the best performance compared with other verresults were achieved by simply adding locations to the sions of VinVL+L: VinVL+LIO holds only the highest Consystem. To improve significance, scene features should sistency (+0.03% compared with reproduced VinVL and be generated from the same model as region features. In +0.17% compared with VinVL+LC), and VinVL+LC+IO outaddition, other global information such as weather may performs all compared models in Validation and Distribube included. tion. We show the results of the prior State-of-the-Arts in Table 3. Even though, our VinVL+L method noticeably surpasses the original version in the primary metric: +0.20% of overall accuracy for VinVL+LC.

Comparison on the test set Although we followed the original training pipeline, on which the results of our models are based, it should be noted that the reproduced VinVL works worse than the original version. There- 5.3. Summary and Discussion fore, we decided to select models after the 1st, 3rd, and 5th epochs from the pre-training on the unbalanced set. An improvement in the visual question answering is Then we fine-tuned these models for 2 epochs on the achieved by taking global information about the visual balanced set to slightly increase the final performance. component into account. Table 4 and 5 confirm this fact Who is wearing the dress?

What is the woman doing?

Inside what is the pizza?

What is inside the container next to the glass? VinVL Woman (our) VinVL+L

Woman

VinVL Walking (our) VinVL+L

Sitting

VinVL Box (our) VinVL+L

Box

VinVL Straw (our) VinVL+L Ice cream GT

Women

Resting

Pizza bosx

PacPket for all our VinVL+L models. In addition, we show the the used novel visual representations. Therefore, our arwrong predictions of our VinVL+L (along with predic- ticle only shows the efectiveness of incorporating global tions of reproduced VinVL) against the Ground Truth location information into a system that works only on labels. The image-question pairs are randomly chosen the basis of objects. from the validation set, see Figure 3. Even if our model answers are wrong in the given examples, it is worth saying that some of the answers are not truly wrong, e.g., in 6. Conclusion the second example, in which the woman is truly sitting and, in our opinion, there is missing additional informa- This paper presents VinVL+L, an enriched version of the tion to say if she is really resting, instead of just sitting. VinVL with location context as a novel visual representaBesides these examples, we show predictions from the tion. We generate the new representat ions as scene tags test2019 set in Appendix A. and features and we prepend them before the original em

It is worth emphasizing that the listed models do not beddings of the architecture. Our version achieves higher use scene features, only tags. A model using both scene overall accuracy than the original method on the GQA tags and features did not achieve the expected results. dataset, and we show that global information about the This behavior was anticipated for two reasons. First, entire image influences the answers and thus should not even if we follow the generating procedure of the scene be ignored. The best results of 64.85% overall accuracy features, the VL model obtains a vector with diferent are achieved with the model using 365 location categories semantics compared to region features. To solve this as scene tags. Besides, we performed an Approximate issue, the scene features must be generated from the Randomization test to verify that the achieved results are same model to avoid subsequent confusion. Second, all statistically significant. Similarly, weather recognition image and text representations are passed to the modified for outdoor scenes could be included in the concept to BERT model, which is still a language model pre-trained help the network with alignments of image-text pairs on text corpora, with additional visual features added. with respect to global information. All generated data Therefore, the words still have a higher weight than the and code are publicly available on our GitHub. visual features.

Regarding the performance of the reproduced VinVL, Acknowledgments we used the original code including the pipeline presented in [6]. However, the network reproduced by us The work has been supported by the grant of the Univerachieved worse performance in all metrics, e.g., 0.12% in sity of West Bohemia, project No. SGS-2022-017. Comoverall accuracy. Since the main goal is to improve this putational resources were supplied by the project "emethod, we decided to primarily compare our models Infrastruktura CZ" (e-INFRA CZ LM2018140) supported with the reproduced version, on which the benefits are by the Ministry of Education, Youth and Sports of the best observed. All the listed models were trained using Czech Republic. the same device, hyperparameters settings, only difer in 1–9

A. Additional prediction examples

Where is she sitting? Where is the umpire? Is there any grass in the scene that is brown? What place is the photo at? Are there either any cars or vehicles in the image? Is there a bird or a cat that is sitting? What is the girl sitting on?

pp. 1492 - 1500 .

[12]

Szegedy ,

Iofe ,

Vanhoucke ,

A. A.

Alemi , [1]

Ren ,

He ,

Girshick ,

Sun , Faster r-cnn: Inception-v4, inception-resnet and the impact of

posal networks , in: C. Cortes , N. D. Lawrence , D. D. AAAI conference on artificial intelligence , 2017 .

Lee , M.

Sugiyama , R. Garnett (Eds.), Advances in [13] T.

He , Z.

Zhang , H.

Zhang , Z.

Zhang , J. Xie, M.

Li ,

Neural Information Processing Systems 28 , Curran

Bag

of tricks for image classification with convo-

Associates , Inc., 2015 , pp. 91 - 99 . lutional neural networks , in: Proceedings of the [2]

Devlin , M.-

Chang ,

Lee ,

Toutanova , IEEE/CVF Conference on Computer Vision and Pat-

Bert: Pre-training of deep bidirectional transform- tern Recognition , 2019 , pp. 558 - 567 .

ers for language understanding , arXiv preprint [14]

Szegedy , W. Liu,

Jia ,

Sermanet , S. Reed,

arXiv: 1810 . 04805 ( 2018 ). D. Anguelov , D.

Erhan , V.

Vanhoucke , A . Rabi[3]

Li ,

Yin ,

Li ,

Zhang ,

Hu , L. Zhang, novich, Going deeper with convolutions , in: Pro-

Object-semantics aligned pre-training for vision- and pattern recognition , 2015 , pp. 1 - 9 .

language tasks , in: European Conference on Com- [15]

Szegedy ,

Vanhoucke ,

Iofe ,

Shlens , Z . Wo-

puter Vision , Springer, 2020 , pp. 121 - 137 . jna, Rethinking the inception architecture for com [4]

Chen ,

Gan ,

Li , Y. Cheng, W. Wang, J. Liu, puter vision , in: Proceedings of the IEEE conference

Meta module network for compositional visual rea- on computer vision and pattern recognition, 2016 ,

soning, in: Proceedings of the IEEE/CVF Winter pp. 2818 - 2826 .

Conference on Applications of Computer Vision , [16]

A. G.

Howard ,

Zhu ,

Chen ,

Kalenichenko ,

2021 , pp. 655 - 664 . W. Wang,

Weyand ,

Andreetto ,

Adam , Mo[5]

Anderson ,

He ,

Buehler ,

Teney , M.

John- bilenets: Eficient convolutional neural networks

attention for image captioning and visual question arXiv:1704.04861 ( 2017 ).

answering, in: Proceedings of the IEEE conference [17]

Sandler ,

Howard ,

Zhu ,

Zhmoginov , L.-

on computer vision and pattern recognition, 2018, C. Chen, Mobilenetv2: Inverted residuals and linear

pp. 6077 - 6086 . bottlenecks, in: Proceedings of the IEEE conference [6]

Zhang ,

Li ,

Hu ,

Yang ,

Zhang , L. Wang, on computer vision and pattern recognition, 2018 ,

Choi ,

Gao , Vinvl: Revisiting visual representa- pp. 4510 - 4520 .

tions in vision-language models , in: Proceedings [18]

Howard ,

Sandler ,

Chu , L.- C. Chen , B. Chen ,

and Pattern

Recognition , 2021 , pp. 5579 - 5588 . et al., Searching for mobilenetv3 , in: Proceedings [7]

Kamath ,

Singh ,

LeCun , G. Synnaeve, of the IEEE/CVF International Conference on Com-

Misra ,

Carion , Mdetr-modulated detection puter Vision , 2019 , pp. 1314 - 1324 .

for end-to-end multi-modal understanding , in: Pro- [19]

Tan ,

Le , Eficientnet: Rethinking model scal-

on Computer Vision , 2021 , pp. 1780 - 1790 . national Conference on Machine Learning, PMLR, [8]

Tan ,

Bansal , Lxmert: Learning cross-modality 2019 , pp. 6105 - 6114 .

encoder representations from transformers , arXiv [20]

Tan ,

Le , Eficientnetv2: Smaller models and

preprint arXiv: 1908 . 07490 ( 2019 ). faster training , in: International Conference on [9]

Krizhevsky , I. Sutskever,

G. E.

Hinton , Imagenet Machine Learning, PMLR , 2021 , pp. 10096 - 10106 .

classification with deep convolutional neural net- [21]

Dosovitskiy ,

Beyer ,

Kolesnikov , D. Weis-

mation Processing Systems 25 , Curran

Associates

Houlsby , An image is worth 16x16 words: Trans-

Inc. , 2012 , pp. 1097 - 1105 . formers for image recognition at scale , in: Inter[10]

He ,

Zhang , S. Ren,

Sun , Deep residual learn- national Conference on Learning Representations,

ing for image recognition , in: Proceedings of the Vienna , 2021 .

IEEE conference on computer vision and pattern [22]

Liu ,

Lin ,

Cao ,

Hu ,

Wei ,

Zhang , S. Lin,

recognition , 2016 , pp. 770 - 778 . B. Guo , Swin transformer: Hierarchical vision trans [11]

Xie ,

Girshick ,

Dollár ,

Tu ,

He , Ag- former using shifted windows , in: Proceedings of

networks, in: Proceedings of the IEEE conference puter Vision , 2021 , pp. 10012 - 10022 .

on computer vision and pattern recognition, 2017 , [23]

Redmon ,

Divvala ,

Girshick ,

Farhadi , You

vision and pattern recognition , 2016 , pp. 779 - 788 . [24]

Bochkovskiy , C.-Y. Wang, H. -Y. M. Liao , Yolov4:

arXiv preprint arXiv: 2004 . 10934 ( 2020 ). [25]

Liu ,

Ott ,

Goyal ,

Du ,

Joshi , D. Chen,

proach , arXiv preprint arXiv: 1907 . 11692 ( 2019 ). [26]

D. A.

Hudson ,

C. D.

Manning , Gqa: A new

pattern recognition , 2019 , pp. 6700 - 6709 . [27]

Goyal ,

Khot ,

Summers-Stay ,

Batra ,

and Pattern Recognition (CVPR) , 2017 . [28]

Ren ,

Kiros ,

Zemel , Exploring models and

neural information processing systems 28 ( 2015 ). [29]

Zhou ,

Lapedriza ,

Khosla ,

Oliva , A . Tor-

ralba, Places: A 10 million image database for scene

and Machine Intelligence ( 2017 ). [30]

Wightman , Pytorch image models, https:

2019. doi: 10 .5281/zenodo.4414861. [31]

Kamath ,

Singh ,

LeCun , G. Synnaeve,

on Computer Vision (ICCV), 2021 , pp. 1780 - 1790 . [32]

Hudson ,

C. D.

Manning , Learning by abstrac-

Information Processing Systems 32 ( 2019 ). [33]

Riezler , J. T. Maxwell

III

, On some pitfalls in

translation and/or summarization, 2005 , pp. 57 - 64 .