<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VinVL+L: Enriching Visual Representation with Location Context in VQA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Vyskočil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Picek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia</institution>
          ,
          <addr-line>Technická 8, Pilsen</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe a novel method - VinVL+L - that enriches the visual representations (i.e. object tags and region features) of the State-of-the-Art Vision and Language (VL) method - VinVL - with Location information. To verify the importance of such metadata for VL models, we (i) trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features; both were made public to allow reproducibility and further experiments, (ii) did an architectural update to the existing VinVL method to include the new feature sets, and (iii) provide a qualitative and quantitative evaluation. By including just binary location metadata, the VinVL+L method provides incremental improvement to the State-of-the-Art VinVL in Visual Question Answering (VQA). The VinVL+L achieved an accuracy of 64.85% and increased the performance by +0.32% in terms of accuracy on the GQA dataset; the statistical significance of the new representations is verified via Approximate Randomization. The code and newly generated sets of features are available at https://github.com/vyskocj/VinVL-L.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Vision and Language</kwd>
        <kwd>Visual Question Answering</kwd>
        <kwd>Location Recognition</kwd>
        <kwd>Oscar</kwd>
        <kwd>VinVL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multi-modal understanding systems can answer general
questions from visual and textual data. These questions
are largely focused on objects and their relations,
appearances, or behaviors. Rest of them are asked about the
overall scene, such as location or weather. Most of
multimodal systems are split into visual and textual modules,
followed by image-text alignment. Faster R-CNN [1]
region features of detected objects are commonly used for +
visual representation and BERT [2] embeddings for the Where is it?
textual. However, such visual model only provides
information about objects, from which the entire multi-modal VinVL (our) VinVL+L
system must decide simple questions like "Are people shop (24.6%) bedroom (59.9%)
inside or outside?". store (22.5%) living room (15.1%)</p>
      <p>We intuitively feel that in general, the objects are re- porch (9.0%) hotel room (2.3%)
lated to indoor/outdoor scene division even if they cannot
be directly assigned. They have a certain weight on the Figure 1: Example predictions of the proposed VinVL+L.
basis of which the correct answer can be decided. For We compare VinVL+L with the State-of-the-Art VinVL on the
example, cars, sky, and trees are more likely to belong randomly selected input pair (i.e. image and question) from
to an outdoor scene, however, the scene may be indoors, the GQA test set. The VinVL+L better aligns the answer to
and these categories can be detected through the garage the question thanks to the enriched visual features.
door. In addition to [3, 4, 5], the mentioned paradigm
of splitting image-text modules follows VinVL [6] based
on Oscar [3] that additionally adds object tags, i.e., tex- features. However, a clear cross-modal representation of
tual output from an Object Detection network, to region the scene is still missing, which can harm the network,
as shown in Figure 1.
26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Our method, based on VinVL, brings a new
represen*KCleobrerre(sepdosn.)d,iKnrge maust,hLoorw.er Austria, Austria, Feb. 15-17, 2023 tation including information about the location into the
$ vyskocj@kky.zcu.cz (J. Vyskočil) system. This representation is obtained using a
classifi0000-0002-6443-2051 (J. Vyskočil); 0000-0002-6041-9722 cation network trained on the Places365 dataset having a
(L. Picek) total of 365 location categories. All of these categories are
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License directly split into one of the indoor and outdoor
supercatCPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
egories. All of these labels are then passed as scene tags BERT-based VL Methods End-to-end methods such
to our VinVL+L method to predict the answers. Besides, as MDETR [7] use a pre-trained image classification
we utilize scene features that are generated in the same backbone to extract features and concatenate them with
way as the region features of Oscar/VinVL. Finally, we word embeddings taken from a BERT-based model [2, 25].
evaluate influences on answers while using these novel- However, some existing VL methods [3, 4] reuse the
exties. An example of the top 3 predictions of the VinVL and tracted features from another approach, e.g., a bottom-up
our VinVL+L is visualized in Figure 1. More examples are mechanism [5] that extracts object regions via Faster
Rshown in Section 5.3 and Appendix A. Our contributions CNN, to fine-tune a novel method with an unchanged
are: visual model. These methods include Oscar [3], which
introduces object tags as cross-modal representation to
• We enrich visual representations of the VinVL improve the alignment of the image-text pairs. Based
using the global information about the image - on Oscar, VinVL [6] improves visual representation by
location. pre-training larger model on multiple object detection
• We present the efectiveness of each new cross- datasets. Since this method holds State-of-the-Art results
modal representation as we compare their related on the GQA dataset [26] and represents the image as a
models including a reproduced version of the set of regional features while suppressing global scene
VinVL. information, we decided to improve the alignment of the
• We improve the VinVL in visual question answer- cross-modal representation by location recognition.
ing (VQA) with an overall accuracy of 64.85% on
the GQA dataset.
• We provide data with the location context that 3. Datasets
we generated for the GQA dataset.</p>
      <sec id="sec-1-1">
        <title>The early datasets, such as VQA [27] and COCO-QA [28],</title>
        <p>contain only the core annotation needed for the vision
2. Related Work question answering: an image, a question, and a desired
one-word answer. However, we are interested in dataset
Many Vision and Language (VL) methods, like [3, 6, 7, 8], containing richer annotations to recognize types of
locafocus on pre-training generic models by combining mul- tions in the image input. It does contain the GQA dataset,
tiple datasets from diferent tasks. Then the models are but only for part of the images. Therefore, there are two
ifne-tuned to downstream tasks that include: image cap- existing datasets Places365 and GQA suitable for our task.
tioning, visual reasoning, or visual question answering. Both datasets are thoroughly described below.
In this section, we briefly review recent approaches to VL
tasks and their commonly used Vision Encoders, which
are the most relevant for our work.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Vision Encoders Convolutional Neural Networks</title>
        <p>
          (CNNs) gained popularity in image classification when
AlexNet [9] won the ImageNet 2012 competition. In the
subsequent period, models with skip-connections [10, GQA dataset [26] This dataset consists of 22,669,678
11, 12, 13] with blocks having small feed-forward net- quest
          <xref ref-type="bibr" rid="ref24">ions (from which the test2019</xref>
          split contains
works in parallel connections [12, 14, 15], or with a fo- 4,237,524 questions) over 113,018 images with 1,878
poscus on optimization [16, 17, 18, 19, 20] were created. In sible answers to open and binary yes/no questions. In
recent years, Transformer-based methods, such as Vi- addition to questions and answers, each image contains
sion Transformer [21], or its modification with shifted annotations of objects, the relations between them, and
windows [22], gained favor thanks to computational e-fi their attributes. Besides, each image contains global
inciency and accuracy. These image classification models formation in the form of location and weather, the
disare often used as backbone architectures in object detec- tribution of which is shown in Table 1. Regarding the
tion to predict bounding boxes with a classification of evaluation of the results, the following metrics are used:
each object in the image. The most popular detectors
are the one-shot Yolo-based architectures [23, 24] and • Accuracy – overall accuracy, primary metric,
two-shot Faster R-CNN-based architectures [1], which • Binary – accuracy of yes/no questions,
are generally slower but more accurate than the one-shot • Open – accuracy of open questions,
ones. The image classification or object detection models • Consistency – overall accuracy including
equivaare further used as Visual Encoders in the VL tasks. lent answers,
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Places365 [29] This dataset consists of 365 location</title>
        <p>categories that we can directly map to indoors/outdoors
category. The balanced training set varies from 3,068 to
5,000 images per location category, while the validation
set consists of 50 images per category.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <p>The Vision and Language (VL) approaches are commonly
divided into two phases: pre-training and fine-tuning. In
pre-training, multiple datasets of diferent tasks are
combined to create generic models. In fine-tuning, these
models are then trained on each of these datasets, called
downstream tasks. In this study, we focus on improving the
current State-of-the-Art VinVL [6] in GQA dataset [26].</p>
      <p>This improved version learns the image-text
representation with respect to the global information of an entire
image, such as indoors/outdoors, which is given by novel
scene tags and features.
1–9
(1)
(2)
 ≜ [, , ,, , , ,, , ,, ,]  [ ,, , , ,, , , ]</p>
      <p>⏟ &amp;⏞ ⏟⏞ ⏟⏞ ⏟&amp;⏞
⏟ ⏞   ⏟ ⏞  
where Dictionary View defines Masked Token Loss
ℒ , applied on the discrete token sequence ℎ ≜
[, ], to predict the masked tokens ℎ based on their
surrounding tokens ℎ∖:</p>
      <p>ℒ  = − E(ℎ,)∼ log (ℎ|ℎ∖, )</p>
      <p>Modality View defines Contrastive Loss ℒ for the
image representation ℎ′ ≜ [, ], which is "polluted" by
randomly replacing  with another sequence of tags from
the dataset . To distinguish the original pair ( = 1)
from the polluted one ( = 0), a binary classifier  (.) as
a fully-connected layer is applied on the top of the [CLS]
token. This loss function is defined as [3]:
ℒ = − E(,ℎ′;)∼ log (| (, ℎ))</p>
      <p>(3)</p>
      <p>Alternatively, VinVL [6] applies the 3-way
Constrastive Loss ℒ3 on ℎ* ≜ [, , ], instead of the
binary ℒ used in Oscar [3], to predict whether the
(, , ) triplet is the original one ( = 0), contains a
polluted  ( = 1), or contains a polluted  ( = 2):
ℒ3 = − E(ℎ* ;)∼ log (| (, , ))
(4)
• Plausibility – relative number of answers making sequence of the text,  is the word embedding sequence
sense with respect to the dataset, of the scene and object tags detected from the image, and
• Validity – relative number of answers that are in  is the visual embedding sequence of the entire image
the question scope, and all detected regions. This input can be viewed from
• Distribution – overall match between the distri- two diferent perspectives as [3, 6]:</p>
      <p>butions of true answers and model predictions.</p>
      <sec id="sec-2-1">
        <title>4.1. Adding locations to VinVL</title>
        <p>Based on VinVL [6], we present an extended architecture
with scene tags and features. In our work, these
representations are simply generated using a fine-tuned clas- ℒ −  = ℒ + ℒ3 ( ) (5)
sification network on the Places365 dataset [ 29] with an
accuracy of up to 96.1% in case of binary indoor/outdoor 4.2. Implementation Details
classification (see Section 5.1 for more details). Scene
tags are the predicted location categories. Scene features We use the same feature-vector size (i.e. 2,048) in order
are made in the same style as their object counterparts, to match the size of VinVL. These features are then
i.e., as a 2,048 feature vector (obtained via Global Aver- concatenated with positions and sizes, as described in
age Pooling) concatenated with top-left &amp; bottom-right Section 4.1. The models used from the Timm library [30]
corners, and height &amp; width. Besides, the novel scene are: resnext50d_32x4d [13], gluon_inception_v3 [15],
morepresentations are prepended before the object ones so bilenetv3_small_100 [18], gc_eficientnetv2_rw_t [ 20],
that the scenes in the embeddings always have the same vit_large_patch16_224_in21k [21], and swin_base_
position for each image-text pair input, as outlined in patch4_window7_224_in22k [22]. All models are
Figure 2. ifne-tuned for 20 epochs with SGD and Focal Loss. We</p>
        <p>Even though we do not perform pre-training on var- use an initial learning rate of 0.01 and we reduce it with
ious tasks with the new representation, in general, the a plateau scheduler. The batch size is 64 with 2
yet-established pre-training objective of Oscar/VinVL [6] accumulation steps. We use horizontal flip (probability
can be followed. The change is only in the definition of of 50%), random resized crop (scale from 0.8 to 1.0), and
the (, , ) triple input, where  is the word embedding random brightness contrast (probability of 20%).</p>
        <sec id="sec-2-1-1">
          <title>By fusing Equation 2 and 4, or 2 and 3, the full pretraining objective is:</title>
          <p>Features</p>
          <p>Network
Embeddings</p>
          <p>Data
Modality
Dictionary</p>
          <p>Contrastive Loss Masked Token Loss</p>
          <p>Multi-Layer Transformers
[CLS]</p>
          <p>Is [MASK] indoors or outdoors ?
[SEP] Lriovoinmg Indoors Tree</p>
          <p>Dog Snowman [SEP]
Word Tokens</p>
          <p>Scene &amp; Object Tags</p>
          <p>Scene &amp; Region Features
Language</p>
          <p>Image
Language</p>
          <p>Image
scene features , region features ], where word tokens, object tags and region features are taken from VinVL [6]. Scene tags
and features are proposed to improve the alignment of cross-domain semantics. The example shows a case where only detected
objects could be classified as outdoors rather than indoors.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>In the case of the VL model, we use the pre-trained</title>
          <p>Oscar+BASE with VinVL features and follow their
presented procedure which is the same as the original Oscar,
i.e., pre-training on the unbalanced "all-split" of the GQA
dataset for 5 epochs, and fine-tune the best model with
respect to overall accuracy on the "balanced-split" for 2
epochs. All the results are shown in Section 5.2 together
with a reproduction of VinVL that we improve.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <sec id="sec-3-1">
        <title>Our approach is divided into two separate steps. First, we</title>
        <p>adapt several image classification models to the Places365
dataset and select the most accurate model to generate
a visual representation for the VL model. Then, we
finetune the VL model using its original and our new visual
features.
which is 1.1% higher than that of the second-best
ViTLarge. Therefore, this model is further used to extract
novel visual representations for our VinVL+L.</p>
        <sec id="sec-3-1-1">
          <title>5.2. Visual Question Answering</title>
          <p>5.1. Location Recognition Statistical significance of novel features We show
the advantages of the new visual representations by
comWe selected several pre-trained image classification net- paring our method with the reproduced VinVL using the
works in order to cover a certain range of diferent ap- same training pipeline – see Table 5. The used scene tags
proaches to location recognition. These methods include as 365 location categories (C), or indoors/outdoors (IO),
those focused on high inference speed, methods con- are denoted in subscripts of the model name. Besides, we
taining skip-connections, parallel paths, or transformers. compute the statistical significance [ 33] between the two
Results of fine-tuned models on the Places365 dataset models to show that recognizing the location categories
are shown in Table 2. ResNeXt-50, EficientNetV2, and truly brings benefits and it is not just a coincidence. For
ViT-Large have similar performance, while ViT-Large demonstration, we compare the reproduced VinVL with
performs slightly worse in indoors/outdoors classifica- our VinVL+LC on the validation dataset. Our goal is to
tion. It is because when the ViT-Large is not right, it is reject the null hypothesis defined as "there is no
difermore often the incorrect indoors/outdoors supercategory ence between system A and B". To do this, we shufle the
than in the case of the previous two mentioned models. predictions between systems A and B with a probability
The best results are achieved by Swin-Base in both 365 of 50%, and we compare the performance with the initial
locations and binary indoors/outdoors recognition. It one (all repeated 10,000 times). Consequently, we reject
obtains 56% top-1 accuracy in recognizing 365 locations, the null hypothesis at the 95% significance level, i.e., a
Table 5 We selected the best model with respect to overall
accuAccuracy of answers on the validation dataset. We evalu- racy on the validation set, and we pushed the results into
ate the reproduced VinVL with our improved versions on the the evaluation server. The performances of the models
balanced validation GQA dataset. are listed in Table 4. The reproduced version of VinVL
Backbone Accuracy Binary Open still has worse performance than the original one, but
VinVL (reproduced) 63.2 52.5 82.3 the diference is decreased with this modification of the
VinVL+LC+IO 63.4 52.7 82.3 training.</p>
          <p>VinVL+LC 63.8 53.0 83.0 According to the results, all of our models answer more
VinVL+LIO 64.1 53.7 82.6 accurately and outperform the reproduced model in all
metrics, except in some cases of Consistency and
Distribution. For example, even the VinVL+LC answers 0.40%
threshold is equal to 0.05, with obtained - = 0.03. better on open questions and 0.23% better on yes/no
quesThe same conclusion is reached for VinVL+LIO. In the tions, resulting in 0.32% higher overall accuracy, it has
case of the VinVL+LC+IO, the diference is not significant, 0.14% lower performance in Consistency metric. This
so the null hypothesis cannot be rejected. means that when our model fails, the prediction is truly</p>
          <p>The significance may seem small from a general point meaningless to the given question. However, this model
of view. However, it should be considered that these shows the best performance compared with other
verresults were achieved by simply adding locations to the sions of VinVL+L: VinVL+LIO holds only the highest
Consystem. To improve significance, scene features should sistency (+0.03% compared with reproduced VinVL and
be generated from the same model as region features. In +0.17% compared with VinVL+LC), and VinVL+LC+IO
outaddition, other global information such as weather may performs all compared models in Validation and
Distribube included. tion. We show the results of the prior State-of-the-Arts
in Table 3. Even though, our VinVL+L method
noticeably surpasses the original version in the primary metric:
+0.20% of overall accuracy for VinVL+LC.</p>
          <p>Comparison on the test set Although we followed
the original training pipeline, on which the results of our
models are based, it should be noted that the reproduced
VinVL works worse than the original version. There- 5.3. Summary and Discussion
fore, we decided to select models after the 1st, 3rd, and
5th epochs from the pre-training on the unbalanced set. An improvement in the visual question answering is
Then we fine-tuned these models for 2 epochs on the achieved by taking global information about the visual
balanced set to slightly increase the final performance. component into account. Table 4 and 5 confirm this fact
Who is wearing the dress?</p>
          <p>What is the woman doing?</p>
          <p>Inside what is the pizza?</p>
          <p>What is inside the container next to the glass?
VinVL
Woman
(our) VinVL+L</p>
          <p>Woman</p>
          <p>VinVL
Walking
(our) VinVL+L</p>
          <p>Sitting</p>
          <p>VinVL
Box
(our) VinVL+L</p>
          <p>Box</p>
          <p>VinVL
Straw
(our) VinVL+L
Ice cream
GT</p>
          <p>Women</p>
          <p>GT</p>
          <p>Resting</p>
          <p>GT</p>
          <p>Pizza bosx</p>
          <p>GT</p>
          <p>
            PacPket
for all our VinVL+L models. In addition, we show the the used novel visual representations. Therefore, our
arwrong predictions of our VinVL+L (along with predic- ticle only shows the efectiveness of incorporating global
tions of reproduced VinVL) against the Ground Truth location information into a system that works only on
labels. The image-question pairs are randomly chosen the basis of objects.
from the validation set, see Figure 3. Even if our model
answers are wrong in the given examples, it is worth
saying that some of the answers are not truly wrong, e.g., in 6. Conclusion
the second example, in which the woman is truly sitting
and, in our opinion, there is missing additional informa- This paper presents VinVL+L, an enriched version of the
tion to say if she is really resting, instead of just sitting. VinVL with location context as a novel visual
representaBesides these examples, we show predictions from the tion. We generate the new representat
            <xref ref-type="bibr" rid="ref24">ions as scene tags
test2019</xref>
            set in Appendix A. and features and we prepend them before the original
em
          </p>
          <p>It is worth emphasizing that the listed models do not beddings of the architecture. Our version achieves higher
use scene features, only tags. A model using both scene overall accuracy than the original method on the GQA
tags and features did not achieve the expected results. dataset, and we show that global information about the
This behavior was anticipated for two reasons. First, entire image influences the answers and thus should not
even if we follow the generating procedure of the scene be ignored. The best results of 64.85% overall accuracy
features, the VL model obtains a vector with diferent are achieved with the model using 365 location categories
semantics compared to region features. To solve this as scene tags. Besides, we performed an Approximate
issue, the scene features must be generated from the Randomization test to verify that the achieved results are
same model to avoid subsequent confusion. Second, all statistically significant. Similarly, weather recognition
image and text representations are passed to the modified for outdoor scenes could be included in the concept to
BERT model, which is still a language model pre-trained help the network with alignments of image-text pairs
on text corpora, with additional visual features added. with respect to global information. All generated data
Therefore, the words still have a higher weight than the and code are publicly available on our GitHub.
visual features.</p>
          <p>Regarding the performance of the reproduced VinVL, Acknowledgments
we used the original code including the pipeline
presented in [6]. However, the network reproduced by us The work has been supported by the grant of the
Univerachieved worse performance in all metrics, e.g., 0.12% in sity of West Bohemia, project No. SGS-2022-017.
Comoverall accuracy. Since the main goal is to improve this putational resources were supplied by the project
"emethod, we decided to primarily compare our models Infrastruktura CZ" (e-INFRA CZ LM2018140) supported
with the reproduced version, on which the benefits are by the Ministry of Education, Youth and Sports of the
best observed. All the listed models were trained using Czech Republic.
the same device, hyperparameters settings, only difer in
1–9</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Additional prediction examples</title>
      <p>Where is she sitting?
Where is the umpire?
Is there any grass in the scene that is brown?
What place is the photo at?
Are there either any cars or vehicles in the image?
Is there a bird or a cat that is sitting?
What is the girl sitting on?</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          pp.
          <fpage>1492</fpage>
          -
          <lpage>1500</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iofe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          , [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Faster</surname>
          </string-name>
          r-cnn:
          <article-title>Inception-v4, inception-resnet and the impact of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>posal networks</article-title>
          , in: C.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>N. D.</given-names>
          </string-name>
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , D.
          <source>D. AAAI conference on artificial intelligence</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sugiyama</surname>
            , R. Garnett (Eds.), Advances in [13]
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , J. Xie,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Neural Information Processing Systems</source>
          <volume>28</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Bag</given-names>
          </string-name>
          <article-title>of tricks for image classification with convo-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Associates</surname>
          </string-name>
          , Inc.,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
          <article-title>lutional neural networks</article-title>
          , in: Proceedings of the [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , IEEE/CVF Conference on Computer Vision and Pat-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>Bert: Pre-training of deep bidirectional transform-</article-title>
          tern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>567</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>ers for language understanding</article-title>
          , arXiv preprint [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          , S. Reed,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ). D.
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Rabi[3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <article-title>Zhang, novich, Going deeper with convolutions</article-title>
          , in: Pro-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>Object-semantics aligned pre-training for vision-</article-title>
          and
          <source>pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>language tasks</article-title>
          , in: European Conference on Com- [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iofe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z</surname>
          </string-name>
          . Wo-
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>puter Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          . jna,
          <article-title>Rethinking the inception architecture for com</article-title>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          , Y. Cheng, W. Wang, J. Liu,
          <article-title>puter vision</article-title>
          , in: Proceedings of the IEEE conference
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>Meta module network for compositional visual rea- on computer vision</article-title>
          and pattern recognition,
          <year>2016</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          soning,
          <source>in: Proceedings of the IEEE/CVF</source>
          Winter pp.
          <fpage>2818</fpage>
          -
          <lpage>2826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>Conference on Applications of Computer Vision</source>
          , [16]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <year>2021</year>
          , pp.
          <fpage>655</fpage>
          -
          <lpage>664</lpage>
          . W. Wang,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andreetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          , Mo[5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teney</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>John- bilenets: Eficient convolutional neural networks</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>attention for image captioning</article-title>
          and
          <source>visual question arXiv:1704.04861</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          answering, in: Proceedings of the IEEE conference [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          , L.-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>on computer vision and pattern recognition, 2018, C. Chen, Mobilenetv2: Inverted residuals</article-title>
          and linear
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          pp.
          <fpage>6077</fpage>
          -
          <lpage>6086</lpage>
          . bottlenecks, in: Proceedings of the IEEE conference [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Wang,
          <article-title>on computer vision</article-title>
          and pattern recognition,
          <year>2018</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Vinvl: Revisiting visual representa- pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>tions in vision-language models</article-title>
          , in: Proceedings [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chu</surname>
          </string-name>
          , L.-
          <string-name>
            <surname>C. Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>and Pattern</given-names>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>5579</fpage>
          -
          <lpage>5588</lpage>
          . et al.,
          <article-title>Searching for mobilenetv3</article-title>
          , in: Proceedings [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , G. Synnaeve, of the IEEE/CVF International Conference on Com-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          , Mdetr-modulated
          <source>detection puter Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1314</fpage>
          -
          <lpage>1324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>for end-to-end multi-modal understanding</article-title>
          , in: Pro- [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Eficientnet: Rethinking model scal-
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1780</fpage>
          -
          <lpage>1790</lpage>
          . national Conference on Machine Learning, PMLR, [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          , Lxmert: Learning cross-modality
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>encoder representations from transformers</article-title>
          , arXiv [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Eficientnetv2: Smaller models and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          preprint arXiv:
          <year>1908</year>
          .
          <volume>07490</volume>
          (
          <year>2019</year>
          ).
          <article-title>faster training</article-title>
          , in: International Conference on [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <source>Imagenet Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10096</fpage>
          -
          <lpage>10106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>classification with deep convolutional neural net-</article-title>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>mation Processing Systems</source>
          <volume>25</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Trans-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Inc.</surname>
          </string-name>
          ,
          <year>2012</year>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
          <article-title>formers for image recognition at scale</article-title>
          , in: Inter[10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep residual learn- national
          <source>Conference on Learning Representations,</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <article-title>ing for image recognition</article-title>
          ,
          <source>in: Proceedings of the Vienna</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>IEEE conference on computer vision</article-title>
          and pattern [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Lin,
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>recognition</surname>
          </string-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . B.
          <string-name>
            <surname>Guo</surname>
          </string-name>
          , Swin transformer:
          <source>Hierarchical vision trans</source>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Ag- former using shifted windows</article-title>
          , in: Proceedings of
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          networks,
          <source>in: Proceedings of the IEEE conference puter Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>on computer vision</article-title>
          and pattern recognition,
          <year>2017</year>
          , [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , You
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          . [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          , C.-Y. Wang, H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          , Yolov4:
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>10934</volume>
          (
          <year>2020</year>
          ). [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , D. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>proach</surname>
          </string-name>
          , arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ). [26]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Gqa: A new
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <source>pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6700</fpage>
          -
          <lpage>6709</lpage>
          . [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Summers-Stay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <source>and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          . [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          , Exploring models and
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <source>neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ). [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lapedriza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Tor-
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <article-title>ralba, Places: A 10 million image database for scene</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>and Machine Intelligence</source>
          (
          <year>2017</year>
          ). [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          , Pytorch image models,
          <source>https:</source>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          2019. doi:
          <volume>10</volume>
          .5281/zenodo.4414861. [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , G. Synnaeve,
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <source>on Computer Vision</source>
          (ICCV),
          <year>2021</year>
          , pp.
          <fpage>1780</fpage>
          -
          <lpage>1790</lpage>
          . [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Learning by abstrac-
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <source>Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ). [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Riezler</surname>
          </string-name>
          , J. T.
          <string-name>
            <surname>Maxwell</surname>
            <given-names>III</given-names>
          </string-name>
          ,
          <article-title>On some pitfalls in</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          translation and/or summarization,
          <year>2005</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>