<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AIMH at MULTI-Fake-DetectIVE: System Report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Puccetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI • Area della Ricerca CNR</institution>
          ,
          <addr-line>via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report describes our contribution to the EVALITA 2023 shared task MULTI-Fake-DetectIVE, which involves the classification of news including textual and visual components. To experiment on this task we focus on textual data augmentation, extending the Italian text and the images available in the training set using machine translation models and image captioning ones. To train using different sets of input features, we use a different transformer encoder for each variant of text (Italian, English) and modality (Image). For Task 1, among the models we test, we find that using the Italian text together with its translation improves the model performance, while the captions do not provide any improvement. We test the same architecture on Task 2 as well, although in this case we achieve less satisfactory results.</p>
      </abstract>
      <kwd-group>
        <kwd>MULTI-Fake-DetectIVE</kwd>
        <kwd>Fake News</kwd>
        <kwd>Multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Misinformation, intentional or not, is a ubiquitous phenomenon in social media. Whether due to malicious intent or to scarce review, the number of outlets producing incorrect information is growing over time [<xref ref-type="bibr" rid="ref1">1</xref>]. While the only true means of protecting oneself from misinformation is the careful review of trustworthy sources, the development of sound quantitative approaches to fake news detection is a worthy endeavour.</p>
      <p>In this context there are works providing benchmark datasets for the very task of fake news detection on Twitter [<xref ref-type="bibr" rid="ref2">2</xref>]; however, the task is generally tackled in a unimodal setting where textual information is the only one examined. The MULTI-Fake-DetectIVE task [<xref ref-type="bibr" rid="ref3">3</xref>], part of the EVALITA 2023 campaign [<xref ref-type="bibr" rid="ref4">4</xref>], proposes to add multimodality by challenging participants to classify fake news using both textual and visual features.</p>
      <p>The task consists in classifying tweets reporting news about the war in Ukraine, with both textual and visual content, according to whether the reported news is true or fake. The task is subdivided into two subtasks:
• the first subtask is about detecting fake news by assigning a label among Certainly False, Probably False, Probably True, Certainly True;
• the second subtask is focused on detecting the agreement between text and image by assigning a label among Misleading, Non Misleading, Unrelated, which respectively indicate whether the contents of text and image support different interpretations, support the same interpretation, or are unrelated.</p>
      <p>To perform the task we focus on exploring the effectiveness of augmenting the dataset by adding variants of the input, extrapolated both from the existing text in Italian and from the images, leveraging the knowledge available in pre-trained models.</p>
      <p>The idea of exploiting the knowledge implicitly encoded in large pretrained models is used in several contexts with different goals, ranging from Neural Databases [<xref ref-type="bibr" rid="ref5">5</xref>] to synthetic text detection [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>The rest of the report is structured as follows: Section 2 reports relevant literature, Section 3 covers details of the dataset we found while preparing the models, Section 4 describes the system, Section 5 outlines the results we obtained, and finally in Section 6 we draw the conclusions of this work.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. State of the Art</title>
      <p>Recently, multimodal classification has been tackled with visual language models such as OSCAR [<xref ref-type="bibr" rid="ref7">7</xref>] and VinVL [<xref ref-type="bibr" rid="ref8">8</xref>], or with separate text and image encoding networks [<xref ref-type="bibr" rid="ref9">9</xref>]. Built upon the idea of creating a shared representation space between text and images, developed in CLIP [<xref ref-type="bibr" rid="ref10">10</xref>], several image captioning models have also been developed, such as CoCa [<xref ref-type="bibr" rid="ref11">11</xref>]; we experiment with these architectures for data augmentation. We could also use Multimodal Large Language Models for this same goal, i.e. augmenting data; some of the best performing ones are BLIP-2 [<xref ref-type="bibr" rid="ref12">12</xref>] and LLaVA [<xref ref-type="bibr" rid="ref13">13</xref>], but these are too computationally costly and we avoid using them. Instead, to perform data augmentation across languages, we employ Italian to English Neural Machine Translation models [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
      <p>We perform an analysis of the dataset meant to understand whether there is any task-specific preprocessing we have to apply to the data.</p>
      <p>Figure 1 shows the distribution of labels in both tasks; we notice that both have heavily unbalanced distributions. For Task 1, Figure 1 shows how (likely) True news form the majority of samples: indeed, while ubiquitous in our everyday experience on the web, (likely) Fake news are still a minority of the total information shared.</p>
      <p>Accordingly, for Task 2, Figure 1 shows that instances where image and text are heavily non-aligned are also a minority.</p>
      <p>While inspecting the Task 1 training dataset, we observe non-negligible data duplication: more specifically, 13.6% of the training samples are duplicates, which we remove. On the contrary, the dataset for Task 2 does not show any repetition.</p>
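      <p>As a minimal sketch of this deduplication step, assuming the training set is loaded as a pandas DataFrame and that duplicates are exact repetitions of a row (the file name and format here are hypothetical, not those of the official release):</p>
      <preformat>
import pandas as pd

# Load the Task 1 training set (hypothetical file name).
train = pd.read_csv("task1_train.csv")

n_before = len(train)
# Drop rows that are exact duplicates across all columns.
train = train.drop_duplicates().reset_index(drop=True)
n_after = len(train)

print(f"removed {n_before - n_after} duplicates "
      f"({(n_before - n_after) / n_before:.1%} of the training set)")
      </preformat>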
    </sec>
    <sec id="sec-3">
      <title>4. Description of the System</title>
      <p>In this section we describe the methodology we developed to tackle the MULTI-Fake-DetectIVE task. We report the choices made and the steps that led us to them. In particular, we focus on data augmentation, for which we mainly adopt two systems, working either on text or on images. Our architecture follows the one proposed by Gallo et al. [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>We focus on data augmentation because the dataset is composed of Italian texts and, since there are not many models pre-trained specifically on this language, we explore how well translating to English works. From here on, by sample we refer to the set of texts and images composing a single piece of news. Similarly, by features we indicate both texts and images.</p>
      <p>To explore several data augmentation possibilities we build a single pipeline that allows us to plug in multiple pretrained models and to process different input feature schemes, based on different sets of texts and images.</p>
      <p>Figure 2 outlines our architecture: for each input sequence or image in a sample we use a pretrained model to embed it; we then add a linear layer that maps all embeddings to the same dimension; finally, we sum all such embeddings (entry-wise) to create a shared hidden state and pass this vector through a linear layer that maps it to a vector with length equal to the number of classes, 4 for Task 1 and 3 for Task 2. During training we optimize all parameters, including those of the pretrained models.</p>
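      <p>A minimal PyTorch sketch of this fusion scheme follows. It is illustrative rather than our exact training code: the encoder checkpoints are passed in by name, the per-feature coefficients anticipate Section 4.2, and the use of the first-token embedding as the pooled representation of each encoder is our assumption.</p>
      <preformat>
import torch.nn as nn
from transformers import AutoModel

class LateFusionClassifier(nn.Module):
    """Embed each feature with its own pretrained encoder, project the
    embeddings to a shared size, sum them entry-wise with per-feature
    coefficients, and classify with a single linear layer."""

    def __init__(self, encoder_names, coefficients, shared_dim=512, n_classes=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [AutoModel.from_pretrained(name) for name in encoder_names]
        )
        self.projections = nn.ModuleList(
            [nn.Linear(enc.config.hidden_size, shared_dim) for enc in self.encoders]
        )
        self.coefficients = coefficients  # fixed scalars, one per feature
        self.head = nn.Linear(shared_dim, n_classes)

    def forward(self, batches):
        # batches[i] holds the tokenized/processed inputs for encoder i.
        hidden = 0.0
        for enc, proj, coef, batch in zip(
            self.encoders, self.projections, self.coefficients, batches
        ):
            pooled = enc(**batch).last_hidden_state[:, 0]  # first-token embedding
            hidden = hidden + coef * proj(pooled)
        return self.head(hidden)  # logits, trained with nn.CrossEntropyLoss
      </preformat>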
      <sec id="sec-3-1">
        <title>4.1. Data Augmentation</title>
          <p>The architecture we use allows us to seamlessly use as input any number of texts and images for each sample, in particular by adding extra features. We add features in two ways (see the sketch at the end of this subsection):
• we translate the textual documents to English using an open-source machine translation model [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>], in particular an Italian to English model (www.huggingface.co/Helsinki-NLP/);
• we caption the images using the image captioning model CoCa [<xref ref-type="bibr" rid="ref11">11</xref>], fine-tuned on MSCOCO [<xref ref-type="bibr" rid="ref16">16</xref>], of which we use an open source version (https://laion.ai/blog/coca/).</p>
          <p>Adding these extra inputs gives us the possibility to compose samples with different sets of features among Italian Text, English Text, English Caption, Image. We evaluate three sets of features:
• English Text, Image;
• English Text, Italian Text, Image;
• English Text, English Caption, Image.</p>
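          <p>A minimal sketch of the two augmentation steps, assuming the Helsinki-NLP Italian-to-English OPUS-MT checkpoint on Hugging Face and the open_clip release of CoCa fine-tuned on MSCOCO (the exact checkpoint tags are our assumption):</p>
          <preformat>
import torch
import open_clip
from PIL import Image
from transformers import pipeline

# 1) Translate the Italian text to English with an OPUS-MT model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-it-en")
english_text = translator("Testo della notizia in italiano.")[0]["translation_text"]

# 2) Caption the image with CoCa fine-tuned on MSCOCO (open_clip weights).
model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
image = transform(Image.open("news_image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    tokens = model.generate(image)
caption = open_clip.decode(tokens[0])  # still contains start/end special tokens
          </preformat>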
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Small Scale Ablation Study</title>
          <p>All the models we test share the same high-level architecture, shown in Figure 2: as mentioned above, we use different pretrained transformer encoders to embed the different modalities, sum all the embeddings entry-wise after mapping them to the same dimension through a linear layer, and finally, with another linear layer, we map the result to a vector whose length is the number of labels; we then compute the usual Cross Entropy loss for classification.</p>
          <p>When summing the encodings of the separate features, we multiply each of them by a coefficient, call it α (e.g., α<sub>en-text</sub> is the coefficient multiplying the embedding of the English translation), that modulates the relative importance of each feature. Similarly, each feature has its own pre-trained encoder; we use the following ones:
• ViT [<xref ref-type="bibr" rid="ref17">17</xref>], in particular the vit-large-patch32-384 version (https://huggingface.co/google/vit-large-patch32-384), to encode images;
• RoBERTa large [<xref ref-type="bibr" rid="ref18">18</xref>] to encode text in English, either the translated texts or the generated captions;
• a version of BERT-base pretrained on Italian (https://github.com/dbmdz/berts) to encode all the Italian text we use.</p>
          <p>We perform all our validation tests by splitting the training dataset into 80% training and 20% validation. The main architecture choices we make concern the shared size to which we map the embeddings output by each encoder, the α that multiplies each of the embeddings before summing them, the shape of the classification head, and the pretrained models we use. Let us list how we chose each of them:
• For the vector size, we experiment with 512 and 1024; seeing that performance does not change between these two settings, we use the smaller value, 512, in all our experiments.
• Concerning the α of each modality, we notice that α<sub>en-text</sub> is the most relevant one and, after some tests, we choose the parameters as follows: α<sub>en-text</sub> = 1.0 and all the others equal to 0.1.
• The final ablation we performed concerns the classification head, which we eventually choose to be a single linear layer with input size 512 and output size equal to the number of labels, 4 for Task 1 and 3 for Task 2. Initially, we tried a different version with two linear layers, with a tanh activation function in between and a hidden size of 2048, but this leads to lower (although comparable) performance in all our experiments.
• Similarly, while choosing the architecture, we experimented with smaller versions of each transformer encoder, namely: (a) ViT with patch16-224 instead of patch32-384; (b) roberta-base instead of roberta-large; (c) bert-base pretrained on English instead of Italian. However, while faster to train, switching any pretrained model to its smaller version reduced performance, and therefore we opt for the larger ones when performing the grid search to choose our best model.</p>
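          <p>For instance, with the choices above, the LateFusionClassifier sketched earlier in this section would be instantiated roughly as follows (the Italian BERT checkpoint tag is our assumption, based on the dbmdz release):</p>
          <preformat>
# Encoders for English text (translations and captions), Italian text, images.
model = LateFusionClassifier(
    encoder_names=[
        "roberta-large",                  # English translations and captions
        "dbmdz/bert-base-italian-cased",  # Italian text (assumed checkpoint tag)
        "google/vit-large-patch32-384",   # images
    ],
    coefficients=[1.0, 0.1, 0.1],  # alpha_en-text = 1.0, all others 0.1
    shared_dim=512,                # the selected shared embedding size
    n_classes=4,                   # Task 1; use 3 for Task 2
)
          </preformat>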
        <sec id="sec-3-2-1">
          <title>Certainly Fake</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>Probably Fake</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>Probably Real Certainly Real weighted avg support</title>
          <p>precision
recall
f1-score
16
using English Text and Images only, however this did not
seem to afect performance 6.</p>
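        <p>A minimal sketch of the grid search, where train_and_evaluate is a hypothetical helper standing in for our training and validation loop (it is not part of the report's code):</p>
        <preformat>
from itertools import product

grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "max_epochs": [3, 4, 5, 10, 20],
    "warmup_steps": [0, 100],
    "batch_size": [4, 8],
}

def train_and_evaluate(learning_rate, max_epochs, warmup_steps, batch_size):
    """Hypothetical stub: train the fusion model with these settings and
    return the score on the 20% validation split."""
    return 0.0  # replace with actual training and evaluation

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

# The report selects: learning rate 1e-5, 4 epochs, warmup 0, batch size 8.
print(best_config)
        </preformat>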
        <p>Comparing the results obtained on our validation set when using the different groups of features, we eventually choose to use only the translated text together with the images, as adding the Italian text did not appear to provide significant improvements.</p>
        <p>We tackle Task 2 keeping everything as we did for Task 1, switching only the training set.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Task 1</title>
        <p>Table 1 shows the performance of our approach on the first task, reporting per-class precision, recall, f1-score and support for Certainly Fake, Probably Fake, Probably Real and Certainly Real, together with the weighted average. In bold we report the metric that has been used to evaluate our model. The table shows how the class balance in the training set is reflected in the per-class performance on the official test set (measured with the official evaluation script). Indeed, the Certainly Real class is the most numerous in this case too, as well as the one where our model is best performing. It is interesting to notice how the model performs better on the Certainly False class than on the Certainly Real one despite the second being more populated; we speculate this is due to the similarity with the Probably Real class.</p>
        <p>Although we chose a different method for our submission to Task 1, we show that including the Italian text leads to promising results on the official test set. Table 3 shows how this approach performs on the official test set, and indeed it would improve over our submission.</p>
        <p>Unlike adding the Italian text, using the captions does not result in performance improvements. Table 4 shows the performance obtained when adding the captions of the images, generated by CoCa [<xref ref-type="bibr" rid="ref11">11</xref>] and processed with a separate roberta-large model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Task 2</title>
        <p>For Task 2, we chose to keep all parameters as in Task 1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>We have tackled the MULTI-Fake-DetectIVE task, trying to improve performance with textual data augmentation techniques.</p>
      <p>We show that our approach does provide some improvements. This is relevant, as text-based data augmentation is a novel way, made possible only recently by large pretrained models, to exploit the knowledge they encode, and it has several application settings [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
      <p>Moreover, in this report we show how using both Italian and English data at once, even though the English text is a translation of the Italian one, provides significant improvements in Task 1.</p>
      <p>On the contrary, the lower performance of the model in Task 2 underlines how the relations between text and images are not well captured by our model, and this offers the opportunity for further improvements.</p>
      <p>A structural limitation of our approach is that, although we know that the dataset is composed of both tweets and articles, and that the second document type is generally much longer than tweets, we have not experimented with ways to use this longer context.</p>
      <p>This too offers a promising future step: using longer-context transformers when embedding text, while keeping our overall scheme of translating to English, might give further improvements.</p>
      <p>Indeed, given the scarcity of longer-context transformers trained on Italian, the English translation might be useful in this case as well.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the European Union under the
scheme HORIZON-INFRA-2021-DEV-02-01 –
Preparatory phase of new ESFRI research infrastructure projects,
Grant Agreement n.101079043, “SoBigData RI PPP:
SoBigData RI Preparatory Phase Project”</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>A survey of fake news: Fundamental theories, detection methods, and opportunities</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          (
          <year>2020</year>
          ). URL: https:// doi.org/10.1145/3395046. doi:
          <volume>10</volume>
          .1145/3395046.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gambini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          , Tweepfake:
          <article-title>About detecting deepfake tweets</article-title>
          ,
          <source>PLOS ONE 16</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . URL: https://doi. org/10.1371/journal.pone.0251415. doi:
          <volume>10</volume>
          .1371/ journal.pone.
          <volume>0251415</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabbatini</surname>
          </string-name>
          ,
          <article-title>Multi-fake-detective at EVALITA 2023: Overview of the multimodal fake news detection and verification task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <article-title>From natural language processing to neural databases</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1033</fpage>
          -
          <lpage>1039</lpage>
          . URL: https://doi.org/10. 14778/3447689.3447706. doi:
          <volume>10</volume>
          .14778/3447689. 3447706.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khazatsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          , Detectgpt:
          <article-title>Zero-shot machine-generated text detection using probability curvature</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2301</volume>
          .
          <fpage>11305</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Oscar:
          <article-title>Object-semantics aligned pre-training for vision-language tasks</article-title>
          ,
          <source>ECCV</source>
          <year>2020</year>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Making visual representations matter in vision-language models, CVPR 2021 (2021).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] I. Gallo, A. Calefati, S. Nawaz, M. K. Janjua, Image and encoded text fusion for multi-modal classification, Digital Image Computing: Techniques and Applications (DICTA) (2018) 1-7.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748-8763. URL: https://proceedings.mlr.press/v139/radford21a.html.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seyedhosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Coca:
          <article-title>Contrastive captioners are image-text foundation models</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <fpage>2205</fpage>
          .
          <year>01917</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          , BLIP-2
          <article-title>: bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>CoRR abs/2301</source>
          .12597 (
          <year>2023</year>
          ). URL: https://doi.org/ 10.48550/arXiv.2301.12597. doi:
          <volume>10</volume>
          .48550/arXiv. 2301.12597. arXiv:
          <volume>2301</volume>
          .
          <fpage>12597</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Llava-med: Training a large language-andvision assistant for biomedicine in one day</article-title>
          ,
          <source>CoRR abs/2306</source>
          .00890 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.00890. doi:
          <volume>10</volume>
          .48550/arXiv.2306.00890. arXiv:
          <volume>2306</volume>
          .
          <fpage>00890</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <article-title>Parallel data, tools and interfaces in OPUS</article-title>
          ,
          <source>in: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Istanbul, Turkey,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , S. Thottingal,
          <article-title>OPUS-MT - Building open translation services for the World</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conferenec of the European Association for Machine Translation (EAMT)</source>
          , Lisbon, Portugal,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          , C. L. Zitnick, Microsoft COCO
          :
          <article-title>common objects in context</article-title>
          ,
          <source>CoRR abs/1405</source>
          .0312 (
          <year>2014</year>
          ). URL: http: //arxiv.org/abs/1405.0312. arXiv:
          <volume>1405</volume>
          .
          <fpage>0312</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2020. URL: https://openreview.net/forum?id=SyxS0T4tvS.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Mumuni, F. Mumuni, Data augmentation: A comprehensive survey of modern approaches, Array 16 (2022) 100258. URL: https://www.sciencedirect.com/science/article/pii/S2590005622000911. doi:10.1016/j.array.2022.100258.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>