<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On the Categorization of Corporate Multimodal Disinformation with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana-Maria Bucur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sónia Gonçalves</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interdisciplinary School of Doctoral Studies, University of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad de Sevilla</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>20</volume>
      <issue>2024</issue>
      <fpage>29</fpage>
      <lpage>39</lpage>
      <abstract>
<p>Disinformation is becoming more prevalent in the corporate sphere, especially as brands choose to promote their products through influencers or micro-celebrities who are perceived as reliable and impartial, but may facilitate false information. The spread of disinformation can have negative economic impacts on companies and brands and can even affect their reputation. Artificial Intelligence can help detect false information and has become increasingly important in combating disinformation. The current work addresses the problem of characterizing multimodal disinformation targeting corporations and provides a collection of content that spreads disinformation in digital media. The content was manually annotated with information about the target (Organization, Brand, or Other) and the source (Corporate, Advertising, or Other) of the false content. We conduct comprehensive experiments to evaluate the effectiveness of state-of-the-art Unimodal and Multimodal Large Language Models in identifying the source and target of the content.</p>
      </abstract>
      <kwd-group>
        <kwd>Corporate Multimodal Disinformation</kwd>
        <kwd>Multimodal Large Language Models</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the concept of disinformation refers to a deliberate and organized attempt to confuse
or manipulate people by providing dishonest information. In the corporate sphere, disinformation is
gaining more ground. It is orchestrated to persuade audiences and holds great appeal for advertisers
who promote its dissemination as a lure “because it fits more easily into people’s prejudices” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The issue can become even more dangerous when we consider that more and more brands choose to
promote their products through influencers or micro-celebrities, which can facilitate false information
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These opinion leaders are perceived as highly reliable and impartial, which allows them
to recommend products and services on various social media platforms and generate word of mouth
that brands leverage for their commercialization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The spread of disinformation can be a risk to companies and brands and cause a negative economic
impact [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that can even affect their reputation. Disinformation that can impact a company’s reputation
may stem from political, financial, emotional, or internal motivations, such as discontented employees
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, it is important for organizations to manage trusting relationships with the public.
Organizations can become victims of individuals and advanced technologies with the intention to
damage their reputation for twisted purposes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] through the use of deepfakes, a new form of fake
news that threatens companies, organizations, and brands [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. As the reputation of organizations
can be affected by the spread of disinformation, to protect the corporate image, communication officers
need to be aware of strategies to combat it, such as fact-checking. Artificial Intelligence has enabled the
implementation of automated approaches capable of detecting false information [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], also from a
multimodal perspective [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16 ref17 ref18">13, 14, 15, 16, 17, 18</xref>
        ].
      </p>
      <p>Unlike general disinformation, which can target individuals, events, or broad societal issues, corporate
disinformation often has direct financial implications and can damage trust in brands and organizations.
Recognizing the unique characteristics and potential impacts of such disinformation, our work aims to
deepen the understanding of which actors are targeted by corporate disinformation and which sources
spread it. By classifying the target of the false content, we can identify whether the affected entity is
an organization or a brand. Furthermore, identifying the source will enable affected entities to take
action and develop appropriate responses to counter the disinformation being spread about them.</p>
      <p>
        As there are many previous works on multimodal fake content detection [
        <xref ref-type="bibr" rid="ref13 ref14 ref16 ref17 ref18">18, 14, 13, 16, 17</xref>
        ], we aim
to characterize content that has already been fact-checked and confirmed as false. To the best of our
knowledge, this is the first time that the problem of multimodal disinformation targeting corporations
has been addressed automatically. For this purpose, a collection of multimodal content in Spanish
that was already fact-checked is collected and annotated by expert annotators with information about
the target and source of the content (Figure 1). Our dataset consists of 534 samples, together with
annotations for the target (Organization, Brand, or Other) and the source (Corporate, Advertising,
or Other) spreading disinformation. The false content can be targeted at an Organization, such as
a company, institution, or an individual representing them. It can also target a Brand or a person
associated with it. Alternatively, disinformation can be classified as Other, meaning it is not aimed at an
organization or brand but contains misleading information intended to deceive the general population.
Furthermore, false content can originate from various sources. It may stem from a Corporate origin,
where a corporate entity is responsible for spreading disinformation, rather than just an individual.
Alternatively, it could be a result of persuasive Advertising, typically in the form of paid posts on social
media. Lastly, false content may originate from Other sources, such as online users disseminating
misleading information.
      </p>
      <p>In this paper, we address the problem of characterizing multimodal disinformation targeting
corporations. Our work makes the following contributions:
• A collection of multimodal false content (visual and textual information in Spanish) that spreads
disinformation in digital media about corporations is compiled and annotated with information
about the source and target of the false content;
• Comprehensive experiments are conducted to evaluate the effectiveness of state-of-the-art
Unimodal and Multimodal Large Language Models (LLMs) in characterizing false content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Collection</title>
      <p>The dataset used in this work is obtained from the IBERIFIER repository, which includes online content
that has been fact-checked and verified. IBERIFIER is a project that aims to fight disinformation in
digital media in Spain and Portugal, in which data from various fact-checking websites is collected
and analyzed. In our research, we specifically focus on false content in Spanish that was verified by
EFE Verifica and Maldita.es, as these organizations contributed the most content to the IBERIFIER
database. Our dataset consists solely of posts that were confirmed by these fact-checking entities to
contain false information. This limits the dataset size, as obtaining fact-checked data is challenging. Our
dataset contains 496 samples from Maldita.es and 38 samples from EFE Verifica, with multimodal data
represented through both visual and textual information in Spanish. By deliberately focusing on posts
that have been verified to contain disinformation, we can more effectively evaluate the performance of
pre-trained visual transformer models and LLMs in characterizing deceptive information. This dataset
allows us to study and understand how these models identify the different targets and sources spreading
disinformation. The dataset is an essential resource for studying the effectiveness of LLMs in classifying
false content from visual and textual cues found in images.</p>
      <p>For each of the collected images, we also retrieved information about the format of the content and
the platform used to spread it using the IBERIFIER API. In Figure 2, we present the various formats
of false content. The most common type of false content is represented by pictures, followed by
screenshots from social media. Figure 3 shows the platforms used to spread the disinformation content.
The data suggests that social media platforms like Twitter, Facebook, TikTok, and Instagram are the
primary channels used to spread false content. However, we found that a considerable amount of false
information is also shared through messaging apps like WhatsApp.</p>
      <p>Two expert annotators have labeled each instance of false content with information about the target
and source. The target of the disinformation can be an Organization (either a company, an institution,
or a person representing it), a Brand (or a person representing it), or it can be Other, meaning that it is
not targeted towards an organization or a brand, and it contains false information intending to mislead
the general population about various topics, such as climate change, immigrants, conspiracy theories,
or local news. With regard to the different sources of false content (i.e., the origin of the content), the
content can be of Corporate origin (usually, there is an entire corporate entity behind the spread of
disinformation, not just an individual), persuasive Advertising (usually paid posts on social media),
or Other - usually false content spread by other users. The Other class also contains false content in
which the identity of the spreader does not appear in or cannot be inferred from the image/text (see
Figure 1, 1st and 4th example). We obtained a strong agreement between the two annotators (Cohen’s
κ = 0.90). The disagreements between them have been resolved by a senior researcher in the field. The
final dataset contains 347 samples targeting an organization, 87 targeting a brand, and 100 targeting
other entities. Regarding the sources of the false content, the dataset comprises 52 Corporate, 4
Advertising, and 478 Other sources.</p>
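      <p>Agreement of this magnitude can be reproduced, for instance, with scikit-learn; the sketch below is illustrative only and assumes the two annotators' labels are stored as parallel Python lists (the label values shown are hypothetical, not taken from the dataset).</p>
      <preformat>
# Illustrative sketch: inter-annotator agreement via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel label lists, one entry per annotated sample.
annotator_1 = ["Organization", "Brand", "Other", "Organization"]
annotator_2 = ["Organization", "Brand", "Other", "Brand"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
      </preformat>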
      <p>We showcase 4 examples from the collected data in Figure 1. The dataset includes different types
of disinformation found in digital media, which makes it difficult to identify the source and target
spreading the content. The first example shows an image with a figure representing the electoral results
from the Chueca neighborhood of Madrid. However, the image is spreading disinformation because the
results are actually from a municipality in Toledo with the same name. This is a classic example of how
disinformation can be spread by manipulating images and providing false information. The source of
the content was classified as Other because the origin of the information is unknown; it does not appear
in the text or the image. On the other hand, the target is Organization because the disinformation
publication affects one or more organizations, in this case, political parties (the People’s Party (PP) and
the Spanish Socialist Workers’ Party (PSOE)).</p>
      <p>The second example is a sponsored post from Facebook, asking individuals to complete a brief
questionnaire for the chance to purchase a discounted vacuum cleaner. However, this image represents
a classic phishing post where individuals are persuaded to share their banking information with
malicious entities. This example illustrates how social media platforms can be used to spread phishing
scams that can deceive unsuspecting users. The source of the content was categorized as Advertising
due to the information originating from a clearly identified advertising publication (sponsored content),
indicating that the advertising is conducted on a social network through payment. Conversely, the
target is identified as Brand because the disinformation publication impacts brands, specifically Dyson
and Lidl.</p>
      <p>The third example is a screenshot from a website that claims to be of Repsol S.A., an energy and
petrochemical company from Spain. However, the website is not the real website of the company,
and it is used for phishing. Malicious actors are using the website to trick users into sharing their
personal data. The content was categorized as Corporate because the web page appears to be created by
a corporate entity rather than an individual. On the other hand, the target is Brand, as it targets Repsol.</p>
      <p>In the fourth example, we present a screenshot from social media that is not targeted towards a
corporate entity or a brand, and it was labeled as Other - trying to mislead the general population. The
source of the content was labeled as Other, with no information about the source provided in the text
or image.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We perform experiments in zero-shot or few-shot settings to evaluate the effectiveness of state-of-the-art
visual transformer models and LLMs in characterizing false content within multimodal data.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-trained Visual Transformer Models</title>
        <p>
          Pre-trained visual transformer models, such as CLIP [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], have shown great performance on downstream
tasks without additional training, obtaining competitive results with a supervised baseline. CLIP was
pre-trained in a self-supervised manner on a large collection of image-text pairs with a contrastive
learning objective. The model was trained to maximize similarity between pairs of the same class and
minimize similarity between pairs of different classes. CLIP extracts embeddings by processing the
image and text through a visual and textual encoder, respectively. The embeddings are then mapped
to a shared space where similarities between image-text pairs can be computed. Pre-training allows
CLIP to represent images and text with similar content closer in the embedding space while unrelated
image-text pairs are represented further apart. In this way, the model can compute the relationship
between a given image and its corresponding textual description.
        </p>
        <p>
          We explore the effectiveness of using CLIP and similar models [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ] for zero-shot classification.
To achieve this, we investigate how well the models can predict the target and the source of online
disinformation. The zero-shot classification pipeline is presented in Figure 4. The process involves
passing images and texts, in our case, the names/descriptions of the categories, through frozen visual
and textual encoder models. The similarity between the image and each category name/description
is computed, and the category with the highest similarity score is selected as the final prediction. We
conducted our experiments in two settings: by providing the class names as labels and by providing
a short definition/description of the content we expect to find for each class. The two types of label
names, short and long, are shown in Figure 4. For target classification, we first experimented with short
label names such as Organization, Brand, and Other. We also experimented with longer names, such as
“a screenshot of false information targeting an organization (a company or an institution)”, etc. Inspired
by recent works highlighting the importance of the definitions of the concepts [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we added more
information to the text describing the categories. For the source classification, we followed a similar
approach and experimented with both the short label names, such as Corporate, Advertising, and Other,
and longer variants.
        </p>
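        <p>As a minimal sketch of this pipeline, the snippet below uses the Hugging Face implementation of CLIP; the checkpoint name and the exact wording of the long label descriptions are assumptions for illustration, not necessarily the prompts used in our experiments.</p>
        <preformat>
# Zero-shot classification with CLIP: encode the image and the candidate
# label texts, then pick the label with the highest image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Long label descriptions for the target classes (illustrative wording).
labels = [
    "a screenshot of false information targeting an organization (a company or an institution)",
    "a screenshot of false information targeting a brand",
    "a screenshot of false information targeting the general population",
]

image = Image.open("sample.jpg")
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; the highest-scoring label is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = labels[probs.argmax(dim=-1).item()]
        </preformat>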
        <p>
          In our experiments, we have tested the abilities of various pre-trained transformer models like
CLIP [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], OpenCLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], MetaCLIP [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], SigLIP [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. CLIP and OpenCLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] have identical vision
transformer architecture, but OpenCLIP was trained on the open-source dataset LAION-2B [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], whereas
CLIP was trained on a private dataset of image-text pairs. MetaCLIP [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] uses the same architecture
and training regime as above, but the authors ensure that only high-quality image-text pairs are used
for pre-training. SigLIP [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] replaces the softmax-based contrastive loss from CLIP with a sigmoid loss.
We experiment with different variants of the models, either base, large, or huge, if available.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large Language Models</title>
        <p>
          With the great success of leveraging LLMs in various vision and language tasks [
          <xref ref-type="bibr" rid="ref25 ref26 ref27 ref28">25, 26, 27, 28</xref>
          ], we
also choose to test their abilities in characterizing multimodal disinformation shared in digital media.
We experiment with two LLMs that have shown good results in language tasks, LLaMa-2 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], and
Mistral [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. LLaMa is a competitive model, with good results over a suite of benchmarks related to
commonsense reasoning, world knowledge, reading comprehension, etc. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Mistral is another LLM
that surpasses LLaMa-2 on all the tested benchmarks [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. We chose these two models to evaluate
their classification performance on our dataset based solely on the text found in the image and its
caption. The text found in images is written in Spanish (as presented in Figure 1) and was extracted
using Pytesseract (https://github.com/madmaze/pytesseract). The caption of the image was generated using BLIP-2 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We conducted zero-shot
and few-shot experiments using the aforementioned LLMs.
        </p>
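        <p>The sketch below illustrates how the textual inputs for the language-only LLMs could be assembled: OCR text extracted with Pytesseract and an image caption generated with BLIP-2. The BLIP-2 checkpoint name and the Spanish OCR language code are assumptions for illustration.</p>
        <preformat>
# Building the textual input for the language-only LLMs:
# OCR text from the image plus an automatically generated caption.
import pytesseract
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

image = Image.open("sample.jpg")

# Extract the Spanish text embedded in the image.
ocr_text = pytesseract.image_to_string(image, lang="spa")

# Generate a caption of the image with BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
caption_inputs = processor(images=image, return_tensors="pt")
caption_ids = blip2.generate(**caption_inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
        </preformat>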
        <p>
          Although these LLMs are pre-trained on data that is mostly in English, LLaMa, for example, was
pre-trained on 1.3B Spanish tokens (0.13% of the total corpus). This amount of pre-training tokens
makes it capable of processing Spanish content, although the results may not be as accurate as for
English data [30]. No information about the data used for pre-training Mistral models is available [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
Because the text from the multimodal false content is in Spanish, we chose to include in our experiments
a version of LLaMa-2-7B fine-tuned on Spanish instructions (clibrain/Llama-2-7b-ft-instruct-es).
        </p>
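        <p>For illustration, the Spanish-instruction-tuned variant mentioned above can be queried as in the sketch below; the prompt wording is hypothetical and does not reproduce the exact prompts used in our experiments (ocr_text and caption come from the previous sketch).</p>
        <preformat>
# Querying the Spanish-instruction-tuned LLaMa-2 variant with a Spanish prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clibrain/Llama-2-7b-ft-instruct-es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical Spanish prompt for target classification.
prompt = (
    "El siguiente texto proviene de una imagen que difunde información falsa.\n"
    f"Texto: {ocr_text}\n"
    f"Descripción de la imagen: {caption}\n"
    "Clasifica el objetivo de la desinformación como Organization, Brand u Other. "
    "Responde con una sola palabra."
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
        </preformat>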
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multimodal Large Language Models</title>
        <p>
          In our work, we also conduct experiments using the Multimodal LLM LLaVa [31], which is a
general-purpose visual and language model (Figure 5). LLaVa uses a language model (in our case, LLaMa-2
[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]) to process both the visual information from the image and the text of the language instructions.
LLaVa uses a pre-trained CLIP vision transformer to process visual input, which is then projected in
the same embedding space as the text. The visual and text embeddings are then fed to LLaMa, which
generates a suitable language response. In our experiments we use LLaVA-v1.5 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and LLaVA-v1.5
Q-Instruct [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. We chose to use LLaVA-v1.5, as it is an improved version of the original LLaVA,
and it achieves state-of-the-art results on various benchmarks related to visual question answering.
LLaVA-v1.5 Q-Instruct improves over the aforementioned versions by enhancing low-level visual
perception abilities [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
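        <p>A zero-shot query to LLaVA-v1.5 can be sketched as below through the Hugging Face interface; the checkpoint identifier and the prompt wording are assumptions for illustration (half-precision and device placement are omitted for brevity).</p>
        <preformat>
# Zero-shot target classification with LLaVA-v1.5: the processor combines the
# image and the instruction, and the model generates a short textual answer.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("sample.jpg")
prompt = (
    "USER: &lt;image&gt;\nThis image spreads false information. "
    "Is the target an Organization, a Brand, or Other? Answer with one word. ASSISTANT:"
)
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(processor.decode(output_ids[0], skip_special_tokens=True))
        </preformat>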
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>As part of our experiments, we tested the zero-shot and few-shot (one-shot) capabilities of various
models. Our test set is comprised of 519 samples, as 15 samples were kept to potentially be used for the
few-shot settings. We used the open-source implementations for all the models. Due to computational
limitations, we only experimented with 7B variants of LLMs and Multimodal LLMs. While generating
the output, we use the default temperature of 0.7. Additionally, we post-processed the generated output
to remove any punctuation, quotation marks, or explanations generated by the models. The prompts
for LLaMa-2-7B and Mistral-7B were written in English. For LLaMa-2-7B-ES, given that it is a model
fine-tuned for the Spanish language, we use prompts written in Spanish.</p>
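      <p>The output post-processing can be sketched as below; the fallback to a default class when no label is recognized is an assumption for illustration.</p>
      <preformat>
# Normalizing a generated answer onto one of the class labels: strip
# punctuation and quotation marks, then search for a known label.
import re

TARGET_LABELS = ["Organization", "Brand", "Other"]

def normalize_prediction(generated, labels=TARGET_LABELS, default="Other"):
    cleaned = re.sub(r"[^\w\s]", " ", generated).lower()  # drop punctuation/quotes
    for label in labels:
        if label.lower() in cleaned:
            return label
    return default  # hypothetical fallback when no label is recognized
      </preformat>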
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluate each model for the two tasks, either target or source classification, by computing F1
scores for each class. We also measure the performance over each task using the Weighted-F1 score, given
that the categories of our dataset are highly imbalanced. We present the results of the zero-shot
classification using CLIP, MetaCLIP, OpenCLIP, and SigLIP in Table 1. For the majority of the models
and variants, using longer descriptions of the class names improved the results of the classification.
The best model for classifying the target of the false multimodal content was OpenCLIP (huge variant), obtaining
a Weighted-F1 score of 55.05%. Although SigLIP obtained an 86.18% Weighted-F1 score for predicting
the source of disinformation, it cannot accurately make predictions for all the categories.</p>
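      <p>These metrics can be computed, for instance, with scikit-learn; the label lists below are hypothetical placeholders.</p>
      <preformat>
# Per-class F1 and Weighted-F1 over the (imbalanced) classes.
from sklearn.metrics import f1_score

y_true = ["Organization", "Brand", "Other", "Organization"]  # hypothetical gold labels
y_pred = ["Organization", "Other", "Other", "Organization"]  # hypothetical predictions

per_class_f1 = f1_score(y_true, y_pred, labels=["Organization", "Brand", "Other"], average=None)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
      </preformat>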
      <p>In Table 2, we showcase the performance of the LLMs in zero-shot and few-shot settings. LLaMa-2-7B,
Mistral-7B and LLaMa-2-7B-ES use only the text extracted from the image and its generated caption.
By providing only one example in the prompt, the performance of LLaMa-2-7B improves by 28.15%.
For Mistral-7B, there is a 10.49% improvement in Weighted-F1 score for target classification, while,
for LLaMa-2-7B-ES, the improvement is minimal between zero-shot and few-shot settings. However,
the model fine-tuned on Spanish instructions, LLaMa-2-7B-ES, obtained the best Weighted-F1 score of
64.01% in the few-shot setting and the second-best Weighted-F1 score of 62.31% in the zero-shot setting.</p>
      <p>Predicting the target of disinformation is easier, usually relying on specific cues, such as the presence
of organizations’ or brands’ logos or names appearing in the picture or written in text. However,
predicting the source of disinformation from multimodal content is a harder task, as in many instances,
no information about it appears, and the source is unknown. For source classification, the LLMs
sometimes only predict the Other class, failing to predict other categories. Using LLaMa-2-7B-ES
in the one-shot setting with the text from the image and its caption as input proved to be a suitable
approach for target classification, surpassing all other visual models, such as CLIP, MetaCLIP, OpenCLIP
and SigLIP. The limitations of general language models trained solely on English data are highlighted by
the best performance of LLaMa-2-7B-ES, which was adapted to Spanish data. This further emphasizes
the need to develop language-specialized LLMs.</p>
      <p>In Table 3, we show the results of LLaVA-v1.5-7B for zero-shot classification. LLaVA-v1.5-7B obtains
a better performance of 51.88% Weighted-F1 score for target classification, while LLaVA-v1.5-7B
(Q-Instruct) obtains a better performance for source classification (74.16% Weighted-F1 score). In zero-shot
settings, LLaVA-v1.5-7B outperforms the English-based language-only counterparts, LLaMa-2-7B and
Mistral-7B, for target classification, obtaining a Weighted-F1 score of 51.88%. However, it has a lower
performance than LLaMa-2-7B-ES. According to our experiments, while general LLMs pre-trained
on mostly English data can provide satisfactory results for identifying false content in our corporate
multimodal disinformation dataset, models specifically adapted for a particular language perform better.
This is because they can make use of the Spanish text present in the multimodal content, leading to
enhanced performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, our aim was to create a valuable resource for characterizing corporate multimodal
disinformation from digital media featuring both visual and textual elements in Spanish, annotated with
details about the source and target of the false content. By publishing our dataset, we aim to encourage
further research in this area and the development of more effective disinformation characterization
technologies. Our comprehensive experiments have assessed the efficacy of state-of-the-art multimodal
transformer models and LLMs in characterizing false content within images. Our findings reveal that
predicting the target of the false content is easier than predicting the source, as the latter requires
information that may not be easily represented in the multimodal data. In terms of zero-shot versus
few-shot settings, providing one example for each class improved the performance for target classification by
28.15% for LLaMa-2-7B and 10.49% for Mistral-7B in terms of Weighted-F1 score. LLaVA, the Multimodal
LLM that we tested, obtained a Weighted-F1 score of 51.88% in a zero-shot setting for target
classification. The best result for target classification, a 64.01% Weighted-F1 score, was obtained by
LLaMa-2-7B-ES in the one-shot setting, suggesting that LLMs specifically adapted for a particular language
are needed when processing non-English data.</p>
      <p>Our goal is to assist corporate entities in monitoring digital streams for fake news that could potentially
harm their reputations. In our future work, we intend to expand our dataset and develop methods for
identifying the specific brands and organizations targeted by false content. Moreover, we would like to
expand our analysis to recently-released LLMs, such as LLaMa-3, LLaVA-NeXT, GPT-4V [32], Gemini
Pro, and InstructBLIP [33].</p>
    </sec>
    <sec id="sec-7">
      <title>Limitations</title>
      <p>One of the limitations of the current study is the small and imbalanced number of samples in each
class from the collected dataset. Our approach relies on data that was already fact-checked, which
is challenging to obtain. Due to the insufficient number of samples in some categories, our models struggle to
accurately predict those classes. To address this limitation, our future work will focus on expanding the
dataset. Specifically, we will target the collection of more samples for underrepresented classes, such as
Brand for target classification and Corporate and Advertising for source classification.</p>
      <p>Another limitation is the use of 7B variants of LLMs and Multimodal LLMs in our experiments due
to computational limitations. Although LLaMa-2-7B-ES and LLaVA-v1.5-7B have shown promising results
of 64.01% and 51.88% Weighted-F1 for target classification, using bigger variants of the models could
lead to further improvements in the results [34].</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work of Paolo Rosso was carried out in the framework of FAKE news and HATE speech
(FAKEnHATEPdC) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR
(PDC2022-133118-I00), Iberian Digital Media Observatory (IBERIFIER Plus) funded by the EC
(DIGITAL2023-DEPLOY-04) under reference 101158511, and Malicious Actors Profiling and Detection in Online
Social Networks Through Artificial Intelligence (MARTINI) funded by MCIN/AEI/10.13039/501100011033
and by European Union NextGenerationEU/PRTR (PCI2022-135008-2).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ireton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Posetti</surname>
          </string-name>
          , Journalism, fake news &amp;
          <article-title>disinformation: handbook for journalism education and training</article-title>
          , Unesco Publishing,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Berthon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Treen</surname>
          </string-name>
          , L. Pitt,
          <article-title>How truthiness, fake news and post-fact endanger brands and what to do about it</article-title>
          ,
          <source>NIM Marketing Intelligence Review</source>
          <volume>10</volume>
          (
          <year>2018</year>
          )
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <article-title>Alt. health influencers: how wellness culture and web culture have been weaponised to promote conspiracy theories and far-right extremism during the covid-19 pandemic</article-title>
          ,
          <source>European Journal of Cultural Studies</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>M. De Veirman</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cauberghe</surname>
          </string-name>
          , L. Hudders,
          <article-title>Marketing through instagram influencers: the impact of number of followers and product divergence on brand attitude</article-title>
          ,
          <source>International journal of advertising 36</source>
          (
          <year>2017</year>
          )
          <fpage>798</fpage>
          -
          <lpage>828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Christov</surname>
          </string-name>
          , et al.,
          <article-title>Economic efects of the fake news on companies and the need of new pr strategies</article-title>
          ,
          <source>Journal of Sustainable Development</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <fpage>41</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <article-title>What's the damage?. measuring the impact of fake news on corporate reputation can act as a guide for companies to navigate a post-truth landscape, CommunicationDirector</article-title>
          .com (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peterson</surname>
          </string-name>
          ,
          <article-title>A high-speed world with fake news: brand managers take warning</article-title>
          ,
          <source>Journal of Product &amp; Brand Management</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>234</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Galston</surname>
          </string-name>
          ,
          <article-title>Is seeing still believing? the deepfake challenge to truth in politics</article-title>
          , Brookings
          <string-name>
            <surname>Institution</surname>
          </string-name>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gomes-Gonçalves</surname>
          </string-name>
          ,
          <article-title>Los deepfakes como una nueva forma de desinformación corporativa-una revisión de la literatura</article-title>
          ,
          <source>IROCAMM: International Review of Communication and Marketing Mix</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>22</fpage>
          -
          <lpage>38</lpage>
          . (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Westerlund</surname>
          </string-name>
          ,
          <article-title>The emergence of deepfake technology: A review, Technology innovation management review 9 (</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Babakar</surname>
          </string-name>
          , W. Moy,
          <source>The state of automated factchecking, Full Fact</source>
          <volume>28</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Studying fake news spreading, polarisation dynamics, and manipulation by bots: A tale of networks and language</article-title>
          , Computer science review
          <volume>47</volume>
          (
          <year>2023</year>
          )
          <fpage>100531</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          , H. Liu,
          <article-title>Toward a multilingual and multimodal data repository for covid-19 disinformation, in: IEEE Big Data</article-title>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>4325</fpage>
          -
          <lpage>4330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Zhai,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Jeon,
          <article-title>Towards multimodal disinformation detection by vision-language knowledge interaction</article-title>
          ,
          <source>Information Fusion</source>
          <volume>102</volume>
          (
          <year>2024</year>
          )
          <fpage>102037</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , SceneFND:
          <article-title>Multimodal fake news detection by modelling scene context information</article-title>
          ,
          <source>Journal of Information Science</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tufchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of multimodal fake news detection techniques: advances, challenges, and opportunities</article-title>
          ,
          <source>International Journal of Multimedia Information Retrieval</source>
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wilkes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Teramoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <article-title>Multimodal analysis of disinformation and misinformation</article-title>
          ,
          <source>Royal Society Open Science</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>230964</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Sabbatini, Multi-fake-detective at evalita 2023: Overview of the multimodal fake news detection and verification task</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Tan</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <article-title>Demystifying clip data</article-title>
          ,
          <source>in: Proceedings of ICLR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , L. Beyer,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <source>in: Proceedings of ICCV</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peskine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Grubisic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Definitions matter: Guiding gpt for multi-label classification</article-title>
          ,
          <source>in: Findings of ACL: EMNLP</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>4054</fpage>
          -
          <lpage>4063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , L. Schmidt, Openclip,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , et al.,
          <article-title>Laion-5b: An open large-scale dataset for training next generation image-text models</article-title>
          ,
          <source>in: Proceedings of NeurIPS</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <source>Mistral 7b, arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>in: Proceedings of ITIF Workshop</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , E. Zhang,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , et al.,
          <article-title>Qinstruct: Improving low-level visual abilities for multi-modality foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.06783</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] H. Choi, Y. Yoon, S. Yoon, K. Park, How does fake news use a thumbnail? clip-based multimodal detection on the unrepresentative news image, in: Proceedings of the CONSTRAINT Workshop, 2022, pp. 86-94.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Proceedings of NeurIPS, 2024.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] OpenAI, GPT-4V(ision) system card, preprint (2023).</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. arXiv:2305.06500.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] J. Lucas, A. Uchendu, M. Yamashita, J. Lee, S. Rohatgi, D. Lee, Fighting fire with fire: The dual role of llms in crafting and detecting elusive disinformation, in: Proceedings of EMNLP, 2023, pp. 14279-14305.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>