<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On the Categorization of Corporate Multimodal Disinformation with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana-Maria Bucur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sónia Gonçalves</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interdisciplinary School of Doctoral Studies, University of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad de Sevilla</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>20</volume>
      <issue>2024</issue>
      <fpage>29</fpage>
      <lpage>39</lpage>
      <abstract>
<p>Disinformation is becoming more prevalent in the corporate sphere, especially as brands choose to promote their products through influencers or micro-celebrities who are perceived as reliable and impartial, but may facilitate false information. The spread of disinformation can have negative economic impacts on companies and brands and can even affect their reputation. Artificial Intelligence can help detect false information and has become increasingly important in combating disinformation. The current work addresses the problem of characterizing multimodal disinformation targeting corporations and provides a collection of content that spreads disinformation in digital media. The content was manually annotated with information about the target (Organization, Brand, or Other) and the source (Corporate, Advertising, or Other) of the false content. We conduct comprehensive experiments to evaluate the effectiveness of state-of-the-art Unimodal and Multimodal Large Language Models in identifying the source and target of the content.</p>
      </abstract>
      <kwd-group>
        <kwd>Corporate Multimodal Disinformation</kwd>
        <kwd>Multimodal Large Language Models</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the concept of disinformation refers to a deliberate and organized attempt to confuse
or manipulate people by providing dishonest information. In the corporate sphere, disinformation is
gaining more ground. It is orchestrated to persuade audiences and holds great appeal for advertisers
who promote its dissemination as a lure “because it fits more easily into people’s prejudices” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The issue can become even more dangerous when we consider that more and more brands choose to
promote their products through influencers or micro-celebrities, which can facilitate false information
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These opinion leaders are perceived as highly reliable and impartial, which allows them
to recommend products and services on various social media platforms and generate word of mouth
that brands leverage for their commercialization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The spread of disinformation can be a risk to companies and brands and cause a negative economic
impact [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that can even affect their reputation. Disinformation that can impact a company’s reputation
may stem from political, financial, emotional, or internal motivations, such as discontented employees
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, it is important for organizations to manage trusting relationships with the public.
Organizations can become victims of individuals and advanced technologies with the intention to
damage their reputation for twisted purposes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] through the use of deepfakes, a new form of fake
news that threatens companies, organizations, and brands [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. As the reputation of organizations
can be affected by the spread of disinformation, to protect the corporate image, communication officers
need to be aware of strategies to combat it, such as fact-checking. Artificial Intelligence has enabled the
implementation of automated approaches capable of detecting false information [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], also from a
multimodal perspective [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16 ref17 ref18">13, 14, 15, 16, 17, 18</xref>
        ].
      </p>
      <p>Unlike general disinformation, which can target individuals, events, or broad societal issues, corporate
disinformation often has direct financial implications and can damage trust in brands and organizations.
Recognizing the unique characteristics and potential impacts of such disinformation, our work aims to
deepen the understanding of which actors are targeted by corporate disinformation and which sources
spread it. By classifying the target of the false content, we can identify whether the affected entity is
an organization or a brand. Furthermore, identifying the source will enable affected entities to take
action and develop appropriate responses to counter the disinformation being spread about them.</p>
      <p>
        As there are many previous works on multimodal fake content detection [
        <xref ref-type="bibr" rid="ref13 ref14 ref16 ref17 ref18">18, 14, 13, 16, 17</xref>
        ], we aim
to characterize content that has already been fact-checked and confirmed as false. To the best of our
knowledge, this is the first time that the problem of multimodal disinformation targeting corporations
has been addressed automatically. For this purpose, a collection of multimodal content in Spanish
that was already fact-checked is collected and annotated by expert annotators with information about
the target and source of the content (Figure 1). Our dataset consists of 534 samples, together with
annotations for the target (Organization, Brand, or Other) and the source (Corporate, Advertising,
or Other) spreading disinformation. The false content can be targeted at an Organization, such as
a company, institution, or an individual representing them. It can also target a Brand or a person
associated with it. Alternatively, disinformation can be classified as Other, meaning it is not aimed at an
organization or brand but contains misleading information intended to deceive the general population.
Furthermore, false content can originate from various sources. It may stem from a Corporate origin,
where a corporate entity is responsible for spreading disinformation, rather than just an individual.
Alternatively, it could be a result of persuasive Advertising, typically in the form of paid posts on social
media. Lastly, false content may originate from Other sources, such as online users disseminating
misleading information.
      </p>
      <p>In this paper, we address the problem of characterizing multimodal disinformation targeting
corporations. Our work makes the following contributions:
• A collection of multimodal false content (visual and textual information in Spanish) that spreads
disinformation in digital media about corporations is compiled and annotated with information
about the source and target of the false content;
• Comprehensive experiments are conducted to evaluate the effectiveness of state-of-the-art
Unimodal and Multimodal Large Language Models (LLMs) in characterizing false content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Collection</title>
      <p>The dataset used in this work is obtained from the IBERIFIER repository, which includes online content
that has been fact-checked and verified. IBERIFIER is a project that aims to fight disinformation in
digital media in Spain and Portugal, in which data from various fact-checking websites is collected
and analyzed. In our research, we specifically focus on false content in Spanish that was verified by
EFE Verifica and Maldita.es, as these organizations contributed the most content to the IBERIFIER
database. Our dataset consists solely of posts that were confirmed by these fact-checking entities to
contain false information. This limits the dataset size, as obtaining fact-checked data is challenging. Our
dataset contains 496 samples from Maldita.es and 38 samples from EFE Verifica, with multimodal data
represented through both visual and textual information in Spanish. By deliberately focusing on posts
that have been verified to contain disinformation, we can more effectively evaluate the performance of
pre-trained visual transformer models and LLMs in characterizing deceptive information. This dataset
allows us to study and understand how these models identify the different targets and sources spreading
disinformation. The dataset is an essential resource for studying the effectiveness of LLMs in classifying
false content from visual and textual cues found in images.</p>
      <p>For each of the collected images, we also retrieved information about the format of the content and
the platform used to spread it using the IBERIFIER API. In Figure 2, we present the various formats
of false content. The most common type of false content is represented by pictures, followed by
screenshots from social media. Figure 3 shows the platforms used to spread the disinformation content.
The data suggests that social media platforms like Twitter, Facebook, TikTok, and Instagram are the
primary channels used to spread false content. However, we found that a considerable amount of false
information is also shared through messaging apps like WhatsApp.</p>
      <p>Two expert annotators have labeled each instance of false content with information about the target
and source. The target of the disinformation can be an Organization (either a company, an institution,
or a person representing it), a Brand (or a person representing it), or it can be Other, meaning that it is
not targeted towards an organization or a brand, and it contains false information intending to mislead
the general population about various topics, such as climate change, immigrants, conspiracy theories,
or local news. With regard to the different sources of false content (i.e., the origin of the content), the
content can be of Corporate origin (usually, there is an entire corporate entity behind the spread of
disinformation, not just an individual), persuasive Advertising (usually paid posts on social media),
or Other - usually false content spread by other users. The Other class also contains false content in
which the identity of the spreader does not appear in or cannot be inferred from the image/text (see
Figure 1, 1st and 4th example). We obtained a strong agreement between the two annotators (Cohen’s
κ = 0.90). The disagreements between them have been resolved by a senior researcher in the field. The
final dataset contains 347 samples targeting an organization, 87 targeting a brand, and 100 targeting
other entities. Regarding the sources of the false content, the dataset comprises 52 Corporate, 4
Advertising, and 478 Other sources.</p>
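      <p>Agreement of this magnitude can be reproduced, for instance, with scikit-learn; the sketch below is illustrative only and assumes the two annotators' labels are stored as parallel Python lists (the label values shown are hypothetical, not taken from the dataset).</p>
      <preformat>
# Illustrative sketch: inter-annotator agreement via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel label lists, one entry per annotated sample.
annotator_1 = ["Organization", "Brand", "Other", "Organization"]
annotator_2 = ["Organization", "Brand", "Other", "Brand"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
      </preformat>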
      <p>We showcase 4 examples from the collected data in Figure 1. The dataset includes different types
of disinformation found in digital media, which makes it difficult to identify the source and target
spreading the content. The first example shows an image with a figure representing the electoral results
from the Chueca neighborhood of Madrid. However, the image is spreading disinformation because the
results are actually from a municipality in Toledo with the same name. This is a classic example of how
disinformation can be spread by manipulating images and providing false information. The source of
the content was classified as Other because the origin of the information is unknown; it does not appear
in the text or the image. On the other hand, the target is Organization because the disinformation
publication affects one or more organizations, in this case, political parties (the People’s Party (PP) and
the Spanish Socialist Workers’ Party (PSOE)).</p>
      <p>The second example is a sponsored post from Facebook, asking individuals to complete a brief
questionnaire for the chance to purchase a discounted vacuum cleaner. However, this image represents
a classic phishing post where individuals are persuaded to share their banking information with
malicious entities. This example illustrates how social media platforms can be used to spread phishing
scams that can deceive unsuspecting users. The source of the content was categorized as Advertising
due to the information originating from a clearly identified advertising publication (sponsored content),
indicating that the advertising is conducted on a social network through payment. Conversely, the
target is identified as Brand because the disinformation publication impacts brands, specifically Dyson
and Lidl.</p>
      <p>The third example is a screenshot from a website that claims to be of Repsol S.A., an energy and
petrochemical company from Spain. However, the website is not the real website of the company,
and it is used for phishing. Malicious actors are using the website to trick users into sharing their
personal data. The content was categorized as Corporate because the web page appears to be created by
a corporate entity rather than an individual. On the other hand, the target is Brand, as it targets Repsol.</p>
      <p>In the fourth example, we present a screenshot from social media that is not targeted towards a
corporate entity or a brand, and it was labeled as Other - trying to mislead the general population. The
source of the content was labeled as Other, with no information about the source provided in the text
or image.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We perform experiments in zero-shot or few-shot settings to evaluate the effectiveness of state-of-the-art
visual transformer models and LLMs in characterizing false content within multimodal data.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-trained Visual Transformer Models</title>
        <p>
          Pre-trained visual transformer models, such as CLIP [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], have shown great performance on downstream
tasks without additional training, obtaining competitive results with a supervised baseline. CLIP was
pre-trained in a self-supervised manner on a large collection of image-text pairs with a contrastive
learning objective. The model was trained to maximize similarity between pairs of the same class and
minimize similarity between pairs of different classes. CLIP extracts embeddings by processing the
image and text through a visual and textual encoder, respectively. The embeddings are then mapped
to a shared space where similarities between image-text pairs can be computed. Pre-training allows
CLIP to represent images and text with similar content closer in the embedding space while unrelated
image-text pairs are represented further apart. In this way, the model can compute the relationship
between a given image and its corresponding textual description.
        </p>
        <p>
          We explore the effectiveness of using CLIP and similar models [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ] for zero-shot classification.
To achieve this, we investigate how well the models can predict the target and the source of online
disinformation. The zero-shot classification pipeline is presented in Figure 4. The process involves
passing images and texts, in our case, the names/descriptions of the categories, through frozen visual
and textual encoder models. The similarity between the image and each category name/description
is computed, and the category with the highest similarity score is selected as the final prediction. We
conducted our experiments in two settings: by providing the class names as labels and by providing
a short definition/description of the content we expect to find for each class. The two types of label
names, short and long, are shown in Figure 4. For target classification, we first experimented with short
label names such as Organization, Brand, and Other. We also experimented with longer names, such as
“a screenshot of false information targeting an organization (a company or an institution)”, etc. Inspired
by recent works highlighting the importance of the definitions of the concepts [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we added more
information to the text describing the categories. For the source classification, we followed a similar
approach and experimented with both the short label names, such as Corporate, Advertising, and Other,
and longer variants.
        </p>
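        <p>As a minimal sketch of this pipeline, the snippet below uses the Hugging Face implementation of CLIP; the checkpoint name and the exact wording of the long label descriptions are assumptions for illustration, not necessarily the prompts used in our experiments.</p>
        <preformat>
# Zero-shot classification with CLIP: encode the image and the candidate
# label texts, then pick the label with the highest image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Long label descriptions for the target classes (illustrative wording).
labels = [
    "a screenshot of false information targeting an organization (a company or an institution)",
    "a screenshot of false information targeting a brand",
    "a screenshot of false information targeting the general population",
]

image = Image.open("sample.jpg")
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; the highest-scoring label is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = labels[probs.argmax(dim=-1).item()]
        </preformat>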
        <p>
          In our experiments, we have tested the abilities of various pre-trained transformer models like
CLIP [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], OpenCLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], MetaCLIP [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], SigLIP [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. CLIP and OpenCLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] have identical vision
transformer architecture, but OpenCLIP was trained on the open-source dataset LAION-2B [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], whereas
CLIP was trained on a private dataset of image-text pairs. MetaCLIP [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] uses the same architecture
and training regime as above, but the authors ensure that only high-quality image-text pairs are used
for pre-training. SigLIP [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] replaces the softmax-based contrastive loss from CLIP with a sigmoid loss.
We experiment with different variants of the models, either base, large, or huge, if available.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large Language Models</title>
        <p>
          With the great success of leveraging LLMs in various vision and language tasks [
          <xref ref-type="bibr" rid="ref25 ref26 ref27 ref28">25, 26, 27, 28</xref>
          ], we
also choose to test their abilities in characterizing multimodal disinformation shared in digital media.
We experiment with two LLMs that have shown good results in language tasks, LLaMa-2 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], and
Mistral [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. LLaMa is a competitive model, with good results over a suite of benchmarks related to
commonsense reasoning, world knowledge, reading comprehension, etc. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Mistral is another LLM
that surpasses LLaMa-2 on all the tested benchmarks [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. We chose these two models to evaluate
their classification performance on our dataset based solely on the text found in the image and its
caption. The text found in images is written in Spanish (as presented in Figure 1) and was extracted
using Pytesseract (https://github.com/madmaze/pytesseract). The caption of the image was generated using BLIP-2 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We conducted zero-shot
and few-shot experiments using the aforementioned LLMs.
        </p>
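        <p>The sketch below illustrates how the textual inputs for the language-only LLMs could be assembled: OCR text extracted with Pytesseract and an image caption generated with BLIP-2. The BLIP-2 checkpoint name and the Spanish OCR language code are assumptions for illustration.</p>
        <preformat>
# Building the textual input for the language-only LLMs:
# OCR text from the image plus an automatically generated caption.
import pytesseract
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

image = Image.open("sample.jpg")

# Extract the Spanish text embedded in the image.
ocr_text = pytesseract.image_to_string(image, lang="spa")

# Generate a caption of the image with BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
caption_inputs = processor(images=image, return_tensors="pt")
caption_ids = blip2.generate(**caption_inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
        </preformat>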
        <p>
          Although these LLMs are pre-trained on data that is mostly in English, LLaMa, for example, was
pre-trained on 1.3B Spanish tokens (0.13% of the total corpus). This amount of pre-training tokens
makes it capable of processing Spanish content, although the results may not be as accurate as for
English data [30]. No information about the data used for pre-training Mistral models is available [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
Because the text from the multimodal false content is in Spanish, we chose to include in our experiments
a version of LLaMa-2-7B fine-tuned on Spanish instructions (clibrain/Llama-2-7b-ft-instruct-es).
        </p>
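        <p>For illustration, the Spanish-instruction-tuned variant mentioned above can be queried as in the sketch below; the prompt wording is hypothetical and does not reproduce the exact prompts used in our experiments (ocr_text and caption come from the previous sketch).</p>
        <preformat>
# Querying the Spanish-instruction-tuned LLaMa-2 variant with a Spanish prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clibrain/Llama-2-7b-ft-instruct-es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical Spanish prompt for target classification.
prompt = (
    "El siguiente texto proviene de una imagen que difunde información falsa.\n"
    f"Texto: {ocr_text}\n"
    f"Descripción de la imagen: {caption}\n"
    "Clasifica el objetivo de la desinformación como Organization, Brand u Other. "
    "Responde con una sola palabra."
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
        </preformat>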
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multimodal Large Language Models</title>
        <p>
          In our work, we also conduct experiments using the Multimodal LLM LLaVa [31], which is a
general-purpose visual and language model (Figure 5). LLaVa uses a language model (in our case, LLaMa-2
[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]) to process both the visual information from the image and the text of the language instructions.
LLaVa uses a pre-trained CLIP vision transformer to process visual input, which is then projected in
the same embedding space as the text. The visual and text embeddings are then fed to LLaMa, which
generates a suitable language response. In our experiments we use LLaVA-v1.5 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and LLaVA-v1.5
Q-Instruct [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. We chose to use LLaVA-v1.5, as it is an improved version of the original LLaVA,
and it achieves state-of-the-art results on various benchmarks related to visual question answering.
LLaVA-v1.5 Q-Instruct improves over the aforementioned versions by enhancing low-level visual
perception abilities [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
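        <p>A zero-shot query to LLaVA-v1.5 can be sketched as below through the Hugging Face interface; the checkpoint identifier and the prompt wording are assumptions for illustration (half-precision and device placement are omitted for brevity).</p>
        <preformat>
# Zero-shot target classification with LLaVA-v1.5: the processor combines the
# image and the instruction, and the model generates a short textual answer.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("sample.jpg")
prompt = (
    "USER: &lt;image&gt;\nThis image spreads false information. "
    "Is the target an Organization, a Brand, or Other? Answer with one word. ASSISTANT:"
)
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(processor.decode(output_ids[0], skip_special_tokens=True))
        </preformat>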
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>As part of our experiments, we tested the zero-shot and few-shot (one-shot) capabilities of various
models. Our test set is comprised of 519 samples, as 15 samples were kept to potentially be used for the
few-shot settings. We used the open-source implementations for all the models. Due to computational
limitations, we only experimented with 7B variants of LLMs and Multimodal LLMs. While generating
the output, we use the default temperature of 0.7. Additionally, we post-processed the generated output
to remove any punctuation, quotation marks, or explanations generated by the models. The prompts
for LLaMa-2-7B and Mistral-7B were written in English. For LLaMa-2-7B-ES, given that it is a model
fine-tuned for the Spanish language, we use prompts written in Spanish.</p>
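      <p>The output post-processing can be sketched as below; the fallback to a default class when no label is recognized is an assumption for illustration.</p>
      <preformat>
# Normalizing a generated answer onto one of the class labels: strip
# punctuation and quotation marks, then search for a known label.
import re

TARGET_LABELS = ["Organization", "Brand", "Other"]

def normalize_prediction(generated, labels=TARGET_LABELS, default="Other"):
    cleaned = re.sub(r"[^\w\s]", " ", generated).lower()  # drop punctuation/quotes
    for label in labels:
        if label.lower() in cleaned:
            return label
    return default  # hypothetical fallback when no label is recognized
      </preformat>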
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluate each model for the two tasks, either target or source classification, by computing F1
scores for each class. We also measure the performance over each task using the Weighted-F1 score, given
that the categories of our dataset are highly imbalanced. We present the results of the zero-shot
classification using CLIP, MetaCLIP, OpenCLIP, and SigLIP in Table 1. For the majority of the models
and variants, using longer descriptions of the class names improved the results of the classification.
The best model for classifying the target of the false multimodal content was OpenCLIP (huge variant), obtaining
a Weighted-F1 score of 55.05%. Although SigLIP obtained an 86.18% Weighted-F1 score for predicting
the source of disinformation, it cannot accurately make predictions for all the categories.</p>
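      <p>These metrics can be computed, for instance, with scikit-learn; the label lists below are hypothetical placeholders.</p>
      <preformat>
# Per-class F1 and Weighted-F1 over the (imbalanced) classes.
from sklearn.metrics import f1_score

y_true = ["Organization", "Brand", "Other", "Organization"]  # hypothetical gold labels
y_pred = ["Organization", "Other", "Other", "Organization"]  # hypothetical predictions

per_class_f1 = f1_score(y_true, y_pred, labels=["Organization", "Brand", "Other"], average=None)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
      </preformat>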
      <p>In Table 2, we showcase the performance of the LLMs in zero-shot and few-shot settings. LLaMa-2-7B,
Mistral-7B and LLaMa-2-7B-ES use only the text extracted from the image and its generated caption.
By providing only one example in the prompt, the performance of LLaMa-2-7B improves by 28.15%.
For Mistral-7B, there is a 10.49% improvement in Weighted-F1 score for target classification, while,
for LLaMa-2-7B-ES, the improvement is minimal between zero-shot and few-shot settings. However,
the model fine-tuned on Spanish instructions, LLaMa-2-7B-ES, obtained the best Weighted-F1 score of
64.01% in the few-shot setting and the second-best Weighted-F1 score of 62.31% in the zero-shot setting.</p>
      <p>Predicting the target of disinformation is easier, usually relying on specific cues, such as the presence
of organizations’ or brands’ logos or names appearing in the picture or written in text. However,
predicting the source of disinformation from multimodal content is a harder task, as in many instances,
no information about it appears, and the source is unknown. For source classification, the LLMs
sometimes only predict the Other class, failing to predict other categories. Using LLaMa-2-7B-ES
in the one-shot setting with the text from the image and its caption as input proved to be a suitable
approach for target classification, surpassing all other visual models, such as CLIP, MetaCLIP, OpenCLIP
and SigLIP. The limitations of general language models trained solely on English data are highlighted by
the best performance of LLaMa-2-7B-ES, which was adapted to Spanish data. This further emphasizes
the need to develop language-specialized LLMs.</p>
      <p>In Table 3, we show the results of LLaVA-v1.5-7B for zero-shot classification. LLaVA-v1.5-7B obtains
a better performance of 51.88% Weighted-F1 score for target classification, while LLaVA-v1.5-7B
(Q-Instruct) obtains a better performance for source classification (74.16% Weighted-F1 score). In zero-shot
settings, LLaVA-v1.5-7B outperforms the English-based language-only counterparts, LLaMa-2-7B and
Mistral-7B, for target classification, obtaining a Weighted-F1 score of 51.88%. However, it has a lower
performance than LLaMa-2-7B-ES. According to our experiments, while general LLMs pre-trained
on mostly English data can provide satisfactory results for identifying false content in our corporate
multimodal disinformation dataset, models specifically adapted for a particular language perform better.
This is because they can make use of the Spanish text present in the multimodal content, leading to
enhanced performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, our aim was to create a valuable resource for characterizing corporate multimodal
disinformation from digital media featuring both visual and textual elements in Spanish, annotated with
details about the source and target of the false content. By publishing our dataset, we aim to encourage
further research in this area and the development of more effective disinformation characterization
technologies. Our comprehensive experiments have assessed the efficacy of state-of-the-art multimodal
transformer models and LLMs in characterizing false content within images. Our findings reveal that
predicting the target of the false content is easier than predicting the source, as the latter requires
information that may not be easily represented in the multimodal data. In terms of zero-shot versus
few-shot settings, providing one example for each class improved the performance for target classification by
28.15% for LLaMa-2-7B and 10.49% for Mistral-7B in terms of Weighted-F1 score. LLaVA, the Multimodal
LLM that we tested, obtained a Weighted-F1 score of 51.88% in a zero-shot setting for target
classification. The best result for target classification, a 64.01% Weighted-F1 score, was obtained by
LLaMa-2-7B-ES in the one-shot setting, suggesting that LLMs specifically adapted for a particular language
are needed when processing non-English data.</p>
      <p>Our goal is to assist corporate entities in monitoring digital streams for fake news that could potentially
harm their reputations. In our future work, we intend to expand our dataset and develop methods for
identifying the specific brands and organizations targeted by false content. Moreover, we would like to
expand our analysis to recently-released LLMs, such as LLaMa-3, LLaVA-NeXT, GPT-4V [32], Gemini
Pro, and InstructBLIP [33].</p>
    </sec>
    <sec id="sec-7">
      <title>Limitations</title>
      <p>One of the limitations of the current study is the small and imbalanced number of samples in each
class from the collected dataset. Our approach relies on data that was already fact-checked, which
is challenging to obtain. Due to the insufficient number of samples in some categories, our models struggle to
accurately predict those classes. To address this limitation, our future work will focus on expanding the
dataset. Specifically, we will target the collection of more samples for underrepresented classes, such as
Brand for target classification and Corporate and Advertising for source classification.</p>
      <p>Another limitation is the use of 7B variants of LLMs and Multimodal LLMs in our experiments due
to computational limitations. Although LLaMa-2-7B-ES and LLaVA-v1.5-7B have shown promising results
of 64.01% and 51.88% Weighted-F1 for target classification, using bigger variants of the models could
lead to further improvements in the results [34].</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work of Paolo Rosso was carried out in the framework of FAKE news and HATE speech
(FAKEnHATEPdC) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR
(PDC2022-133118-I00), Iberian Digital Media Observatory (IBERIFIER Plus) funded by the EC
(DIGITAL2023-DEPLOY-04) under reference 101158511, and Malicious Actors Profiling and Detection in Online
Social Networks Through Artificial Intelligence (MARTINI) funded by MCIN/AEI/10.13039/501100011033
and by European Union NextGenerationEU/PRTR (PCI2022-135008-2).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ireton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Posetti</surname>
          </string-name>
          , Journalism, fake news &amp;
          <article-title>disinformation: handbook for journalism education and training</article-title>
          , Unesco Publishing,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Berthon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Treen</surname>
          </string-name>
          , L. Pitt,
          <article-title>How truthiness, fake news and post-fact endanger brands and what to do about it</article-title>
          ,
          <source>NIM Marketing Intelligence Review</source>
          <volume>10</volume>
          (
          <year>2018</year>
          )
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <article-title>Alt. health influencers: how wellness culture and web culture have been weaponised to promote conspiracy theories and far-right extremism during the covid-19 pandemic</article-title>
          ,
          <source>European Journal of Cultural Studies</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>M. De Veirman</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cauberghe</surname>
          </string-name>
          , L. Hudders,
          <article-title>Marketing through instagram influencers: the impact of number of followers and product divergence on brand attitude</article-title>
          ,
          <source>International journal of advertising 36</source>
          (
          <year>2017</year>
          )
          <fpage>798</fpage>
          -
          <lpage>828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Christov</surname>
          </string-name>
          , et al.,
          <article-title>Economic efects of the fake news on companies and the need of new pr strategies</article-title>
          ,
          <source>Journal of Sustainable Development</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <fpage>41</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <article-title>What's the damage?. measuring the impact of fake news on corporate reputation can act as a guide for companies to navigate a post-truth landscape, CommunicationDirector</article-title>
          .com (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peterson</surname>
          </string-name>
          ,
          <article-title>A high-speed world with fake news: brand managers take warning</article-title>
          ,
          <source>Journal of Product &amp; Brand Management</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>234</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Galston</surname>
          </string-name>
          ,
          <article-title>Is seeing still believing? the deepfake challenge to truth in politics</article-title>
          , Brookings
          <string-name>
            <surname>Institution</surname>
          </string-name>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gomes-Gonçalves</surname>
          </string-name>
          ,
          <article-title>Los deepfakes como una nueva forma de desinformación corporativa-una revisión de la literatura</article-title>
          ,
          <source>IROCAMM: International Review of Communication and Marketing Mix</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>22</fpage>
          -
          <lpage>38</lpage>
          . (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Westerlund</surname>
          </string-name>
          ,
          <article-title>The emergence of deepfake technology: A review, Technology innovation management review 9 (</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Babakar</surname>
          </string-name>
          , W. Moy,
          <source>The state of automated factchecking, Full Fact</source>
          <volume>28</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Studying fake news spreading, polarisation dynamics, and manipulation by bots: A tale of networks and language</article-title>
          , Computer science review
          <volume>47</volume>
          (
          <year>2023</year>
          )
          <fpage>100531</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          , H. Liu,
          <article-title>Toward a multilingual and multimodal data repository for covid-19 disinformation, in: IEEE Big Data</article-title>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>4325</fpage>
          -
          <lpage>4330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Zhai,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Jeon,
          <article-title>Towards multimodal disinformation detection by vision-language knowledge interaction</article-title>
          ,
          <source>Information Fusion</source>
          <volume>102</volume>
          (
          <year>2024</year>
          )
          <fpage>102037</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , SceneFND:
          <article-title>Multimodal fake news detection by modelling scene context information</article-title>
          ,
          <source>Journal of Information Science</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tufchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of multimodal fake news detection techniques: advances, challenges, and opportunities</article-title>
          ,
          <source>International Journal of Multimedia Information Retrieval</source>
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wilkes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Teramoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <article-title>Multimodal analysis of disinformation and misinformation</article-title>
          ,
          <source>Royal Society Open Science</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>230964</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Sabbatini, Multi-fake-detective at evalita 2023: Overview of the multimodal fake news detection and verification task</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Tan</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <article-title>Demystifying clip data</article-title>
          ,
          <source>in: Proceedings of ICLR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , L. Beyer,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <source>in: Proceedings of ICCV</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peskine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Grubisic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Definitions matter: Guiding gpt for multi-label classification</article-title>
          ,
          <source>in: Findings of ACL: EMNLP</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>4054</fpage>
          -
          <lpage>4063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , L. Schmidt, Openclip,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , et al.,
          <article-title>Laion-5b: An open large-scale dataset for training next generation image-text models</article-title>
          ,
          <source>in: Proceedings of NeurIPS</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <source>Mistral 7b, arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>in: Proceedings of ITIF Workshop</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , E. Zhang,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , et al.,
          <article-title>Qinstruct: Improving low-level visual abilities for multi-modality foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.06783</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] H. Choi, Y. Yoon, S. Yoon, K. Park, How does fake news use a thumbnail? clip-based multimodal detection on the unrepresentative news image, in: Proceedings of the CONSTRAINT Workshop, 2022, pp. 86-94.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Proceedings of NeurIPS, 2024.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] OpenAI, GPT-4V(ision) system card, preprint (2023).</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. arXiv:2305.06500.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] J. Lucas, A. Uchendu, M. Yamashita, J. Lee, S. Rohatgi, D. Lee, Fighting fire with fire: The dual role of llms in crafting and detecting elusive disinformation, in: Proceedings of EMNLP, 2023, pp. 14279-14305.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>