<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On the Categorization of Corporate Multimodal Disinformation with Large Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ana-Maria</forename><surname>Bucur</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Interdisciplinary School of Doctoral Studies</orgName>
								<orgName type="institution">University of Bucharest</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">PRHLT Research Center</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sónia</forename><surname>Gonçalves</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Universidad de Sevilla</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
							<email>prosso@dsic.upv.es</email>
							<affiliation key="aff1">
								<orgName type="department">PRHLT Research Center</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="department">ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On the Categorization of Corporate Multimodal Disinformation with Large Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">2BEB7EE672BAF4590C85E3DBBE928401</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Corporate Multimodal Disinformation</term>
					<term>Multimodal Large Language Models</term>
					<term>Spanish</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Disinformation is becoming more prevalent in the corporate sphere, especially as brands choose to promote their products through influencers or micro-celebrities who are perceived as reliable and impartial but may facilitate the spread of false information. The spread of disinformation can have a negative economic impact on companies and brands and can even affect their reputation. Artificial Intelligence can help detect false information and has become increasingly important in combating disinformation. The current work addresses the problem of characterizing multimodal disinformation targeting corporations and provides a collection of content that spreads disinformation in digital media. The content was manually annotated with information about the target (Organization, Brand, or Other) and the source (Corporate, Advertising, or Other) of the false content. We conduct comprehensive experiments to evaluate the effectiveness of state-of-the-art Unimodal and Multimodal Large Language Models in identifying the source and target of the content.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Related Work</head><p>According to <ref type="bibr" target="#b0">[1]</ref>, the concept of disinformation refers to a deliberate and organized attempt to confuse or manipulate people by providing dishonest information. Disinformation is gaining ground in the corporate sphere. It is orchestrated to persuade audiences and holds great appeal for advertisers, who promote its dissemination "because it fits more easily into people's prejudices" <ref type="bibr" target="#b1">[2]</ref>. The issue becomes even more dangerous when we consider that more and more brands choose to promote their products through influencers or micro-celebrities, a practice that can facilitate the spread of false information <ref type="bibr" target="#b2">[3]</ref>. These opinion leaders are perceived as highly reliable and impartial, which allows them to recommend products and services on various social media platforms and to generate word of mouth that brands leverage for commercialization <ref type="bibr" target="#b3">[4]</ref>.</p><p>The spread of disinformation poses a risk to companies and brands: it can cause a negative economic impact <ref type="bibr" target="#b4">[5]</ref> and even affect their reputation. Disinformation that impacts a company's reputation may stem from political, financial, emotional, or internal motivations, such as discontented employees <ref type="bibr" target="#b5">[6]</ref>. It is therefore important for organizations to maintain trusting relationships with the public. Organizations can fall victim to individuals who use advanced technologies to damage their reputation for twisted purposes <ref type="bibr" target="#b6">[7]</ref>, for instance through deepfakes, a new form of fake news that threatens companies, organizations, and brands <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>. Since the spread of disinformation can harm an organization's reputation, communication officers need to be aware of strategies to combat it, such as fact-checking, in order to protect the corporate image. Artificial Intelligence has enabled automated approaches capable of detecting false information <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>, also from a multimodal perspective <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>. Unlike general disinformation, which can target individuals, events, or broad societal issues, corporate disinformation often has direct financial implications and can damage trust in brands and organizations.</p><p>Recognizing the unique characteristics and potential impacts of such disinformation, our work aims to deepen the understanding of which actors are targeted by corporate disinformation and which sources spread it. By classifying the target of the false content, we can identify whether the affected entity is an organization or a brand. 
Furthermore, identifying the source will enable affected entities to take action and develop appropriate responses to counter the disinformation being spread about them.</p><p>As there are many previous works on multimodal fake content detection <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17]</ref>, we focus instead on characterizing content that has already been fact-checked and confirmed as false. To the best of our knowledge, this is the first time that the problem of multimodal disinformation targeting corporations has been addressed automatically. For this purpose, a collection of multimodal content in Spanish that has already been fact-checked was compiled and annotated by expert annotators with information about the target and source of the content (Figure <ref type="figure" target="#fig_0">1</ref>). Our dataset consists of 534 samples, annotated with the target (Organization, Brand, or Other) and the source (Corporate, Advertising, or Other) of the disinformation. The false content can target an Organization, such as a company, an institution, or an individual representing one. It can also target a Brand or a person associated with it. Alternatively, the disinformation can be classified as Other, meaning that it is not aimed at an organization or a brand but contains misleading information intended to deceive the general population. Furthermore, false content can originate from various sources. It may stem from a Corporate origin, where a corporate entity, rather than a single individual, is responsible for spreading the disinformation. Alternatively, it may take the form of persuasive Advertising, typically paid posts on social media. Lastly, false content may originate from Other sources, such as online users disseminating misleading information.</p><p>In this paper, we address the problem of characterizing multimodal disinformation targeting corporations. Our work makes the following contributions:</p><p>• A collection of multimodal false content (visual and textual information in Spanish) that spreads disinformation about corporations in digital media is compiled and annotated with information about the source and target of the false content; • Comprehensive experiments are conducted to evaluate the effectiveness of state-of-the-art Unimodal and Multimodal Large Language Models (LLMs) in characterizing false content.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data Collection</head><p>The dataset used in this work is obtained from the IBERIFIER repository<ref type="foot" target="#foot_0">1</ref>, which includes online content that has been fact-checked and verified<ref type="foot" target="#foot_1">2</ref>. IBERIFIER is a project that aims to fight disinformation in digital media in Spain and Portugal, in which data from various fact-checking websites is collected and analyzed. In our research, we specifically focus on false content in Spanish that was verified by EFE Verifica<ref type="foot" target="#foot_2">3</ref> and Maldita.es<ref type="foot" target="#foot_3">4</ref>, as these organizations contributed the most content to the IBERIFIER database. Our dataset consists solely of posts that were confirmed by these fact-checking entities to contain false information. This limits the dataset size, as obtaining fact-checked data is challenging. Our dataset contains 496 samples from Maldita.es and 38 samples from EFE Verifica, with multimodal data represented through both visual and textual information in Spanish. By deliberately focusing on posts that have been verified to contain disinformation, we can more effectively evaluate the performance of pre-trained visual transformer models and LLMs in characterizing deceptive information. This dataset allows us to study and understand how these models identify the different targets and sources of disinformation. The dataset is an essential resource for studying the effectiveness of LLMs in classifying false content from visual and textual cues found in images. For each of the collected images, we also retrieved information about the format of the content and the platform used to spread it using the IBERIFIER API. In Figure <ref type="figure" target="#fig_1">2</ref>, we present the various formats of false content. The most common format of false content is pictures, followed by screenshots from social media. Figure <ref type="figure" target="#fig_2">3</ref> shows the platforms used to spread the disinformation content. The data suggests that social media platforms like Twitter, Facebook, TikTok, and Instagram are the primary channels used to spread false content. However, we found that a considerable amount of false information is also shared through messaging apps like WhatsApp.</p><p>Two expert annotators labeled each instance of false content with information about the target and source. The target of the disinformation can be an Organization (either a company, an institution, or a person representing it), a Brand (or a person representing it), or Other, meaning that it is not targeted towards an organization or a brand and contains false information intended to mislead the general population about various topics, such as climate change, immigration, conspiracy theories, or local news. With regard to the different sources of false content (i.e., the origin of the content), the content can be of Corporate origin (usually, an entire corporate entity, rather than just an individual, is behind the spread of the disinformation), persuasive Advertising (usually paid posts on social media), or Other, which usually denotes false content spread by other users. The Other class also contains false content in which the identity of the spreader does not appear in or cannot be inferred from the image/text (see Figure <ref type="figure" target="#fig_0">1</ref>, 1st and 4th example). 
We obtained strong agreement between the two annotators (Cohen's 𝜅 = 0.90). The disagreements between them were resolved by a senior researcher in the field. The final dataset contains 347 samples targeting an organization, 87 targeting a brand, and 100 targeting other entities. Regarding the sources of the false content, the dataset comprises 52 Corporate, 4 Advertising, and 478 Other sources.</p><p>We showcase 4 examples from the collected data in Figure <ref type="figure" target="#fig_0">1</ref>. The dataset includes different types of disinformation found in digital media, which makes it difficult to identify the source and the target of the content. The first example shows an image with a figure representing the electoral results from the Chueca neighborhood of Madrid. However, the image is spreading disinformation because the results are actually from a municipality in Toledo with the same name. This is a classic example of how disinformation can be spread by manipulating images and providing false information. The source of the content was classified as Other because the origin of the information is unknown: it appears neither in the text nor in the image. On the other hand, the target is Organization because the disinformation publication affects one or more organizations, in this case political parties: the People's Party (PP) and the Spanish Socialist Workers' Party (PSOE).</p><p>The second example is a sponsored post from Facebook, asking individuals to complete a brief questionnaire for the chance to purchase a discounted vacuum cleaner. However, this image represents a classic phishing post in which individuals are persuaded to share their banking information with malicious entities. This example illustrates how social media platforms can be used to spread phishing scams that can deceive unsuspecting users. The source of the content was categorized as Advertising because the information originates from a clearly identified advertising publication (sponsored content), indicating that the advertising is conducted on a social network through payment. Conversely, the target is identified as Brand because the disinformation publication impacts brands, specifically Dyson and Lidl.</p><p>The third example is a screenshot from a website that claims to belong to Repsol S.A., an energy and petrochemical company from Spain. However, the website is not the real website of the company, and it is used for phishing. Malicious actors use the website to trick users into sharing their personal data. The content was categorized as Corporate because the web page appears to have been created by a corporate entity rather than an individual. On the other hand, the target is Brand, as it targets Repsol.</p><p>In the fourth example, we present a screenshot from social media that is not targeted towards a corporate entity or a brand, and it was labeled as Other, as it tries to mislead the general population. The source of the content was labeled as Other, with no information about the source provided in the text or image.</p></div>
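<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of how the reported inter-annotator agreement can be computed, the following Python snippet uses scikit-learn's cohen_kappa_score; the two label lists are hypothetical placeholders for illustration only, not part of the released dataset.</p><p><code>
# Minimal sketch: Cohen's kappa between two annotators with scikit-learn.
# The label lists below are hypothetical placeholders, not the actual annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Organization", "Brand", "Other", "Organization", "Organization"]
annotator_b = ["Organization", "Brand", "Other", "Organization", "Brand"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # the paper reports 0.90 on the full dataset
</code></p></div>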
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>We perform experiments in zero-shot and few-shot settings to evaluate the effectiveness of state-of-the-art visual transformer models and LLMs in characterizing false content within multimodal data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pre-trained Visual Transformer Models</head><p>Pre-trained visual transformer models, such as CLIP <ref type="bibr" target="#b18">[19]</ref>, have shown great performance on downstream tasks without additional training, obtaining results competitive with supervised baselines. CLIP was pre-trained in a self-supervised manner on a large collection of image-text pairs with a contrastive learning objective. The model was trained to maximize the similarity between matching image-text pairs and minimize the similarity between non-matching pairs. CLIP extracts embeddings by processing the image and the text through a visual and a textual encoder, respectively. The embeddings are then mapped to a shared space where similarities between image-text pairs can be computed. Pre-training allows CLIP to represent images and texts with similar content closer together in the embedding space, while unrelated image-text pairs are represented further apart. In this way, the model can compute the relationship between a given image and its corresponding textual description.</p><p>We explore the effectiveness of using CLIP and similar models <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> for zero-shot classification. To achieve this, we investigate how well the models can predict the target and the source of online disinformation. The zero-shot classification pipeline is presented in Figure <ref type="figure" target="#fig_3">4</ref>. The process involves passing the images and the texts (in our case, the names or descriptions of the categories) through frozen visual and textual encoder models. The similarity between the image and each category name/description is computed, and the category with the highest similarity score is selected as the final prediction. We conducted our experiments in two settings: by providing the class names as labels and by providing a short definition/description of the content we expect to find for each class. The two types of label names, short and long, are shown in Figure <ref type="figure" target="#fig_3">4</ref>. For target classification, we first experimented with short label names such as Organization, Brand, and Other. We also experimented with longer names, such as "a screenshot of false information targeting an organization (a company or an institution)", etc. Inspired by recent works highlighting the importance of the definitions of the concepts <ref type="bibr" target="#b21">[22]</ref>, we added more information to the text describing the categories. For the source classification, we followed a similar approach and experimented with both the short label names, such as Corporate, Advertising, and Other, and longer variants.</p><p>In our experiments, we tested the abilities of various pre-trained transformer models, namely CLIP <ref type="bibr" target="#b18">[19]</ref>, OpenCLIP <ref type="bibr" target="#b22">[23]</ref>, MetaCLIP <ref type="bibr" target="#b19">[20]</ref>, and SigLIP <ref type="bibr" target="#b20">[21]</ref>. CLIP and OpenCLIP <ref type="bibr" target="#b22">[23]</ref> have identical vision transformer architectures, but OpenCLIP was trained on the open-source dataset LAION-2B <ref type="bibr" target="#b23">[24]</ref>, whereas CLIP was trained on a private dataset of image-text pairs. MetaCLIP <ref type="bibr" target="#b19">[20]</ref> uses the same architecture and training regime as above, but the authors ensure that only high-quality image-text pairs are used for pre-training. 
SigLIP <ref type="bibr" target="#b20">[21]</ref> replaces the softmax-based contrastive loss of CLIP with a sigmoid loss. We experiment with different variants of the models (base, large, or huge), where available.</p></div>
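<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the zero-shot pipeline from Figure 4 is given below, using the Hugging Face transformers implementation of CLIP. The checkpoint name, image path, and label descriptions are illustrative assumptions and not necessarily the exact ones used in our experiments.</p><p><code>
# Sketch of zero-shot target classification with CLIP (assumed checkpoint and labels).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

target_labels = [
    "a screenshot of false information targeting an organization (a company or an institution)",
    "a screenshot of false information targeting a brand",
    "a screenshot of false information targeting the general population",
]

image = Image.open("sample_post.png")  # hypothetical path to a collected image
inputs = processor(text=target_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # image-text similarity scores
print(target_labels[probs.argmax(dim=-1).item()])  # label most similar to the image
</code></p></div>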
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Large Language Models</head><p>With the great success of leveraging LLMs in various vision and language tasks <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref>, we also test their abilities in characterizing multimodal disinformation shared in digital media. We experiment with two LLMs that have shown good results on language tasks, LLaMa-2 <ref type="bibr" target="#b26">[27]</ref> and Mistral <ref type="bibr" target="#b24">[25]</ref>. LLaMa is a competitive model, with good results across a suite of benchmarks related to commonsense reasoning, world knowledge, reading comprehension, etc. <ref type="bibr" target="#b26">[27]</ref>. Mistral is another LLM that surpasses LLaMa-2 on all the tested benchmarks <ref type="bibr" target="#b24">[25]</ref>. We chose these two models to evaluate their classification performance on our dataset based solely on the text found in the image and its caption. The text found in images is written in Spanish (as presented in Figure <ref type="figure" target="#fig_0">1</ref>) and was extracted using Pytesseract <ref type="foot" target="#foot_4">5</ref>. The caption of the image was generated using BLIP-2 <ref type="bibr" target="#b28">[29]</ref>. We conducted zero-shot and few-shot experiments using the aforementioned LLMs. Although these LLMs are pre-trained on data that is mostly in English, LLaMa, for example, was pre-trained on 1.3B Spanish tokens (0.13% of the total corpus). This amount of pre-training tokens makes it capable of processing Spanish content, although the results may not be as accurate as for English data <ref type="bibr" target="#b29">[30]</ref>. No information about the data used for pre-training the Mistral models is available <ref type="bibr" target="#b24">[25]</ref>.</p><p>Because the text from the multimodal false content is in Spanish, we also include in our experiments a version of LLaMa-2-7B fine-tuned on Spanish instructions<ref type="foot" target="#foot_5">6</ref>.</p></div>
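<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch illustrates how the text-only input for the LLM experiments can be assembled from the OCR text and the image caption; the prompt wording, file path, and placeholder caption are our own assumptions, and the actual prompts used in the experiments may differ.</p><p><code>
# Sketch (assumed prompt wording): building the text-only LLM input from OCR and a caption.
import pytesseract
from PIL import Image

image = Image.open("sample_post.png")                      # hypothetical path
ocr_text = pytesseract.image_to_string(image, lang="spa")  # Spanish OCR with Tesseract
caption = "a screenshot of a social media post"            # placeholder; BLIP-2 generates this in our setup

prompt = (
    "You are given the text extracted from an image and its caption.\n"
    f"Text: {ocr_text}\n"
    f"Caption: {caption}\n"
    "Classify the target of this false content as Organization, Brand, or Other. "
    "Answer with a single word."
)
# The prompt is then passed to LLaMa-2-7B or Mistral-7B (English prompts),
# or translated into Spanish for LLaMa-2-7B-ES.
</code></p></div>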
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Multimodal Large Language Models</head><p>In our work, we also conduct experiments using the Multimodal LLM LLaVa <ref type="bibr" target="#b30">[31]</ref>, which is a general-purpose visual and language model (Figure <ref type="figure">5</ref>). LLaVa uses a language model (in our case, LLaMa-2 <ref type="bibr" target="#b26">[27]</ref>) to process both the visual information from the image and the text of the language instructions. LLaVa uses a pre-trained CLIP vision transformer to process the visual input, which is then projected into the same embedding space as the text. The visual and text embeddings are then fed to LLaMa, which generates a suitable language response. In our experiments, we use LLaVA-v1.5 <ref type="bibr" target="#b25">[26]</ref> and LLaVA-v1.5 Q-Instruct <ref type="bibr" target="#b27">[28]</ref>. We chose LLaVA-v1.5 because it is an improved version of the original LLaVA and achieves state-of-the-art results on various benchmarks related to visual question answering. LLaVA-v1.5 Q-Instruct improves over this version by enhancing low-level visual perception abilities <ref type="bibr" target="#b27">[28]</ref>.</p></div>
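<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of zero-shot classification with LLaVA through the Hugging Face transformers integration is shown below; the checkpoint identifier, prompt template, and image path are assumptions for illustration, not a verbatim reproduction of our setup.</p><p><code>
# Sketch of zero-shot target classification with LLaVA-v1.5-7B (assumed checkpoint and prompt).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("sample_post.png")  # hypothetical path
prompt = ("USER: &lt;image&gt;\nIs the target of the false content in this image an "
          "Organization, a Brand, or Other? Answer with a single word.\nASSISTANT:")

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.7)
print(processor.decode(output_ids[0], skip_special_tokens=True))
</code></p></div>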
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>As part of our experiments, we tested the zero-shot and few-shot (one-shot) capabilities of various models. Our test set comprises 519 samples; the remaining 15 samples were set aside for potential use in the few-shot settings. We used the open-source implementations for all the models. Due to computational limitations, we only experimented with the 7B variants of the LLMs and Multimodal LLMs. While generating the output, we use the default temperature of 0.7. Additionally, we post-processed the generated output to remove any punctuation, quotation marks, or explanations generated by the models. The prompts for LLaMa-2-7B and Mistral-7B were written in English. For LLaMa-2-7B-ES, given that it is a model fine-tuned for the Spanish language, we use prompts written in Spanish.</p><p>Table <ref type="table">1</ref>: Zero-shot classification using visual transformer models. We report the Weighted F 1 -score and the F 1 -scores for each of the classes. The best results are shown in bold, and the second-best results are underlined. * denotes statistically significant differences between the best and second-best models using the McNemar-Bowker Test (p&lt;0.05).</p></div>
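<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the post-processing step, the sketch below shows one way to strip punctuation and quotation marks from a generation and map it onto one of the expected labels; the normalization logic and the fallback choice are our own assumptions, not a description of the exact rules used.</p><p><code>
# Sketch (our own normalization logic) of mapping a free-form generation to a class label.
import string

TARGET_LABELS = ["Organization", "Brand", "Other"]

def normalize_prediction(generated, labels=TARGET_LABELS):
    # strip whitespace, quotation marks, and punctuation, then lowercase
    cleaned = generated.strip().strip(string.punctuation).lower()
    for label in labels:
        if label.lower() in cleaned:
            return label
    return "Other"  # fallback when no known label is mentioned (assumption)

print(normalize_prediction('"Brand." The post targets Dyson.'))  # prints: Brand
</code></p></div>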
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We evaluate each model for the two tasks, target and source classification, by computing F 1 scores for each class. We also measure the performance on each task using the Weighted-F 1 score, given that the categories of our dataset are highly imbalanced. We present the results of the zero-shot classification using CLIP, MetaCLIP, OpenCLIP, and SigLIP in Table <ref type="table">1</ref>. For the majority of the models and variants, using longer descriptions of the class names improved the classification results. The best model for classifying the target of the false multimodal content was OpenCLIP ℎ𝑢𝑔𝑒 , obtaining a Weighted-F 1 score of 55.05%. Although SigLIP 𝑙𝑎𝑟𝑔𝑒 obtained an 86.18% Weighted-F 1 score for predicting the source of disinformation, it cannot make accurate predictions for all the categories.</p><p>In Table <ref type="table" target="#tab_1">2</ref>, we showcase the performance of the LLMs in zero-shot and few-shot settings. LLaMa-2-7B, Mistral-7B and LLaMa-2-7B-ES use only the text extracted from the image and its generated caption. By providing only one example in the prompt, the performance of LLaMa-2-7B improves by 28.15 percentage points. For Mistral-7B, there is a 10.49 percentage point improvement in Weighted-F 1 score for target classification, while, for LLaMa-2-7B-ES, the improvement between the zero-shot and few-shot settings is minimal. However, the model fine-tuned on Spanish instructions, LLaMa-2-7B-ES, obtained the best Weighted F 1 score of 64.01% in the few-shot setting and the second-best Weighted F 1 score of 62.31% in the zero-shot setting. Predicting the target of disinformation is easier, usually relying on specific cues, such as organizations' or brands' logos or names appearing in the picture or written in the text. However, predicting the source of disinformation from multimodal content is a harder task, as in many instances no information about the source appears in the content, leaving the source unknown. For source classification, the LLMs sometimes only predict the Other class, failing to predict the other categories. Using LLaMa-2-7B-ES in the one-shot setting, with the text from the image and its caption as input, proved to be a suitable approach for target classification, surpassing all the visual models (CLIP, MetaCLIP, OpenCLIP, and SigLIP). The best performance of LLaMa-2-7B-ES, which was adapted to Spanish data, highlights the limitations of general language models trained mostly on English data and further emphasizes the need to develop language-specialized LLMs.</p><p>In Table <ref type="table" target="#tab_2">3</ref>, we show the results of LLaVA-v1.5-7B for zero-shot classification. LLaVA-v1.5-7B obtains the better performance for target classification (51.88% Weighted-F 1 score), while LLaVA-v1.5-7B (Q-Instruct) obtains the better performance for source classification (68.72% Weighted-F 1 score). In zero-shot settings, LLaVA-v1.5-7B outperforms the English-based language-only counterparts, LLaMa-2-7B and Mistral-7B, for target classification, obtaining a Weighted-F 1 score of 51.88%. However, it has a lower performance than LLaMa-2-7B-ES. According to our experiments, while general LLMs pre-trained on mostly English data can provide satisfactory results for identifying false content in our corporate multimodal disinformation dataset, models specifically adapted for a particular language perform better. 
This is because they can make use of the Spanish text present in the multimodal content, leading to enhanced performance.</p></div>
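<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, per-class and Weighted-F 1 scores such as those reported in Tables 1-3 can be computed with scikit-learn as sketched below; the gold and predicted label lists are hypothetical placeholders, not results from our experiments.</p><p><code>
# Sketch of the evaluation metrics: per-class F1 and Weighted-F1 with scikit-learn.
from sklearn.metrics import f1_score

labels = ["Organization", "Brand", "Other"]
gold = ["Organization", "Brand", "Other", "Organization", "Other"]
pred = ["Organization", "Other", "Other", "Organization", "Brand"]

per_class = f1_score(gold, pred, labels=labels, average=None)       # one F1 per class
weighted = f1_score(gold, pred, labels=labels, average="weighted")  # support-weighted average
print(dict(zip(labels, per_class)), f"Weighted-F1: {weighted:.2%}")
</code></p></div>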
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, our aim was to create a valuable resource for characterizing corporate multimodal disinformation from digital media, featuring both visual and textual elements in Spanish, annotated with details about the source and target of the false content. By publishing our dataset, we aim to encourage further research in this area and the development of more effective disinformation characterization technologies. Our comprehensive experiments have assessed the efficacy of state-of-the-art multimodal transformer models and LLMs in characterizing false content within images. Our findings reveal that predicting the target of the false content is easier than predicting the source, as the latter requires information that may not be easily represented in the multimodal data. In terms of zero-shot versus few-shot settings, providing one example for each class improved the performance for target classification by 28.15 percentage points for LLaMa-2-7B and 10.49 percentage points for Mistral-7B in terms of Weighted-F 1 score. LLaVA, the Multimodal LLM that we tested, obtained a Weighted-F 1 score of 51.88% in the zero-shot setting for target classification. The best result for target classification, a 64.01% Weighted-F 1 score, was obtained by LLaMa-2-7B-ES in the one-shot setting, suggesting that LLMs specifically adapted for a particular language are needed when processing non-English data.</p><p>Our goal is to assist corporate entities in monitoring digital streams for fake news that could potentially harm their reputations. In our future work, we intend to expand our dataset and develop methods for identifying the specific brands and organizations targeted by false content. Moreover, we would like to expand our analysis to recently released LLMs, such as LLaMa-3, LLaVA-NeXT, GPT-4V <ref type="bibr" target="#b31">[32]</ref>, Gemini Pro, and InstructBLIP <ref type="bibr" target="#b32">[33]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations</head><p>One of the limitations of the current study is the small and imbalanced number of samples in each class of the collected dataset. Our approach relies on data that has already been fact-checked, which is challenging to obtain. Due to insufficient samples in some categories, our models struggle to predict those classes accurately. To address this limitation, our future work will focus on expanding the dataset. Specifically, we will target the collection of more samples for underrepresented classes, such as Brand for target classification and Corporate and Advertising for source classification.</p><p>Another limitation is the use of the 7B variants of the LLMs and Multimodal LLMs in our experiments, due to computational limitations. Even though LLaMa-2-7B-ES and LLaVA-v1.5-7B have shown promising results of 64.01% and 51.88% Weighted-F 1 for target classification, using larger variants of the models could lead to further improvements in the results <ref type="bibr" target="#b33">[34]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Selected examples of false content. The data is diverse, containing screenshots from social media, websites, etc. Translated text, first image: "Results for Chueca". Translated text, second image: "Get Dyson V11 for only 1.95 euros. Fill in the short questionnaire and respond to the three questions...". Translated text, third image: "Congratulations! Repsol 35th anniversary government subsidy! Through the questionnaire, you will have the opportunity to obtain 1000 euros.". Translated text, fourth image: "Bad news for the climate fanatics: with 661 gigatons of extra mass, Antarctica continues to expand...".</figDesc><graphic coords="2,72.00,65.61,451.26,160.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The format of the false content found in the collected data: pictures, screenshots from social media platforms, from different websites, or news articles.</figDesc><graphic coords="3,97.46,292.67,162.77,190.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Platforms used to spread the false content. Most of the content was shared on social media platforms and WhatsApp.</figDesc><graphic coords="3,325.12,291.55,182.62,191.36" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Zero-Shot Classification pipeline for state-of-the-art visual transformer models: CLIP, Open-CLIP, MetaCLIP, SigLIP. Images and class names/descriptions are passed through frozen encoder models, and the final prediction is represented by the text that is most similar to a given image.</figDesc><graphic coords="5,105.84,65.61,383.58,145.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Zero-Shot Classification pipeline with LLaVA. LLaVa uses a language model (in our case, LLaMa) to process both visual information and language instructions and generate an appropriate response. LLaVa leverages a pre-trained CLIP model to encode visual information from images. These embeddings are then projected into the same word embedding space and fed into LLaMa. Finally, LLaMa generates a suitable language response.</figDesc><graphic coords="6,105.84,65.61,383.60,217.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Zero-shot and one-shot classification using LLMs. * LLaMa-2-7B-ES (one-shot) obtains statistically significant improvement over the best English counterpart Mistral-7B (one-shot) in Target prediction (McNemar-Bowker Test, p&lt;0.05).</figDesc><table><row><cell></cell><cell cols="4">Target</cell><cell cols="4">Source</cell></row><row><cell>Model</cell><cell>Weighted-F 1</cell><cell>Brand</cell><cell>Org.</cell><cell>Other</cell><cell>Weighted-F 1</cell><cell>Adv.</cell><cell>Corp.</cell><cell>Other</cell></row><row><cell>LLaMa-2-7B (zero-shot)</cell><cell>14.33</cell><cell>0.00</cell><cell>12.90</cell><cell>31.85</cell><cell>80.71</cell><cell>0.00</cell><cell>0.00</cell><cell>88.94</cell></row><row><cell>LLaMa-2-7B (one-shot)</cell><cell>42.48</cell><cell>22.43</cell><cell>50.47</cell><cell>31.00</cell><cell>72.66</cell><cell>2.65</cell><cell>0.00</cell><cell>80.05</cell></row><row><cell>Mistral-7B (zero-shot)</cell><cell>49.89</cell><cell>23.53</cell><cell>59.51</cell><cell>38.04</cell><cell>86.98</cell><cell>0.00</cell><cell>4.26</cell><cell>95.43</cell></row><row><cell>Mistral-7B (one-shot)</cell><cell>60.38</cell><cell>32.00</cell><cell>74.89</cell><cell>32.62</cell><cell>86.35</cell><cell>0.00</cell><cell>0.00</cell><cell>95.15</cell></row><row><cell>LLaMa-2-7B-ES (zero-shot)</cell><cell>62.31</cell><cell>19.23</cell><cell>76.07</cell><cell>50.00</cell><cell>81.81</cell><cell>2.38</cell><cell>41.24</cell><cell>86.11</cell></row><row><cell>LLaMa-2-7B-ES (one-shot)</cell><cell>64.01*</cell><cell>24.56</cell><cell>76.41</cell><cell>53.42</cell><cell>78.67</cell><cell>2.96</cell><cell>41.03</cell><cell>82.67</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Zero-shot classification using LLaVA. * denotes statistically significant differences between best and second-best models using the McNemar-Bowker Test (p&lt;0.05).</figDesc><table><row><cell></cell><cell cols="4">Target</cell><cell cols="4">Source</cell></row><row><cell>Model</cell><cell>Weighted-F 1</cell><cell>Brand</cell><cell>Org.</cell><cell>Other</cell><cell>Weighted-F 1</cell><cell>Adv.</cell><cell>Corp.</cell><cell>Other</cell></row><row><cell>LLaVA-v1.5-7B</cell><cell>51.88*</cell><cell>21.37</cell><cell>65.85</cell><cell>27.89</cell><cell>61.68</cell><cell>1.89</cell><cell>8.60</cell><cell>67.12</cell></row><row><cell>LLaVA-v1.5-7B (Q-Instruct)</cell><cell>49.68</cell><cell>24.84</cell><cell>60.20</cell><cell>33.22</cell><cell>68.72*</cell><cell>2.65</cell><cell>15.93</cell><cell>74.16</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://iberifier.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://iberifier.eu/factchecks/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://verifica.efe.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://maldita.es/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://github.com/madmaze/pytesseract</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">clibrain/Llama-2-7b-ft-instruct-es</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The work of Paolo Rosso was in the framework of FAKE news and HATE speech (FAKEnHATE-PdC) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR (PDC2022-133118-I00), Iberian Digital Media Observatory (IBERIFIER Plus) funded by the EC (DIGITAL-2023-DEPLOY-04) under reference 101158511, and Malicious Actors Profiling and Detection in Online Social Networks Through Artificial Intelligence (MARTINI) funded by MCIN/AEI/ 10.13039/501100011033 and by European Union NextGenerationEU/PRTR (PCI2022-135008-2).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Ireton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Posetti</surname></persName>
		</author>
		<title level="m">Journalism, fake news &amp; disinformation: handbook for journalism education and training</title>
				<imprint>
			<publisher>Unesco Publishing</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">How truthiness, fake news and post-fact endanger brands and what to do about it</title>
		<author>
			<persName><forename type="first">P</forename><surname>Berthon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Treen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pitt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NIM Marketing Intelligence Review</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="18" to="23" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Alt. health influencers: how wellness culture and web culture have been weaponised to promote conspiracy theories and far-right extremism during the covid-19 pandemic</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Baker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">European Journal of Cultural Studies</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="3" to="24" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Marketing through instagram influencers: the impact of number of followers and product divergence on brand attitude</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">De</forename><surname>Veirman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cauberghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hudders</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of advertising</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="798" to="828" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Economic effects of the fake news on companies and the need of new pr strategies</title>
		<author>
			<persName><forename type="first">A</forename><surname>Christov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Sustainable Development</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="41" to="49" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">act as a guide for companies to navigate a post-truth landscape</title>
		<author>
			<persName><forename type="first">A</forename><surname>Reid</surname></persName>
		</author>
		<ptr target="com" />
	</analytic>
	<monogr>
		<title level="m">What&apos;s the damage?. measuring the impact of fake news on corporate reputation can</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A high-speed world with fake news: brand managers take warning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Peterson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Product &amp; Brand Management</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="234" to="245" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Is seeing still believing? the deepfake challenge to truth in politics</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">A</forename><surname>Galston</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>Brookings Institution</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Los deepfakes como una nueva forma de desinformación corporativa-una revisión de la literatura</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gomes-Gonçalves</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IROCAMM: International Review of Communication and Marketing Mix</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="22" to="38" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The emergence of deepfake technology: A review</title>
		<author>
			<persName><forename type="first">M</forename><surname>Westerlund</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technology innovation management review</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The state of automated factchecking</title>
		<author>
			<persName><forename type="first">M</forename><surname>Babakar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Moy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Full Fact</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Studying fake news spreading, polarisation dynamics, and manipulation by bots: A tale of networks and language</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ruffo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Semeraro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giachanou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer science review</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page">100531</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Toward a multilingual and multimodal data repository for covid-19 disinformation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Big Data, IEEE</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4325" to="4330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Towards multimodal disinformation detection by vision-language knowledge interaction</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jeon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">102</biblScope>
			<biblScope unit="page">102037</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Scenefnd: Multimodal fake news detection by modelling scene context information</title>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giachanou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Science</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A comprehensive survey of multimodal fake news detection techniques: advances, challenges, and opportunities</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tufchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ahmed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Multimedia Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">28</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Multimodal analysis of disinformation and misinformation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wilkes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Teramoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hale</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Royal Society Open Science</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">230964</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Multi-fake-detective at evalita 2023: Overview of the multimodal fake news detection and verification task</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dell'oglio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Marcelloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sabbatini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Demystifying clip data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Howes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Feichtenhofer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICLR</title>
				<meeting>ICLR</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Sigmoid loss for language image pre-training</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICCV</title>
				<meeting>ICCV</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Definitions matter: Guiding gpt for multi-label classification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Peskine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Korenčić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Grubisic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of ACL: EMNLP 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="4054" to="4063" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Ilharco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wightman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Carlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Shankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Namkoong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<title level="m">Openclip</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Laion-5b: An open large-scale dataset for training next generation image-text models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Schuhmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Beaumont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vencu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wightman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cherti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Coombes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Katta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mullis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NeurIPS</title>
				<meeting>NeurIPS</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="25278" to="25294" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
		<title level="m">Mistral 7b</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Improved baselines with visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ITIF Workshop</title>
				<meeting>ITIF Workshop</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhai</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.06783</idno>
		<title level="m">Qinstruct: Improving low-level visual abilities for multi-modality foundation models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">How does fake news use a thumbnail? clip-based multimodal detection on the unrepresentative news image</title>
		<author>
			<persName><forename type="first">H</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Park</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the CONSTRAINT Workshop</title>
				<meeting>the CONSTRAINT Workshop</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="86" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NeurIPS</title>
				<meeting>NeurIPS</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<title level="m">Gpt-4v(ision) system card</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M H</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06500</idno>
		<title level="m">Instructblip: Towards general-purpose vision-language models with instruction tuning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Fighting fire with fire: The dual role of llms in crafting and detecting elusive disinformation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Uchendu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yamashita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rohatgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP</title>
				<meeting>EMNLP</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="14279" to="14305" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
