<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using LLMs to explain AI-generated art classification via Grad-CAM heatmaps</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giovanna</forename><surname>Castellano</surname></persName>
							<email>giovanna.castellano@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria</forename><forename type="middle">Grazia</forename><surname>Miccoli</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raffaele</forename><surname>Scaringi</surname></persName>
							<email>raffaele.scaringi@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gennaro</forename><surname>Vessio</surname></persName>
							<email>gennaro.vessio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gianluca</forename><surname>Zaza</surname></persName>
							<email>gianluca.zaza@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using LLMs to explain AI-generated art classification via Grad-CAM heatmaps</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">435E879CC19C79F1847BFFBAEA4E7741</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
<term>Explainable AI</term>
					<term>Large Language Models</term>
					<term>Grad-CAM</term>
					<term>AI-generated art</term>
					<term>Artwork classification</term>
					<term>ORCID 0000-0002-6489-8628 (G. Castellano)</term>
					<term>ORCID 0000-0001-7512-7661 (R. Scaringi)</term>
					<term>ORCID 0000-0002-0883-2691 (G. Vessio)</term>
					<term>ORCID 0000-0003-3272-9739 (G. Zaza)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The proliferation of AI-generated media, especially in art, has sparked interest in creating models that differentiate between original and AI-generated artworks. However, understanding why these models make certain decisions remains a significant challenge. This paper enhances the explainability of Vision Transformer-based classification models by using Grad-CAM to generate visual explanations of the model's focus areas, combined with Large Language Models (LLMs) to provide natural language descriptions. We evaluate three cutting-edge LLMs, LLaVa-NeXt, InstructBLIP, and KOSMOS-2, by using them to generate textual explanations for Grad-CAM visualizations applied to artwork classification. Through quantitative and qualitative analyses, we find that while InstructBLIP and KOSMOS-2 achieve higher similarity scores between generated descriptions and visual content, LLaVa-NeXt provides more insightful and coherent explanations, particularly for AI-generated art. This study demonstrates the potential of LLMs to improve the interpretability of AI decisions in complex image classification tasks, helping to bridge the gap between model decisions and human understanding in art classification.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Artificial Intelligence (AI) has achieved remarkable advancements in today's digital age, particularly in creating synthetic media. Generative models, such as GANs (Generative Adversarial Networks) and diffusion models, can produce highly realistic images, videos, and artworks, making it increasingly difficult to distinguish between AI-generated and human-created content. This growing challenge is especially critical in the domain of art, where concepts of creativity, authorship, and authenticity are deeply rooted in human expression. The ability to accurately classify and explain AI-generated versus original artworks is therefore essential for preserving the integrity of human creativity and safeguarding intellectual property.</p><p>A key concern related to AI-generated art is its potential to blur the boundaries between real and synthetic content, raising questions about originality and ownership. AI-generated media can disrupt traditional notions of being an artist as machines begin to emulate complex artistic styles and compositions with unprecedented fidelity. Beyond artistic expression, the rise of manipulated media, such as deepfakes, has further complicated the landscape by enabling the creation of highly realistic yet artificial videos and images <ref type="bibr" target="#b1">[2]</ref>. These technologies, often indistinguishable to the human eye, pose ethical and legal challenges, particularly in cases of media manipulation and copyright infringement <ref type="bibr" target="#b2">[3]</ref>.</p><p>In response to these challenges, deep learning models have been developed to automatically classify artworks as either original (human-created) or AI-generated. These models typically leverage sophisticated neural architectures, including Convolutional Neural Networks (CNNs) and Transformer-based models, to perform classification tasks with high accuracy. 
For example, in a previous work <ref type="bibr" target="#b3">[4]</ref>, we demonstrated the effectiveness of deep learning models, such as Vision Transformers (ViTs) <ref type="bibr" target="#b4">[5]</ref>, in distinguishing between human-created and AI-generated art. However, while these models achieve impressive performance, their decision-making processes often remain opaque, limiting user trust and understanding.</p><p>Explainable AI (XAI) methods, such as Gradient-weighted Class Activation Mapping (Grad-CAM) <ref type="bibr" target="#b5">[6]</ref>, have been developed to address this opacity. Grad-CAM generates visual heatmaps that highlight the regions of an image that most influence a model's decision, offering a degree of interpretability by showing users where the model "looked" to make its classification. However, while Grad-CAM provides valuable visual insights, it may not be sufficient for non-expert users who need a more explicit explanation of why certain regions were emphasized in the decision-making process.</p><p>To bridge this gap, recent advancements in natural language processing (NLP), mainly through Large Language Models (LLMs), offer an opportunity to enhance the interpretability of these visual explanations. LLMs are designed to generate natural language explanations describing complex visual information. By integrating LLMs with Grad-CAM outputs, it is possible to generate textual descriptions that explain why certain areas of an artwork were highlighted during classification, improving the transparency and interpretability of AI models.</p><p>This paper proposes a framework combining Grad-CAM visualizations with advanced LLMs to enhance the explainability of deep learning models in artwork classification. 
Specifically, we evaluate the performance of three cutting-edge LLMs, LLaVa-NeXt <ref type="bibr" target="#b6">[7]</ref>, InstructBLIP <ref type="bibr" target="#b7">[8]</ref>, and KOSMOS-2 <ref type="bibr" target="#b8">[9]</ref>, in generating coherent and insightful natural language explanations for classifying original versus AI-generated artworks. Through quantitative and qualitative analyses, we assess how well these models can generate meaningful explanations that align with the visual heatmaps produced by Grad-CAM.</p><p>The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 details our methodology, including the integration of Grad-CAM with LLMs for the explainability of artwork classification models. Section 4 presents the experimental results. Finally, Section 5 concludes with a summary of the results and future directions for research in this area.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>This section reviews significant research contributions in three key areas: AI-generated image classification, explainability in AI models using Grad-CAM, and leveraging LLMs to generate textual explanations for improving model interpretability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">AI-generated image classification</head><p>Differentiating AI-generated images from human-created ones has garnered increasing attention in recent times. Martin-Rodriguez et al. <ref type="bibr" target="#b9">[10]</ref> proposed methods based on pixel-level feature extraction, such as Photo Response Non-Uniformity (PRNU) and Error Level Analysis (ELA), to train CNNs for the classification of AI-generated images versus real photographs. Similarly, Epstein et al. <ref type="bibr" target="#b10">[11]</ref> explored real-time detection of AI-generated images using advanced neural network architectures, emphasizing the importance of fast and accurate identification in fighting synthetic content.</p><p>In the domain of art, few works have tackled this classification task. Ha et al. <ref type="bibr" target="#b11">[12]</ref> analyzed the distinguishing features between AI-generated and human-created artworks using deep learning models. They demonstrated how neural networks could differentiate these categories based on creative elements like style and composition. More recently, we evaluated the performance of deep learning models, including VGG-19 <ref type="bibr" target="#b12">[13]</ref>, ResNet-50 <ref type="bibr" target="#b13">[14]</ref>, and Vision Transformers <ref type="bibr" target="#b4">[5]</ref>, in classifying AI-generated artworks <ref type="bibr" target="#b3">[4]</ref>, reporting classification accuracies of up to 97%. In this study, we build on our previous ViT model, which achieved the highest performance, while enhancing the interpretability of the classification process using explainability techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Explainability of AI models using Grad-CAM</head><p>The need for interpretability in AI models has led to the adoption of XAI techniques, with Grad-CAM <ref type="bibr" target="#b5">[6]</ref> being one of the most popular methods. Grad-CAM generates heatmaps highlighting image regions that influence a model's decision-making process. These visualizations are invaluable for improving the transparency of AI systems.</p><p>In the context of artwork classification, we applied Grad-CAM to visualize the decision-making processes of deep learning models, demonstrating how these heatmaps could enhance model interpretability <ref type="bibr" target="#b3">[4]</ref>. However, while these heatmaps provide insight into where the model is "looking", they may still be challenging for non-expert users to interpret without additional guidance. Distinguishing between AI-generated and human-created art requires a nuanced understanding. Improving models' interpretability is critical in domains where creative ownership and authenticity are essential, making this a viable use case for the study of explainability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Leveraging LLMs for explainability</head><p>Recent advances in NLP have led to the development of LLMs capable of generating coherent and detailed textual descriptions based on visual inputs. Yang et al. <ref type="bibr" target="#b14">[15]</ref> explored using LLMs to detect sophisticated image tampering, demonstrating that advanced models could accurately identify subtle manipulations in AI-generated content. However, they also noted that current LLMs struggle with highly realistic AI-generated images, underscoring the need for further improvements. In response to these challenges, Samesh et al. <ref type="bibr" target="#b15">[16]</ref> combined CNNs with multimodal fusion techniques to detect advanced deepfakes, integrating LLMs to enhance accuracy. Their findings highlight the potential of LLMs to offer detailed explanations that improve model transparency.</p><p>For our study, we selected three advanced LLMs to generate explanations for AI-driven artwork classification: LLaVa-NeXt <ref type="bibr" target="#b6">[7]</ref>, InstructBLIP <ref type="bibr" target="#b7">[8]</ref>, and KOSMOS-2 <ref type="bibr" target="#b8">[9]</ref>. These models, designed to process visual and textual inputs, have shown promise in generating human-understandable explanations of complex AI decisions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>This section outlines the proposed framework, illustrated in Fig. <ref type="figure" target="#fig_0">1</ref>, which integrates ViTs with Grad-CAM and LLMs to enhance the interpretability of AI-generated artwork classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Proposed framework</head><p>The proposed framework aims to provide human-understandable explanations of how a deep learning model classifies artworks as either original or AI-generated. The pipeline consists of three key stages:</p><p>• A ViT model <ref type="bibr" target="#b4">[5]</ref> is first trained to classify artworks. The input consists of RGB images of both original and AI-generated artworks. We selected this model because it achieves 97% accuracy on the same dataset used in this study for this classification task. • During the inference stage, the Grad-CAM technique <ref type="bibr" target="#b5">[6]</ref> is applied to generate heatmaps that highlight regions of the image that most strongly influenced the ViT model's classification decision. These heatmaps provide visual explanations of the areas of the artwork that were most important for the model's decision-making process. • To generate natural language explanations of the Grad-CAM heatmaps, three advanced LLMs, LLaVa-NeXt <ref type="bibr" target="#b6">[7]</ref>, InstructBLIP <ref type="bibr" target="#b7">[8]</ref>, and KOSMOS-2 <ref type="bibr" target="#b8">[9]</ref>, are integrated into the pipeline.</p><p>Each LLM receives the Grad-CAM overlay image and a specially designed prompt to generate a description explaining the model's focus on some areas of the artwork.</p><p>This combined approach offers visual and textual explanations, ensuring that non-expert users can better understand the classification decisions made by the AI model.</p></div>
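The three stages above can be sketched schematically. This is a minimal illustration, not the authors' code: the function bodies are stub placeholders standing in for the trained ViT-B/16 classifier, the real Grad-CAM computation, and an actual LLM call.

```python
import numpy as np

def classify_with_vit(image: np.ndarray) -> str:
    """Stage 1 (stub): the trained ViT would return 'original' or 'AI-generated'."""
    # Placeholder decision rule standing in for the real classifier.
    return "AI-generated" if image.mean() > 0.5 else "original"

def grad_cam_heatmap(image: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): Grad-CAM would weight patch tokens by class-score gradients."""
    heat = np.abs(image - image.mean())      # placeholder saliency
    return heat / (heat.max() + 1e-8)        # normalize to [0, 1]

def explain_with_llm(overlay: np.ndarray, label: str) -> str:
    """Stage 3 (stub): an LLM (e.g., LLaVa-NeXt) would describe the overlay."""
    prompt = (f"This artwork, classified as '{label}', is shown with a Grad-CAM "
              f"heatmap overlay. Explain why the highlighted regions mattered.")
    # A real pipeline would send this prompt together with `overlay` to the LLM.
    return prompt

image = np.random.default_rng(0).random((224, 224))   # stand-in for an artwork
label = classify_with_vit(image)
heat = grad_cam_heatmap(image)
overlay = 0.5 * image + 0.5 * heat                    # simple alpha blend
explanation = explain_with_llm(overlay, label)
```

The point of the sketch is the data flow: the classifier's label and the Grad-CAM overlay are both inputs to the prompt that the LLM receives.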
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Implementation details</head><p>The ViT model used in this study is based on the work of Dosovitskiy et al. <ref type="bibr" target="#b4">[5]</ref>. It processes the artwork as a sequence of image patches and uses multi-headed self-attention to capture global information. We employ ViT to classify images into two categories, namely original or AI-generated. Specifically, we used ViT-B/16, pre-trained on the ImageNet dataset <ref type="bibr" target="#b16">[17]</ref>, which takes as input RGB images with a resolution of 224 × 224 pixels. We then fine-tuned the last layer for 30 epochs, training the model to recognize whether the input artwork is original or AI-generated. During training, we optimized a binary cross-entropy loss using the Adam optimizer, with an initial learning rate of 10⁻³ and a step learning-rate schedule every seven epochs with a decay factor 𝛾 = 0.1. Furthermore, an early stopping mechanism was employed, halting training after three epochs with no decrease in validation loss.</p><p>Once the ViT model is trained, Grad-CAM <ref type="bibr" target="#b5">[6]</ref> is applied to visualize which parts of the input image influence the model's decision. Specifically, the gradients of the output class score are computed with respect to the token embeddings (representing image patches) from the final layers. These gradients are used to weigh the importance of each token, and a heatmap is generated that highlights the most influential image regions for the classification.</p><p>Lastly, we overlay the input image with the Grad-CAM heatmap. We employed three advanced LLMs to generate textual explanations based on this overlay, each with a tailored prompt iteratively refined through multiple attempts to improve the clarity and relevance of the generated explanations. 
Specifically, the prompt used in this work was developed following the main guidelines of prompt engineering to ensure effective communication with the language model. In particular, five basic rules were applied:</p><p>1. Clarity of input: The model was provided with a description of the input type, specifying that it was an artwork with a superimposed heatmap. 2. Context: The context in which the input was located was provided, explaining that the heatmap indicated the areas of the artwork that the classifier considered most relevant to its decision. 3. Objective of the output: The type of output required was explicitly stated, namely, an analysis of the possible causes that led the classifier to identify those specific areas. 4. Output format: Clear instructions were given on the expected format of the output. 5. Example of output: An example was also provided to guide the model in generating the desired response.</p><p>The three LLMs employed are:</p><p>• LLaVa-NeXt <ref type="bibr" target="#b6">[7]</ref>, an evolution of LLaVA-1.5, supports image resolutions up to 672 × 672 pixels and enhances visual reasoning and OCR capabilities. We employed the quantized version of this model.</p><p>• InstructBLIP <ref type="bibr" target="#b7">[8]</ref>, built on the BLIP-2 architecture, specializes in zero-shot vision-language tasks. Key hyperparameters include num_beams=5 (for beam search decoding), max_new_tokens=250, min_length=1, top_p=0.9 (to maintain diversity), and temperature=1 (to balance randomness). These parameters ensure coherent and concise descriptions are generated efficiently. • KOSMOS-2 <ref type="bibr" target="#b8">[9]</ref> is a multimodal Transformer-based model designed for visual-textual grounding.</p><p>Key hyperparameters, such as max_new_tokens=1024 and attention masks, control the generation process, ensuring the output is aligned with the visual input.</p></div>
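The token-level Grad-CAM computation described in this section can be sketched with NumPy. The random arrays below stand in for the real final-layer token embeddings and their gradients; the 14 × 14 patch grid follows from ViT-B/16's 16-pixel patches on a 224 × 224 input.

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, dim = 14 * 14, 768   # 224 / 16 = 14 patches per side; 768-dim tokens

# Stand-ins for the final-layer patch tokens and the gradients of the
# output class score with respect to those tokens.
activations = rng.standard_normal((n_tokens, dim))
gradients = rng.standard_normal((n_tokens, dim))

# Weight each token by its gradient, keep only positive evidence (ReLU),
# and reshape the per-token scores into the patch grid to form the heatmap.
token_scores = np.maximum((activations * gradients).sum(axis=1), 0.0)
heatmap = token_scores.reshape(14, 14)
heatmap = heatmap / (heatmap.max() + 1e-8)   # normalize to [0, 1]
```

In practice the 14 × 14 heatmap is upsampled to the input resolution and alpha-blended over the artwork to produce the overlay fed to the LLMs.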
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation</head><p>We employ quantitative and qualitative analyses to evaluate the quality of the generated explanations.</p><p>The quantitative metrics measure aspects that may not always capture human understandability or insightfulness. The qualitative evaluations, instead, may provide deeper insights into the coherence and relevance of explanations, which the metrics might miss. We measure the alignment between the generated descriptions and the visual content using two metrics:</p><p>• Image-to-text similarity: Using CLIP <ref type="bibr" target="#b17">[18]</ref>, we compute the cosine similarity between the image (with Grad-CAM overlay) and the generated textual description. A higher score indicates that the text better reflects the image content. • Text-to-label similarity: We assess the consistency between the generated text and the classification label (original or AI-generated) using the S-BERT model <ref type="bibr" target="#b18">[19]</ref> by computing the cosine similarity between the embedding of the generated explanation and that of the target-label description.</p><p>In the qualitative analysis, we manually examine the coherence, relevance, and insightfulness of the explanations generated by each LLM. This involves comparing the descriptions with the Grad-CAM heatmaps and assessing whether the explanations provide meaningful insights into the model's decision-making process.</p></div>
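Both metrics reduce to a cosine similarity between embedding vectors. A minimal sketch follows, with random vectors standing in for the actual CLIP and S-BERT embeddings (the 512 and 384 dimensions are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
image_emb = rng.standard_normal(512)   # stand-in: CLIP embedding of the overlay image
text_emb = rng.standard_normal(512)    # stand-in: CLIP embedding of the description
expl_emb = rng.standard_normal(384)    # stand-in: S-BERT embedding of the explanation
label_emb = rng.standard_normal(384)   # stand-in: S-BERT embedding of the label text

image_to_text = cosine_similarity(image_emb, text_emb)
text_to_label = cosine_similarity(expl_emb, label_emb)
total_similarity = image_to_text + text_to_label
```

The two scores live in comparable [-1, 1] ranges, which is what makes summing them into a single total plausible.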
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>Table 1. Image-to-text similarity, text-to-label similarity, and total similarity for the three LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Image-to-text Text-to-label Total similarity</head><p>LLaVa-NeXt 0.22 0.14 0.36</p><p>InstructBLIP 0.26 0.14 0.40</p><p>KOSMOS-2 0.26 0.14 0.40</p><p>Table <ref type="table">1</ref> presents the results of this analysis, showing the similarity metrics for each LLM. The results show that InstructBLIP and KOSMOS-2 achieve the highest overall similarity scores (0.40), with LLaVa-NeXt scoring slightly lower (0.36). All models perform similarly in terms of text-to-label similarity (0.14). Regarding image-to-text similarity, InstructBLIP and KOSMOS-2 score 0.26, while LLaVa-NeXt reaches 0.22.</p></div>
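The total similarity in the table is simply the sum of the two component scores, which can be checked directly against the reported values:

```python
# Component scores as reported in Table 1.
scores = {
    "LLaVa-NeXt":   {"image_to_text": 0.22, "text_to_label": 0.14},
    "InstructBLIP": {"image_to_text": 0.26, "text_to_label": 0.14},
    "KOSMOS-2":     {"image_to_text": 0.26, "text_to_label": 0.14},
}

# Total similarity = image-to-text + text-to-label, rounded to two decimals.
totals = {model: round(s["image_to_text"] + s["text_to_label"], 2)
          for model, s in scores.items()}
# totals reproduces the table's final column: 0.36, 0.40, 0.40
```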
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Qualitative analysis</head><p>In addition to the quantitative analysis, we conducted a qualitative evaluation using four sample images, two AI-generated and two original artworks (Fig. <ref type="figure" target="#fig_1">2</ref>). All images were correctly classified by ViT. The goal was to assess the ability of each LLM to generate insightful and relevant explanations for the classification of these images.</p><p>For AI-generated artworks, LLaVa-NeXt accurately identifies the relevant areas of interest and provides plausible explanations for the network's focus. However, for original artworks, it occasionally struggles to account for the model's final classification, leading to less coherent explanations. InstructBLIP identifies regions of interest but often misinterprets the heatmap as a thermal image, particularly in cases of AI-generated artworks. This results in inaccurate descriptions that fail to explain the classification decision meaningfully. KOSMOS-2 performs relatively well in identifying relevant areas for both AI-generated and original artworks, but its explanations are vague and often lack depth, leaving some interpretability issues unresolved.</p><p>The qualitative analysis reveals that LLaVa-NeXt generally produces more intelligent and abstract explanations, particularly for AI-generated images, despite some inaccuracies with original artworks. In contrast, InstructBLIP and KOSMOS-2, while performing well in the quantitative analysis, often produce literal descriptions that fail to provide meaningful insights into the classification decision.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we aimed to enhance the interpretability of Transformer-based models for classifying artworks as original or AI-generated by integrating Grad-CAM heatmaps with Large Language Models. We evaluated the performance of three state-of-the-art LLMs, LLaVa-NeXt, InstructBLIP, and KOSMOS-2, on their ability to generate coherent and meaningful descriptions based on Grad-CAM visualizations applied to both synthetic and original artworks.</p><p>While LLaVa-NeXt outperformed the other models in generating meaningful explanations for AI-generated content, it requires further refinement to improve its accuracy when handling original artworks. Addressing this limitation would increase its versatility across a broader range of image classification tasks. InstructBLIP, on the other hand, struggled with interpreting Grad-CAM heatmaps, often providing overly literal descriptions. Future work could focus on refining its training strategies or prompt engineering to align its outputs more closely with the visual features it is meant to explain. KOSMOS-2 showed promise with its strong multimodal grounding but lacked the depth of insight necessary for robust interpretability. Further research could enhance its ability to abstract complex visual inputs into more meaningful explanations.</p><p>Ultimately, combining advanced LLMs with explainability techniques like Grad-CAM can significantly improve the transparency and trustworthiness of AI-driven image classification systems. Fine-tuning and expanding the capabilities of these models will be critical to enhancing interpretability and fostering wider adoption, especially in other domains where explainability is essential.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaVa-NeXt</head><p>The activation map generated by the network focuses on certain areas of the work, such as the face and hair color. This suggests that the network focused on these areas to classify the work, as these areas are likely to be the parts of the work that have the greatest influence on the classification. The network may have analyzed the lines and colors to identify whether the work was artificially generated or whether it was created from an original work of art.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>InstructBLIP</head><p>The activation map presents those warm colors in that area of the artwork due to the use of thermal imaging technology. Thermal imaging, also known as infrared thermography, is a non-destructive testing technique that uses infrared cameras to detect and visualize thermal energy emitted by objects. In this case, the activation map likely represents the temperature distribution of the artwork's surface, with the warm colors indicating higher temperatures and the cooler colors indicating lower temperatures. This could be caused by various factors, such as the presence of hotspots on the artwork's surface, uneven heating or cooling due to environmental factors, or even the use of certain materials that have different thermal properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>KOSMOS-2</head><p>The classifier might have focused on the areas of the face that are most likely to be associated with emotion, such as the eyes, nose, mouth, or lips. The activation map highlights these areas, which could be the result of the classifying algorithm's attention to these features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaVa-NeXt</head><p>The activation map generated by the network focuses on some specific areas of the work, such as the profile of the face and the lines that define the shape of the head. These areas are likely relevant to the classification of the work, as they may be unique features that the network used to distinguish the original artwork from an artificially generated one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>InstructBLIP</head><p>The activation map presents those warm colors in that area of the artwork due to the presence of a heat signature. Heat signatures, also known as thermograms, are generated by the infrared radiation emitted by an object or a person's body. In this case, the activation map is likely capturing the heat signature of the woman's face, resulting in the display of warm colors in the affected area. This could be caused by various factors, such as the intensity of the infrared radiation, the distance between the camera and the subject, or the type of clothing or accessories worn by the subject.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>KOSMOS-2</head><p>The classifier might have focused on the areas of the face that are most likely to be associated with emotion, such as the eyes, nose, and mouth. The classification could have been based on the artist's facial expression, the way they hold their head, or the way the artwork is positioned in the image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaVa-NeXt</head><p>The activation map by the network focuses on some parts of the work, such as these that define the architecture of the buildings and lines that follow the contour of the buildings. The specific lines are then worked further by the generator for classification, as they may be unique features that distinguish the original work from an artificially generated network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>InstructBLIP</head><p>The abstract map presents some colors that are in contrast due to their use of thermal imaging technology. Thermal imaging captures radiation based on the intensity of infrared radiation emitted by objects, which is proportional to their temperature. In this particular artwork, thermal readings generate in sequence warm colors in terms of a figure capturing through rays resulting in a colorful and vibrant representation. The capture buttons and activation map correspond to high temperatures or clusters that yield areas of intense heat such as during building fires. This technique allows for realistic visual representation through detection of events, while also highlighting intense heat generated by these events at any given time throughout the day.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>KOSMOS-2</head><p>The designer might have focused on how there is imagery at extreme ends such as paintings depicting historic events depicted in dynamic presence while others would have portrayed quiet desolation instead. Therefore, these risks are located within the city, and presence findings behind each image might have attracted attention additional answers elaboration on each one with discovery mannerism, adding aesthetic aspects but also hazard.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaVa-NeXt</head><p>The activation map generated by the network focuses on some areas of the work, such as the sky, the leaves of the trees and the lines that define the contours of the objects. These areas were probably chosen because they contain visual information that the network used to classify the work as "AI Generated". The lines and colors in the work may have been artificially generated to create an image that appears original, but was actually generated by a classification network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>InstructBLIP</head><p>The activation map presents those warm colors in that area of the artwork due to the presence of heat signatures. Heat signatures, also known as thermal imaging or thermography, are generated by the infrared radiation emitted by objects at different temperatures. In this case, the activation map is likely capturing the heat signatures emitted by the trees and bushes in the image, resulting in the display of warm colors in the affected area.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>KOSMOS-2</head><p>The classifier might have focused on the warm colors in the artwork, such as the yellow and orange spots, as they are associated with warmth and happiness. The class might have also focused on other elements of interest, such as the tree, the house, or the person, as these elements might evoke positive emotions or evoke a sense of connection with nature.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed framework. A ViT model for classification processes an artwork. Grad-CAM is then applied to identify influential regions. Lastly, the overlayed image is fed to an LLM (LLaVa-NeXt, InstructBLIP, or KOSMOS-2), generating a human-understandable explanation.</figDesc><graphic coords="4,72.00,65.60,451.25,157.27" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Examples of artworks with overlaid Grad-CAM and corresponding explanations generated by the tested LLMs. The two artworks at the top are AI-generated, while the ones at the bottom are authentic.</figDesc><graphic coords="7,82.24,524.51,135.38,142.11" type="bitmap" /></figure>
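The Grad-CAM step of the framework in Figure 1 can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: a tiny CNN stands in for the ViT classifier (Grad-CAM on a ViT additionally requires reshaping token activations into a spatial grid), and the heatmap follows the standard Grad-CAM recipe of Selvaraju et al.: global-average-pooled gradients weight the feature maps, followed by ReLU and min-max normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Illustrative stand-in for the paper's ViT classifier (2 classes:
    AI-generated vs. authentic)."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        f = self.features(x)                     # (B, 16, H, W) feature maps
        return self.head(f.mean(dim=(2, 3))), f  # logits + feature maps

def grad_cam(model, x, class_idx):
    """Return an (H, W) Grad-CAM heatmap in [0, 1] for the given class."""
    x = x.requires_grad_(True)
    logits, feats = model(x)
    feats.retain_grad()                 # keep gradients on the non-leaf maps
    logits[0, class_idx].backward()     # backprop the target class score
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = F.relu((weights * feats).sum(dim=1))[0]        # weighted sum + ReLU
    cam = cam - cam.min()
    return (cam / (cam.max() + 1e-8)).detach()           # min-max normalize

model = TinyCNN()
heatmap = grad_cam(model, torch.randn(1, 3, 32, 32), class_idx=0)
```

In the full pipeline, such a heatmap would be resized to the artwork's resolution, overlaid as a colormap, and the composite image passed to the vision-language model to generate the textual explanation.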
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>G. Castellano and G. Zaza acknowledge funding support from the FAIR - Future AI Research (PE00000013) project, Spoke 6 - Symbiotic AI (CUP H97G22000210007), under the NRRP MUR program funded by NextGenerationEU. R. Scaringi's research is funded through a Ph.D. fellowship within the Italian "D.M. n. 352, April 9, 2022" program, under the NRRP, Mission 4, Component 2, Investment 3.3, with the Ph.D. project titled "Automatic analysis of artistic heritage via Artificial Intelligence," co-supported by Exprivia S.p.A. (CUP H91I22000410007).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">XAI.it 2024: An Overview on the Future of Explainable AI in the era of Large Language Models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Musto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pellungrini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Purificato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Setzu</surname></persName>
		</author>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of 5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence</title>
				<meeting>5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence<address><addrLine>Bolzano, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">November 25-28, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Cheap-fake Detection with LLM using Prompt Engineering</title>
		<author>
			<persName><forename type="first">G</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="105" to="109" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.17801</idno>
		<title level="m">Generative AI and copyright: A dynamic perspective</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Identifying AI-Generated Art with Deep Learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bianco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Castellano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Scaringi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vessio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CREAI@AI*IA</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="16" to="25" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Grad-CAM: Visual explanations from deep networks via gradient-based localization</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Selvaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cogswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">128</biblScope>
			<biblScope unit="page" from="336" to="359" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
		<title level="m">LLaVa-NeXt: Improved reasoning, OCR, and world knowledge</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Improved baselines with visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="26296" to="26306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.14824</idno>
		<title level="m">KOSMOS-2: Grounding multimodal large language models to the world</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detection of AI-created images using pixel-wise feature extraction and convolutional neural networks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Martin-Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Garcia-Mojon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fernandez-Barciela</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page">9037</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Online detection of AI-generated images</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Epstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="382" to="392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Organic or Diffused: Can We Distinguish Human Art from AI-generated Images?</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y J</forename><surname>Ha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Passananti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhaskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Southen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.03214</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<title level="m">Very deep convolutional networks for large-scale image recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.13504</idno>
		<title level="m">Research about the Ability of LLM in the Tamper-Detection Area</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">LLM-Enhanced Deepfake Detection: Dense CNN and Multi-Modal Fusion Framework for Precise Multimedia Authentication</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Vp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dheepthi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.10084</idno>
		<title level="m">Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Leveraging Knowledge Graphs and Deep Learning for automatic art analysis</title>
		<author>
			<persName><forename type="first">G</forename><surname>Castellano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Digeno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sansaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vessio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">248</biblScope>
			<biblScope unit="page">108859</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">ArtiFact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Sarker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">I A</forename><surname>Hakim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Fattah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Image Processing (ICIP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2200" to="2204" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
