=Paper=
{{Paper
|id=Vol-3839/paper5
|storemode=property
|title=Using LLMs to explain AI-generated art classification via Grad-CAM heatmaps
|pdfUrl=https://ceur-ws.org/Vol-3839/paper5.pdf
|volume=Vol-3839
|authors=Giovanna Castellano,Maria Grazia Miccoli,Raffaele Scaringi,Gennaro Vessio,Gianluca Zaza
|dblpUrl=https://dblp.org/rec/conf/xaiit/CastellanoMSVZ24
}}
==Using LLMs to explain AI-generated art classification via Grad-CAM heatmaps==
Giovanna Castellano, Maria Grazia Miccoli, Raffaele Scaringi, Gennaro Vessio and
Gianluca Zaza∗
Department of Computer Science, University of Bari Aldo Moro, Italy
Abstract
The proliferation of AI-generated media, especially in art, has sparked interest in creating models that differentiate
between original and AI-generated artworks. However, understanding why these models make certain decisions
remains a significant challenge. This paper enhances the explainability of Vision Transformer-based classification
models by using Grad-CAM to generate visual explanations of the model’s focus areas, combined with Large
Language Models (LLMs) to provide natural language descriptions. We evaluate three cutting-edge LLMs—LLaVa-
NeXt, InstructBLIP, and KOSMOS-2—by using them to generate textual explanations for Grad-CAM visualizations
applied to artwork classification. Through quantitative and qualitative analyses, we find that while InstructBLIP
and KOSMOS-2 achieve higher similarity scores between generated descriptions and visual content, LLaVa-NeXt
provides more insightful and coherent explanations, particularly for AI-generated art. This study demonstrates
the potential of LLMs to improve the interpretability of AI decisions in complex image classification tasks, helping
to bridge the gap between model decisions and human understanding in art classification.
Keywords
Explainable AI, Large Language Models, Grad-CAM, AI-generated art, Artwork classification
1. Introduction
Artificial Intelligence (AI) has achieved remarkable advancements in today’s digital age, particularly in
creating synthetic media. Generative models, such as Generative Adversarial Networks (GANs) and
diffusion models, can produce highly realistic images, videos, and artworks, making it increasingly
difficult to distinguish between AI-generated and human-created content. This growing challenge
is especially critical in the domain of art, where concepts of creativity, authorship, and authenticity
are deeply rooted in human expression. The ability to accurately classify and explain AI-generated
versus original artworks is therefore essential for preserving the integrity of human creativity and
safeguarding intellectual property.
A key concern related to AI-generated art is its potential to blur the boundaries between real
and synthetic content, raising questions about originality and ownership. AI-generated media can
disrupt traditional notions of artistry as machines begin to emulate complex artistic styles and
compositions with unprecedented fidelity. Beyond artistic expression, the rise of manipulated media,
such as deepfakes, has further complicated the landscape by enabling the creation of highly realistic yet
artificial videos and images [2]. These technologies, often indistinguishable to the human eye, pose
ethical and legal challenges, particularly in cases of media manipulation and copyright infringement [3].
In response to these challenges, deep learning models have been developed to automatically classify
artworks as either original (human-created) or AI-generated. These models typically leverage sophisti-
cated neural architectures, including Convolutional Neural Networks (CNNs) and Transformer-based
models, to perform classification tasks with high accuracy. For example, in a previous work [4], we
XAI.it - 5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International Conference of the Italian
Association for Artificial Intelligence, Bolzano, Italy, November 25-28, 2024 [1]
∗ Corresponding author.
giovanna.castellano@uniba.it (G. Castellano); raffaele.scaringi@uniba.it (R. Scaringi); gennaro.vessio@uniba.it (G. Vessio); gianluca.zaza@uniba.it (G. Zaza)
ORCID: 0000-0002-6489-8628 (G. Castellano); 0000-0001-7512-7661 (R. Scaringi); 0000-0002-0883-2691 (G. Vessio); 0000-0003-3272-9739 (G. Zaza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
demonstrated the effectiveness of deep learning models, such as Vision Transformers (ViTs) [5], in
distinguishing between human-created and AI-generated art. However, while these models achieve
impressive performance, their decision-making processes often remain opaque, limiting user trust and
understanding.
Explainable AI (XAI) methods, such as Gradient-weighted Class Activation Mapping (Grad-CAM) [6],
have been developed to address this opacity. Grad-CAM generates visual heatmaps that highlight
the regions of an image that most influence a model’s decision, offering a degree of interpretability
by showing users where the model “looked” to make its classification. However, while Grad-CAM
provides valuable visual insights, it may not be sufficient for non-expert users who need a more explicit
explanation of why certain regions were emphasized in the decision-making process.
To bridge this gap, recent advancements in natural language processing (NLP), mainly through
Large Language Models (LLMs), offer an opportunity to enhance the interpretability of these visual
explanations. LLMs are designed to generate natural language explanations describing complex visual
information. By integrating LLMs with Grad-CAM outputs, it is possible to generate textual descriptions
that explain why certain areas of an artwork were highlighted during classification, improving the
transparency and interpretability of AI models.
This paper proposes a framework combining Grad-CAM visualizations with advanced LLMs to
enhance the explainability of deep learning models in artwork classification. Specifically, we evaluate
the performance of three cutting-edge LLMs—LLaVa-NeXt [7], InstructBLIP [8], and KOSMOS-2 [9]—in
generating coherent and insightful natural language explanations for classifying original versus AI-
generated artworks. Through quantitative and qualitative analyses, we assess how well these models
can generate meaningful explanations that align with the visual heatmaps produced by Grad-CAM.
The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 details
our methodology, including the integration of Grad-CAM with LLMs for the explainability of artwork
classification models. Section 4 presents the experimental results. Finally, Section 5 concludes with a
summary of the results and future directions for research in this area.
2. Related work
This section reviews significant research contributions in three key areas: AI-generated image classifica-
tion, explainability in AI models using Grad-CAM, and leveraging LLMs to generate textual explanations
for improving model interpretability.
2.1. AI-generated image classification
Differentiating AI-generated images from human-created ones has garnered increasing attention in
recent years. Martin-Rodriguez et al. [10] proposed methods based on pixel-level feature extraction,
such as Photo Response Non-Uniformity (PRNU) and Error Level Analysis (ELA), to train CNNs for the
classification of AI-generated images versus real photographs. Similarly, Epstein et al. [11] explored
real-time detection of AI-generated images using advanced neural network architectures, emphasizing
the importance of fast and accurate identification in fighting synthetic content.
In the domain of art, few works have tackled this classification task. Ha et al. [12] analyzed the
distinguishing features between AI-generated and human-created artworks using deep learning models.
They demonstrated how neural networks could differentiate these categories based on creative elements
like style and composition. More recently, we evaluated the performance of deep learning models,
including VGG-19 [13], ResNet-50 [14], and Vision Transformers [5], in classifying AI-generated
artworks [4], reporting classification accuracies of up to 97%. In this study, we build on our previous ViT
model, which achieved the highest performance, while enhancing the interpretability of the classification
process using explainability techniques.
2.2. Explainability of AI models using Grad-CAM
The need for interpretability in AI models has led to the adoption of XAI techniques, with Grad-CAM [6]
being one of the most popular methods. Grad-CAM generates heatmaps highlighting image regions
that influence a model’s decision-making process. These visualizations are invaluable for improving
the transparency of AI systems.
In the context of artwork classification, we applied Grad-CAM to visualize the decision-making
processes of deep learning models, demonstrating how these heatmaps could enhance model inter-
pretability [4]. However, while these heatmaps provide insight into where the model is “looking”, they
may still be challenging for non-expert users to interpret without additional guidance. Distinguishing
between AI-generated and human-created art requires a nuanced understanding of visual cues, and
improving model interpretability is critical in domains where creative ownership and authenticity are
essential, making this a compelling use case for the study of explainability.
2.3. Leveraging LLMs for explainability
Recent advances in NLP have led to the development of LLMs capable of generating coherent and
detailed textual descriptions based on visual inputs. Yang et al. [15] explored using LLMs to detect
sophisticated image tampering, demonstrating that advanced models could accurately identify subtle
manipulations in AI-generated content. However, they also noted that current LLMs struggle with
highly realistic AI-generated images, underscoring the need for further improvements. In response
to these challenges, Samesh et al. [16] combined CNNs with multimodal fusion techniques to detect
advanced deepfakes, integrating LLMs to enhance accuracy. Their findings highlight the potential of
LLMs to offer detailed explanations that improve model transparency.
For our study, we selected three advanced LLMs to generate explanations for AI-driven artwork
classification: LLaVa-NeXt [7], InstructBLIP [8], and KOSMOS-2 [9]. These models, designed to process
visual and textual inputs, have shown promise in generating human-understandable explanations of
complex AI decisions.
3. Methods
This section outlines the proposed framework, illustrated in Fig. 1, which integrates ViTs with Grad-CAM
and LLMs to enhance the interpretability of AI-generated artwork classification.
3.1. Proposed framework
The proposed framework aims to provide human-understandable explanations of how a deep learning
model classifies artworks as either original or AI-generated. The pipeline consists of three key stages:
• A ViT model [5] is first trained to classify artworks. The input consists of RGB images of both
original and AI-generated artworks. We selected this model because it achieved 97% accuracy on
this classification task on the same dataset used in this study.
• During the inference stage, the Grad-CAM technique [6] is applied to generate heatmaps that
highlight regions of the image that most strongly influenced the ViT model’s classification
decision. These heatmaps provide visual explanations of the areas of the artwork that were most
important for the model’s decision-making process.
• To generate natural language explanations of the Grad-CAM heatmaps, three advanced
LLMs—LLaVa-NeXt [7], InstructBLIP [8], and KOSMOS-2 [9]—are integrated into the pipeline.
Each LLM receives the Grad-CAM overlay image and a specially designed prompt and generates
a description explaining why the model focused on particular areas of the artwork.
This combined approach offers visual and textual explanations, ensuring that non-expert users can
better understand the classification decisions made by the AI model.
Figure 1: Proposed framework. An artwork is processed by a ViT classification model. Grad-CAM is then
applied to identify influential regions. Lastly, the overlaid image is fed to an LLM (LLaVa-NeXt, InstructBLIP,
or KOSMOS-2), which generates a human-understandable explanation.
3.2. Implementation details
The ViT model used in this study is based on the work of Dosovitskiy et al. [5]. It processes the artwork
as a sequence of image patches and uses multi-headed self-attention to capture global information.
We employ ViT to classify images into two categories, namely original or AI-generated. Specifically,
we used ViT-B/16, pre-trained on the ImageNet dataset [17], which takes as input RGB images with a
resolution of 224 × 224 pixels. Then, we fine-tuned the last layer for 30 epochs, training the model to
recognize whether the input artwork is original or AI-generated. During training, we optimized a binary
cross-entropy loss with the well-known Adam optimizer, using an initial learning rate of 10⁻³ and a
step learning-rate schedule that decays by a factor 𝛾 = 0.1 every seven epochs. Furthermore, an early
stopping mechanism was employed, halting training after three epochs with no decrease in validation loss.
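For concreteness, the following is a minimal sketch of this training setup, assuming torchvision's ViT-B/16 implementation; the data loaders and the evaluate helper are placeholders rather than the exact code used in this work.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ViT-B/16 pre-trained on ImageNet; freeze the backbone and replace the head
# with a single-logit binary classifier (original vs. AI-generated).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False
model.heads.head = nn.Linear(model.heads.head.in_features, 1)

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the raw logit
optimizer = torch.optim.Adam(model.heads.head.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(30):
    model.train()
    for images, labels in train_loader:  # placeholder DataLoader of 224x224 RGB images
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate by gamma every seven epochs

    val_loss = evaluate(model, val_loader)  # placeholder validation routine
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping after three stagnant epochs
            break
```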
Once the ViT model is trained, Grad-CAM [6] is applied to visualize which parts of the input image
influence the model’s decision. Specifically, the gradients of the output class score are computed with
respect to the token embeddings (representing image patches) from the final layers. These gradients
are used to weigh the importance of each token, and a heatmap is generated that highlights the most
influential image regions for the classification.
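A minimal sketch of this step, assuming the open-source pytorch-grad-cam package and the torchvision ViT from the previous sketch; targeting the first layer norm of the last encoder block is a common convention for ViTs and is an assumption here, as are the model, input_tensor, and rgb_image variables.

```python
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def reshape_transform(tokens, height=14, width=14):
    # Drop the [CLS] token and fold the 196 patch tokens of ViT-B/16 back into
    # a 14x14 spatial grid, channels first, as Grad-CAM expects (B, C, H, W).
    grid = tokens[:, 1:, :].reshape(tokens.size(0), height, width, tokens.size(2))
    return grid.permute(0, 3, 1, 2)

cam = GradCAM(
    model=model,                                    # fine-tuned ViT from above
    target_layers=[model.encoder.layers[-1].ln_1],  # assumed target layer
    reshape_transform=reshape_transform,
)
# Gradients of the single output logit weigh the patch tokens; the result is a
# per-pixel importance map that is then overlaid on the input image.
heatmap = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(0)])[0]
overlay = show_cam_on_image(rgb_image, heatmap, use_rgb=True)  # rgb_image in [0, 1]
```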
Lastly, we overlay the input image with the Grad-CAM heatmap. We employed three advanced LLMs
to generate textual explanations based on this overlay, each with a tailored prompt iteratively refined
through multiple attempts to improve the clarity and relevance of the generated explanations. Specifically,
the prompt used in this work was developed following the main guidelines of prompt engineering to
ensure effective communication with the language model. In particular, five basic rules were applied,
as illustrated by the example prompt sketched after the list:
1. Clarity of input: The model was provided with a description of the input type, specifying that it
was an artwork with a superimposed heatmap.
2. Context: The context in which the input was located was provided, explaining that the heatmap
indicated the areas of the artwork that the classifier considered most relevant to its decision.
3. Objective of the output: The type of output required was explicitly stated, namely, an analysis of
the possible causes that led the classifier to identify those specific areas.
4. Output format: Clear instructions were given on the expected format of the output.
5. Example of output: An example was also provided to guide the model in generating the desired
response.
The three LLMs employed are:
• LLaVa-NeXt [7], an evolution of LLaVA-1.5, supports image resolutions up to 672 × 672 pixels
and enhances visual reasoning and OCR capabilities. We employed the quantized version of this
model.
• InstructBLIP [8], built on the BLIP-2 architecture, specializes in zero-shot vision-language tasks.
Key hyperparameters include num_beams=5 (for beam search decoding), max_new_tokens=250 ,
min_length=1 , top_p=0.9 (to maintain diversity), and temperature=1 (to balance randomness).
These parameters ensure coherent and concise descriptions are generated efficiently; a generation
call with these settings is sketched after this list.
• KOSMOS-2 [9] is a multimodal Transformer-based model designed for visual-textual grounding.
Key hyperparameters, such as max_new_tokens=1024 and attention masks, control the generation
process, ensuring the output is aligned with the visual input.
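As an illustration, a generation call with the InstructBLIP settings listed above might look as follows, using the Hugging Face transformers API; the checkpoint name is an assumed choice, and prompt is the instruction built earlier.

```python
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Assumed public checkpoint; any InstructBLIP variant exposes the same interface.
checkpoint = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(checkpoint)
model = InstructBlipForConditionalGeneration.from_pretrained(checkpoint)

overlay = Image.open("artwork_with_gradcam.png").convert("RGB")  # Grad-CAM overlay
inputs = processor(images=overlay, text=prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,        # beam search decoding
    max_new_tokens=250,
    min_length=1,
    top_p=0.9,          # nucleus bound to maintain diversity
    temperature=1.0,    # balance randomness
)
explanation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(explanation)
```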
3.3. Evaluation
We employ quantitative and qualitative analyses to evaluate the quality of the generated explanations.
The quantitative metrics measure alignment objectively but may not always capture human
understandability or insightfulness. The qualitative evaluations, instead, provide deeper insight into
the coherence and relevance of explanations, which the metrics might miss.
We measure the alignment between the generated descriptions and the visual content using two
metrics (a computation sketch follows the list):
• Image-to-text similarity: Using CLIP [18], we compute the cosine similarity between the image
(with Grad-CAM overlay) and the generated textual description. A higher score indicates that
the text better reflects the image content.
• Text-to-label similarity: We assess the consistency between the generated text and the classifi-
cation label (original or AI-generated) using the S-BERT model [19], computing the cosine
similarity between the embeddings of the generated explanation and a description of the target
label.
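A minimal sketch of both scores, assuming standard CLIP and Sentence-Transformers checkpoints; the model names below are illustrative choices, not necessarily those used in our experiments.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer, util

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed S-BERT variant

def image_to_text_similarity(overlay: Image.Image, description: str) -> float:
    """Cosine similarity between the Grad-CAM overlay and the LLM description."""
    inputs = clip_proc(images=overlay, text=[description],
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        img_emb = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

def text_to_label_similarity(description: str, label_description: str) -> float:
    """Cosine similarity between explanation and target-label description embeddings."""
    emb = sbert.encode([description, label_description], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```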
In the qualitative analysis, we manually examine the coherence, relevance, and insightfulness of
the explanations generated by each LLM. This involves comparing the descriptions with the Grad-
CAM heatmaps and assessing whether the explanations provide meaningful insights into the model’s
decision-making process.
4. Experiments
In this section, we compare the performance of the selected LLMs—LLaVa-NeXt, InstructBLIP, and
KOSMOS-2—in improving the interpretability of AI models for artwork classification. The experiments
consisted of two primary analyses: quantitative evaluation using similarity metrics and qualitative
evaluation of the model’s ability to generate coherent and relevant descriptions for selected images.
4.1. Materials
To evaluate the proposed method, we used a subset of 100 images, equally divided between 50 AI-
generated artworks and 50 original artworks, drawn from the dataset described in [4]. This dataset
combines elements from ArtGraph [20] and ArtiFact [21]. ArtGraph is a specialized knowledge graph
with 116,475 artworks classified across 32 styles and 18 genres. ArtiFact is a large-scale dataset with
2,496,738 images, including authentic and fake images from various domains such as art, human faces,
and vehicles.
Each image in our subset included the original artwork and a corresponding Grad-CAM overlay
generated during classification. The experiments were run on the Google Colab platform, utilizing an
Intel Xeon processor, 12 GB RAM, and an NVIDIA T4 GPU with 15 GB VRAM.
4.2. Quantitative analysis
For the quantitative analysis, we measured the similarity between the generated descriptions, images,
and labels to assess the models’ ability to provide relevant textual explanations.
Table 1
Image-to-text similarity, text-to-label similarity, and their sum (total similarity) for the three LLMs.

Model          Image-to-text   Text-to-label   Total similarity
LLaVa-NeXt         0.22            0.14             0.36
InstructBLIP       0.26            0.14             0.40
KOSMOS-2           0.26            0.14             0.40
Table 1 presents the results of this analysis, showing the similarity metrics for each LLM. The results
show that InstructBLIP and KOSMOS-2 achieve the highest total similarity scores (0.40), with LLaVa-
NeXt scoring slightly lower (0.36). All models perform identically in terms of text-to-label similarity
(0.14). Regarding image-to-text similarity, InstructBLIP and KOSMOS-2 score 0.26, while LLaVa-NeXt
reaches 0.22.
4.3. Qualitative analysis
In addition to the quantitative analysis, we conducted a qualitative evaluation using four sample images,
two AI-generated and two original artworks (Fig. 2). All images were correctly classified by ViT. The
goal was to assess the ability of each LLM to generate insightful and relevant explanations for the
classification of these images.
For AI-generated artworks, LLaVa-NeXt accurately identifies the relevant areas of interest and pro-
vides plausible explanations for the network’s focus. However, for original artworks, it occasionally
struggles to account for the model’s final classification, leading to less coherent explanations. Instruct-
BLIP identifies regions of interest but often misinterprets the heatmap as a thermal image, particularly
in cases of AI-generated artworks. This results in inaccurate descriptions that fail to explain the classifi-
cation decision meaningfully. KOSMOS-2 performs relatively well in identifying relevant areas for both
AI-generated and original artworks, but its explanations are vague and often lack depth, leaving some
interpretability issues unresolved.
The qualitative analysis reveals that LLaVa-NeXt generally produces more insightful and abstract
explanations, particularly for AI-generated images, despite some inaccuracies with original artworks.
In contrast, InstructBLIP and KOSMOS-2, while performing well in the quantitative analysis, often
produce literal descriptions that fail to provide meaningful insights into the classification decision.
5. Conclusion
In this work, we aimed to enhance the interpretability of Transformer-based models for classifying
artworks as original or AI-generated by integrating Grad-CAM heatmaps with Large Language Models.
We evaluated the performance of three state-of-the-art LLMs—LLaVa-NeXt, InstructBLIP, and KOSMOS-
2—on their ability to generate coherent and meaningful descriptions based on Grad-CAM visualizations
applied to both synthetic and original artworks.
While LLaVa-NeXt outperformed the other models in generating meaningful explanations for AI-
generated content, it requires further refinement to improve its accuracy when handling original
artworks. Addressing this limitation would increase its versatility across a broader range of image
classification tasks. InstructBLIP, on the other hand, struggled with interpreting Grad-CAM heatmaps,
often providing overly literal descriptions. Future work could focus on refining its training strategies
or prompt engineering to align its outputs more closely with the visual features it is meant to explain.
KOSMOS-2 showed promise with its strong multimodal grounding but lacked the depth of insight
necessary for robust interpretability. Further research could enhance its ability to abstract complex
visual inputs into more meaningful explanations.
Ultimately, combining advanced LLMs with explainability techniques like Grad-CAM can significantly
improve the transparency and trustworthiness of AI-driven image classification systems. Fine-tuning
and expanding the capabilities of these models will be critical to enhancing interpretability and fostering
wider adoption, especially in other domains where explainability is essential.

Figure 2: Examples of artworks with overlaid Grad-CAM and corresponding explanations generated by the
tested LLMs. The two artworks at the top are AI-generated, while the ones at the bottom are authentic. The
generated explanations are reproduced verbatim below.

First AI-generated artwork:
• LLaVa-NeXt: The activation map generated by the network focuses on certain areas of the work, such as the face and hair color. This suggests that the network focused on these areas to classify the work, as these areas are likely to be the parts of the work that have the greatest influence on the classification. The network may have analyzed the lines and colors to identify whether the work was artificially generated or whether it was created from an original work of art.
• InstructBLIP: The activation map presents those warm colors in that area of the artwork due to the use of thermal imaging technology. Thermal imaging, also known as infrared thermography, is a non-destructive testing technique that uses infrared cameras to detect and visualize thermal energy emitted by objects. In this case, the activation map likely represents the temperature distribution of the artwork's surface, with the warm colors indicating higher temperatures and the cooler colors indicating lower temperatures. This could be caused by various factors, such as the presence of hotspots on the artwork's surface, uneven heating or cooling due to environmental factors, or even the use of certain materials that have different thermal properties.
• KOSMOS-2: The classifier might have focused on the areas of the face that are most likely to be associated with emotion, such as the eyes, nose, mouth, or lips. The activation map highlights these areas, which could be the result of the classifying algorithm's attention to these features.

Second AI-generated artwork:
• LLaVa-NeXt: The activation map generated by the network focuses on some specific areas of the work, such as the profile of the face and the lines that define the shape of the head. These areas are likely relevant to the classification of the work, as they may be unique features that the network used to distinguish the original artwork from an artificially generated one.
• InstructBLIP: The activation map presents those warm colors in that area of the artwork due to the presence of a heat signature. Heat signatures, also known as thermograms, are generated by the infrared radiation emitted by an object or a person's body. In this case, the activation map is likely capturing the heat signature of the woman's face, resulting in the display of warm colors in the affected area. This could be caused by various factors, such as the intensity of the infrared radiation, the distance between the camera and the subject, or the type of clothing or accessories worn by the subject.
• KOSMOS-2: The classifier might have focused on the areas of the face that are most likely to be associated with emotion, such as the eyes, nose, and mouth. The classification could have been based on the artist's facial expression, the way they hold their head, or the way the artwork is positioned in the image.

First original artwork:
• LLaVa-NeXt: The activation map by the network focuses on some parts of the work, such as these that define the architecture of the buildings and lines that follow the contour of the buildings. The specific lines are then worked further by the generator for classification, as they may be unique features that distinguish the original work from an artificially generated network.
• InstructBLIP: The abstract map presents some colors that are in contrast due to their use of thermal imaging technology. Thermal imaging captures radiation based on the intensity of infrared radiation emitted by objects, which is proportional to their temperature. In this particular artwork, thermal readings generate in sequence warm colors in terms of a figure capturing through rays resulting in a colorful and vibrant representation. The capture buttons and activation map correspond to high temperatures or clusters that yield areas of intense heat such as during building fires. This technique allows for realistic visual representation through detection of events, while also highlighting intense heat generated by these events at any given time throughout the day.
• KOSMOS-2: The designer might have focused on how there is imagery at extreme ends such as paintings depicting historic events depicted in dynamic presence while others would have portrayed quiet desolation instead. Therefore, these risks are located within the city, and presence findings behind each image might have attracted attention additional answers elaboration on each one with discovery mannerism, adding aesthetic aspects but also hazard.

Second original artwork:
• LLaVa-NeXt: The activation map generated by the network focuses on some areas of the work, such as the sky, the leaves of the trees and the lines that define the contours of the objects. These areas were probably chosen because they contain visual information that the network used to classify the work as ”AI Generated”. The lines and colors in the work may have been artificially generated to create an image that appears original, but was actually generated by a classification network.
• InstructBLIP: The activation map presents those warm colors in that area of the artwork due to the presence of heat signatures. Heat signatures, also known as thermal imaging or thermography, are generated by the infrared radiation emitted by objects at different temperatures. In this case, the activation map is likely capturing the heat signatures emitted by the trees and bushes in the image, resulting in the display of warm colors in the affected area.
• KOSMOS-2: The classifier might have focused on the warm colors in the artwork, such as the yellow and orange spots, as they are associated with warmth and happiness. The class might have also focused on other elements of interest, such as the tree, the house, or the person, as these elements might evoke positive emotions or evoke a sense of connection with nature.
Acknowledgments
G. Castellano and G. Zaza acknowledge funding support from the FAIR - Future AI Research (PE00000013)
project, Spoke 6 - Symbiotic AI (CUP H97G22000210007), under the NRRP MUR program funded by
NextGenerationEU. R. Scaringi’s research is funded through a Ph.D. fellowship within the Italian
“D.M. n. 352, April 9, 2022” program, under the NRRP, Mission 4, Component 2, Investment 3.3, with
the Ph.D. project titled “Automatic analysis of artistic heritage via Artificial Intelligence,” co-supported
by Exprivia S.p.A. (CUP H91I22000410007).
References
[1] M. Polignano, C. Musto, R. Pellungrini, E. Purificato, G. Semeraro, M. Setzu, XAI.it 2024: An
Overview on the Future of Explainable AI in the era of Large Language Models, in: Proceedings of
5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International
Conference of the Italian Association for Artificial Intelligence, Bolzano, Italy, November 25-28,
2024, CEUR-WS.org, 2024.
[2] G. Wu, W. Wu, X. Liu, K. Xu, T. Wan, W. Wang, Cheap-fake Detection with LLM using Prompt En-
gineering, in: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW),
IEEE, 2023, pp. 105–109.
[3] S. A. Yang, A. H. Zhang, Generative AI and copyright: A dynamic perspective, arXiv preprint
arXiv:2402.17801 (2024).
[4] T. Bianco, G. Castellano, R. Scaringi, G. Vessio, Identifying AI-Generated Art with Deep Learning,
in: CREAI@AI*IA, 2023, pp. 16–25.
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image
recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[6] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual expla-
nations from deep networks via gradient-based localization, International Journal of Computer
Vision 128 (2020) 336–359.
[7] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, Y. J. Lee, LLaVA-NeXT: Improved reasoning, OCR, and
world knowledge, 2024.
[8] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, InstructBLIP: Towards
general-purpose vision-language models with instruction tuning, arXiv preprint arXiv:2305.06500 (2023).
[9] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, F. Wei, KOSMOS-2: Grounding multimodal
large language models to the world, arXiv preprint arXiv:2306.14824 (2023).
[10] F. Martin-Rodriguez, R. Garcia-Mojon, M. Fernandez-Barciela, Detection of AI-created images
using pixel-wise feature extraction and convolutional neural networks, Sensors 23 (2023) 9037.
[11] D. C. Epstein, I. Jain, O. Wang, R. Zhang, Online detection of AI-generated images, in: Proceedings
of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 382–392.
[12] A. Y. J. Ha, J. Passananti, R. Bhaskar, S. Shan, R. Southen, H. Zheng, B. Y. Zhao, Organic or Diffused:
Can We Distinguish Human Art from AI-generated Images?, arXiv preprint arXiv:2402.03214
(2024).
[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
arXiv preprint arXiv:1409.1556 (2014).
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] X. Yang, J. Zhou, Research about the Ability of LLM in the Tamper-Detection Area, arXiv preprint
arXiv:2401.13504 (2024).
[16] S. E. VP, R. Dheepthi, et al., LLM-Enhanced Deepfake Detection: Dense CNN and Multi-Modal
Fusion Framework for Precise Multimedia Authentication, in: 2024 International Conference on
Advances in Data Engineering and Intelligent Computing Systems (ADICS), IEEE, 2024, pp. 1–6.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image
database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp.
248–255.
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[19] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,
arXiv preprint arXiv:1908.10084 (2019).
[20] G. Castellano, V. Digeno, G. Sansaro, G. Vessio, Leveraging Knowledge Graphs and Deep Learning
for automatic art analysis, Knowledge-Based Systems 248 (2022) 108859.
[21] M. A. Rahman, B. Paul, N. H. Sarker, Z. I. A. Hakim, S. A. Fattah, ArtiFact: A large-scale dataset
with artificial and factual images for generalizable and robust synthetic image detection, in: 2023
IEEE International Conference on Image Processing (ICIP), IEEE, 2023, pp. 2200–2204.