<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for Multimodal Italian Fake News Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo D'Amico</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Napolitano</string-name>
          <email>davide.napolitano@polito.it</email>
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-9077-4103</contrib-id>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Vaiani</string-name>
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-3605-1577</contrib-id>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <email>luca.cagliero@polito.it</email>
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7185-5247</contrib-id>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Fake News Detection, Multimodal Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7-8, Parma, IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The MULTI-Fake-DetectiVE challenge addresses the automatic detection of Italian fake news in a multimodal setting, where both textual and visual components contribute as potential sources of fake content. This paper describes the PoliTO approach to the tasks of fake news detection and analysis of the modality contributions. Our solution turns out to be the best performer on both tasks. It leverages the established FND-CLIP multimodal architecture and proposes ad hoc extensions including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. Thanks to its effectiveness in combining visual and textual content, our solution contributes to fighting the spread of disinformation in the Italian news flow.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News Detection</kwd>
        <kwd>Multimodal Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The MULTI-Fake-DetectiVE challenge [2] proposed at EVALITA 2023 [3] focuses on overcoming the limitations of existing approaches in coping with multimodal Italian news content. It addresses the automatic detection of Italian fake news in a multimodal setting, where both textual and visual components potentially contribute as sources of fake content. The challenge has the twofold aim of accurately discriminating between real and fake news content and investigating the influence of the visual and textual components on each other's interpretation.</p>
      <p>In this work, we present the PoliTO approach to both challenge tasks. Our solution, which outperforms all the baselines and competitors in both tasks, shows that multimodal Italian fake news can be effectively detected.</p>
      <p>The remainder of this paper is organized as follows. In Section 2 we review the literature on fake news detection, considering both text-only and multimodal approaches. Section 3 briefly describes the dataset, tasks, and metrics used in the challenge. In Section 4 we describe the proposed methodology, primarily focusing on the FND-CLIP extensions. Section 5 presents the experimental setup and the obtained results. Finally, Section 6 draws the conclusions and discusses the main limitations and future directions.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
        <sec id="sec-1-3-1">
          <title>NLP-based approaches.</title>
          <p>Early approaches focused on linguistic features, such as lexical and syntactic patterns, to distinguish between real and fake news. However, with the advancement of deep learning, researchers have increasingly turned to more sophisticated methods such as recurrent neural networks [4], convolutional neural networks [5], and transformer models [6] to capture semantic and contextual information for improved detection accuracy. In this work, we mainly rely on state-of-the-art transformers pretrained on Italian textual data to effectively extract information from the news textual component.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>Multimodal approaches.</title>
          <p>Incorporating multimodal information such as text and images has shown to be promising to improve the accuracy of fake news detection systems [7]. Recently, the adoption of multimodal architectures and transformers has shown to be particularly effective in capturing the semantic relationships among different modalities for fake news detection, e.g., CB-Fake [8], CAFE [9], and TTEC [10].</p>
          <p>FND-CLIP, proposed by [11], is among the most recently proposed multimodal architectures for fake news detection. It relies on the established CLIP model [12] to measure the cross-modal similarity and guide the mapping and fusion of the input features. The architecture develops along three main streams: a textual one, which extracts information using BERT and CLIP; a visual one, which extracts features from the images using ResNet and CLIP; and a multimodal one, which combines the features extracted using CLIP from both modalities. FND-CLIP suffers from the following limitations:</p>
          <list list-type="bullet">
            <list-item><p>The natural language encoder neglects the polarity of the input text, which is known to be relevant to fake news detection [13].</p></list-item>
            <list-item><p>Fake news examples are likely to be undersampled in real training data. Hence, the classification model may suffer from class imbalance effects.</p></list-item>
            <list-item><p>Multimodal fake news often contains tampered visual content. Tampered images are more likely to be detected in the frequency domain space. However, FND-CLIP does not consider any frequency-based image descriptor [14].</p></list-item>
          </list>
          <p>Our research endeavors to address the aforesaid limitations by proposing FND-CLIP-IT, i.e., an improved version of FND-CLIP suited to multimodal Italian fake news detection.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Both task-specific datasets consist of a collection of Twit</title>
        <p>Our research endeavors to address the aforesaid limit-er posts and newspaper articles describing one or more
tations by proposingFND-CLIP-IT, i.e., an improved ver- real events. For Task 1 the training set contains 908
dission FND-CLIP suited to multimodal Italian fake newtsinct labeled sample1.s The labels in the training data are
detection. distributed as follows: CF 16.4%, PF 22.0%, PR 44.4%, CR
17.2%. Around 80.0% of the samples are tweets, whereas
the remaining ones are news articles. The test set
con3. Task and Dataset Description sists of 193 samples following roughly the same per-class
and per-type distributions as in the training data. For
3.1. Tasks Description Task 2, the training set contains 1309 distinct samples and
the per-class distribution Mis 26.9%, U 40.6%, NM 31.5%.</p>
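          <p>For reference, both ranking metrics can be computed with scikit-learn; the following is a toy sketch with illustrative labels, not challenge data:</p>
          <preformat>
from sklearn.metrics import f1_score

y_true = ["CF", "PF", "PR", "PR", "CR", "PR"]        # illustrative gold labels
y_pred = ["CF", "PR", "PR", "PR", "CR", "CF"]        # illustrative predictions
print(f1_score(y_true, y_pred, average="macro"))     # plain mean of per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by class support
          </preformat>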
        </sec>
        <sec id="sec-1-4-2">
          <title>3.2. Dataset Description</title>
          <p>Both task-specific datasets consist of a collection of Twitter posts and newspaper articles describing one or more real events. For Task 1, the training set contains 908 distinct labeled samples (available at the time of writing, June 2023). The labels in the training data are distributed as follows: CF 16.4%, PF 22.0%, PR 44.4%, CR 17.2%. Around 80.0% of the samples are tweets, whereas the remaining ones are news articles. The test set consists of 193 samples following roughly the same per-class and per-type distributions as in the training data. For Task 2, the training set contains 1309 distinct samples, and the per-class distribution is M 26.9%, U 40.6%, NM 31.5%. 66.0% of the samples are tweets, whereas the remaining ones are news articles. The test set contains 219 samples. Compared to the training data, the per-type sample distribution is slightly more biased towards tweets (75.0%) and Non-Misleading content (45.2%).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <p>Here we present FND-CLIP-IT, an improved version of FND-CLIP suited to the MULTI-Fake-DetectiVE challenge. Our solution is rooted in the original FND-CLIP model [11] and a set of unimodal language and visual encoders described below.</p>
        <sec id="sec-2-1-1">
          <title>Unimodal language baselines.</title>
          <p>We utilize the following models tailored to the Italian language: BERT-IT (https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), GilBERTo (https://huggingface.co/idb-ita/gilberto-uncased-from-camembert), and BART-IT [15] (https://huggingface.co/morenolq/bart-it).</p>
          <p>Since the input text can be longer than the maximum model input size, we adopt a hierarchical approach: the text is divided into chunks of fixed length, each chunk is fed to the transformer encoder, and the final representation is obtained by averaging the [CLS] token embeddings of all chunks.</p>
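          <p>For illustration, the hierarchical encoding can be sketched as follows (a minimal example assuming the BERT-IT checkpoint above and a 512-token chunk length; both are illustrative choices, not the exact competition code):</p>
          <preformat>
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "dbmdz/bert-base-italian-xxl-cased"  # BERT-IT language baseline
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL).eval()

def encode_long_text(text):
    # Split the text into fixed-length chunks, each wrapped with its own [CLS]/[SEP].
    enc = tokenizer(text, max_length=512, truncation=True, padding="max_length",
                    return_overflowing_tokens=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    cls_per_chunk = out.last_hidden_state[:, 0, :]  # [CLS] embedding of each chunk
    return cls_per_chunk.mean(dim=0)                # average over chunks = text representation
          </preformat>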
        </sec>
        <sec id="sec-2-1-2">
          <title>Unimodal visual baselines.</title>
          <p>We exploit two established models, i.e., ViT [16] and ResNet-152 [17]. Since multiple pictures can be associated with the same sample, at inference time we separately evaluate all the images, and the final prediction is the average of all the obtained output
logits.</p>
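          <p>A minimal sketch of the per-sample aggregation at inference time (the model and image preprocessing are placeholders):</p>
          <preformat>
import torch

def predict_sample(model, images):
    # images: list of (3, H, W) tensors attached to the same news sample.
    with torch.no_grad():
        logits = [model(img.unsqueeze(0)) for img in images]  # one forward pass per image
    avg = torch.stack(logits).mean(dim=0)                     # average the output logits
    return avg.argmax(dim=-1)                                 # final class prediction
          </preformat>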
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Multimodal baselines.</title>
        <p>To leverage visual and textual content at the same time, we rely on (i) the standard FND-CLIP [11] architecture, adapted to handle Italian text rather than English, (ii) CLIP [12], and (iii) a late fusion approach combining BERT-IT and ResNet-152.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.1. FND-CLIP-IT</title>
        <p>FND-CLIP-IT extends the state-of-the-art FND-CLIP architecture to address the current limitations of fake news detection approaches. By incorporating the proposed extensions, the overall efficacy and robustness of FND-CLIP-IT show significant improvements compared to the baseline versions. A detailed description of the proposed extensions, hereafter denoted by A, B, C, D, and E for the sake of brevity, is given below.</p>
        <p>A. Sentiment-based textual representation: To consider the polarity of the input text for fake news detection [13], we enrich the textual representation by adding a sentiment-based encoding to the existing text encoders. Specifically, we use the Italian-BERT model finetuned on a sentiment analysis task (https://huggingface.co/neuraly/bert-base-italian-cased-sentiment). We also consider the following variants of sentiment-based textual representation:</p>
        <p>A1. A concatenation of the sentiment-based embedding to the original representation, on top of which we apply the textual projection head.</p>
        <p>A2. A separate stream of information with a dedicated projection head. A sketch of both variants is given below.</p>
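        <p>The two variants can be contrasted as follows (embedding and projection sizes are illustrative assumptions, not the exact implementation):</p>
        <preformat>
import torch
import torch.nn as nn

d_text, d_sent, d_proj = 768, 768, 256  # assumed embedding/projection sizes

# A1: concatenate the sentiment embedding to the text representation,
# then apply a single textual projection head on top.
proj_a1 = nn.Linear(d_text + d_sent, d_proj)
def variant_a1(text_emb, sent_emb):
    return proj_a1(torch.cat([text_emb, sent_emb], dim=-1))

# A2: keep the sentiment embedding as a separate stream with its own projection head.
proj_text, proj_sent = nn.Linear(d_text, d_proj), nn.Linear(d_sent, d_proj)
def variant_a2(text_emb, sent_emb):
    return proj_text(text_emb), proj_sent(sent_emb)  # two parallel streams
        </preformat>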
        <p>B. DFT-based additional stream: we convert the image from the spatial domain to the frequency domain by applying the Discrete Fourier Transform (DFT). The purpose is to detect tampered images, which likely occur in multimodal fake news [14]. We encode both the real and imaginary parts using a dedicated VGG19 [18]. The obtained representations are then concatenated to generate a parallel stream of information that is then combined with the others before applying the final FND-CLIP classifier.</p>
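        <p>A sketch of the frequency-domain stream (whether one shared or two separate VGG19 encoders are used, and the pooling step, are assumptions made for illustration):</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision.models import vgg19

class DFTStream(nn.Module):
    """Encode the real and imaginary parts of the image spectrum with VGG19."""
    def __init__(self):
        super().__init__()
        self.enc_real = vgg19(weights=None).features
        self.enc_imag = vgg19(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, img):                     # img: (B, 3, H, W), spatial domain
        freq = torch.fft.fft2(img)              # complex 2D spectrum, same shape
        real = self.pool(self.enc_real(freq.real)).flatten(1)
        imag = self.pool(self.enc_imag(freq.imag)).flatten(1)
        return torch.cat([real, imag], dim=-1)  # parallel frequency-domain stream
        </preformat>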
        <p>C. Embedding concatenation: instead of summing the embeddings of each stream, we concatenate them. Concatenation has already been proven to be an effective way of combining multimodal information [19]. The rationale behind it is that, by keeping more fine-grained pieces of information, the classification head, adapted to handle the new encoding, can capture the most discriminating source features in a more effective
way.</p>
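        <p>The difference between the two fusion schemes in a toy sketch (stream names and sizes are illustrative):</p>
        <preformat>
import torch

B, d = 4, 256  # illustrative batch and embedding sizes
text_emb, visual_emb, clip_emb = (torch.randn(B, d) for _ in range(3))

# Summing (original FND-CLIP fusion): the streams collapse into one (B, d) vector.
fused_sum = text_emb + visual_emb + clip_emb

# Concatenation (extension C): stream-specific features are preserved; the
# classification head is adapted to take the wider (B, 3 * d) input.
fused_cat = torch.cat([text_emb, visual_emb, clip_emb], dim=-1)
        </preformat>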
        <p>D. Class rebalancing through data augmentation: since the dataset is quite imbalanced across the classes, we re-balance the data distribution by penalizing the most frequent class. In particular, we generate new samples of the minority classes by applying a textual augmentation based on back-translation [20], which has already proved to be beneficial in both multimodal [21] and fake news detection [22] tasks. The auxiliary language adopted is English, and the translation models
used are provided by Helsinki-NLP (https://huggingface.co/Helsinki-NLP).</p>
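        <p>A minimal back-translation sketch with the Helsinki-NLP Opus-MT checkpoints (batching and decoding strategies are simplifications of what an actual pipeline would use):</p>
        <preformat>
from transformers import MarianMTModel, MarianTokenizer

IT_EN, EN_IT = "Helsinki-NLP/opus-mt-it-en", "Helsinki-NLP/opus-mt-en-it"

def translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

def back_translate(texts):
    # it -> en -> it round trip yields paraphrased minority-class samples.
    return translate(translate(texts, IT_EN), EN_IT)
        </preformat>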
        <p>E. Additional Squeeze-and-Excitation layers: similar to FND-CLIP, we employ a squeeze-and-excitation operation [23] to weigh the input embedding streams. The purpose of a squeeze-and-excitation block is to adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. Unlike [11], where only the textual and visual streams are weighed differently, we also adopt squeeze-and-excitation within each modality to weigh the relevance of each encoder. The key idea is to give more importance to discriminating
modality-specific embeddings.</p>
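        <p>A sketch of the squeeze-and-excitation gating applied over a set of embedding streams (the reduction ratio and the pooling used for the squeeze step are assumptions):</p>
        <preformat>
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Adaptively reweigh n_streams encoder embeddings, within or across modalities."""
    def __init__(self, n_streams, reduction=2):
        super().__init__()
        hidden = max(n_streams // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(n_streams, hidden), nn.ReLU(),
            nn.Linear(hidden, n_streams), nn.Sigmoid(),
        )

    def forward(self, streams):                 # streams: (B, n_streams, d)
        squeezed = streams.mean(dim=-1)         # squeeze: one descriptor per stream
        gates = self.fc(squeezed)               # excitation: per-stream weights in (0, 1)
        return streams * gates.unsqueeze(-1)    # recalibrated streams
        </preformat>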
      </sec>
      <sec id="sec-2-5">
        <p>Beyond considering each FND-CLIP extension separately, we also build both models that combine the proposed extensions in different ways and ensemble methods that combine the best-performing individual models. To this end, we use a weighted average of the individual logits for each class.</p>
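        <p>The ensembling step then reduces to a weighted average of the per-model logits (the weights are an assumption, e.g., derived from validation scores):</p>
        <preformat>
import torch

def ensemble_logits(logits_list, weights):
    # logits_list: one (B, n_classes) tensor per model; weights: one scalar per model.
    w = torch.tensor(weights, dtype=torch.float32)
    w = w / w.sum()                                        # normalize the weights
    return sum(wi * li for wi, li in zip(w, logits_list))  # weighted average of logits
        </preformat>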
      </sec>
      <sec id="sec-2-6">
        <title>6https://huggingface.co/Helsinki-NLP</title>
        <p>Model
BERT-IT
GilBERTo
BART-IT
ResNet-152
ViT
BERT-IT+ResNet-152
CLIP-IT
FND-CLIP-IT
FND-CLIP-IT 1
FND-CLIP-IT 2
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT∗ 2,
FND-CLIP-IT∗ 1,,
FND-CLIP-IT∗ 1,,
FND-CLIP-IT 1,,,,
ENSEMBLE
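        <p>For completeness, a minimal focal loss sketch (the focusing parameter gamma=2 is a common default, assumed here):</p>
        <preformat>
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Cross-entropy down-weighted for well-classified samples: (1 - p_t)^gamma * CE.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                  # model probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
        </preformat>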
      </sec>
      <sec id="sec-2-6-2">
        <title>5.2. Results</title>
        <p>[Table 1: validation results on Task 1 for the baselines (BERT-IT, GilBERTo, BART-IT, ResNet-152, ViT, BERT-IT+ResNet-152, CLIP-IT, FND-CLIP-IT) and for the FND-CLIP-IT variants, their combinations, and the ensemble; the numeric values and variant subscripts are not recoverable from the extraction.]</p>
        <p>Table 1 presents the results of the baselines (upper part) and our proposed solutions (lower part), obtained on the Task 1 validation set. Significantly, the outcomes reveal an intriguing pattern wherein text-only models exhibit superior performance when compared to image-only models, underscoring the paramount importance of textual information within the context of the task. Notably, the multimodal CLIP baseline demonstrates results comparable to the text-only models. At the same time, the FND-CLIP-IT architecture attains performance marginally better than the BERT-IT model. Furthermore, our diverse extensions of the FND-CLIP-IT framework, when applied individually, yield notable improvements over the original implementation. In addition, select combinations of these variants produce even more promising outcomes.</p>
        <p>Although both focal loss and cross-entropy were evaluated, we chose to report only the results obtained with focal loss, due to their overall superior performance compared to cross-entropy. It is worth noting, however, that the combination of all variants does not surpass the performance of specific combinations, indicating a potential susceptibility to overfitting. Furthermore, an intriguing observation emerges with the implementation of an ensemble model that leverages the best-performing combinations. This ensemble model outperforms the individual models, further accentuating the benefits of employing ensemble techniques to enhance overall performance.</p>
      </sec>
      <sec id="sec-2-6-3">
        <title>5.3. Competition</title>
        <p>We employed our ensemble method to evaluate the performance of our FND-CLIP-IT variants on the test samples. The test results are presented in Table 2. The upper part of Table 2 shows the outcomes obtained for Task 1. Although these results are worse than the performance achieved on our validation set, they surpass all other baselines and competitors. Furthermore, we fine-tuned the same ensemble of models for Task 2 by replacing the last layer of the classification head. The bottom of Table 2 reports the achieved results for Task 2. Remarkably, our approach outperforms both the baseline and the competitors.</p>
        <p>[Table 2: official MULTI-Fake-DetectiVE results. Task 1 runs: PoliTo-P1, extremITA-camoscio_lora, AIMH-MYPRIMARYRUN, Baseline-SVM-TEXT; Task 2 runs: HIJLI-JU-CLEF-Multi, PoliTo-P1, Baseline-MLP-TEXT, AIMH-MYPRIMARYRUN; the numeric values are not recoverable from the extraction.]</p>
      </sec>
      </sec>
      <sec id="sec-2-7">
      <title>6. Conclusion and Future Work</title>
      <p>In this study, we introduced the FND-CLIP-IT architecture, exploring several variants for fake news detection in a multimodal setting. Our findings demonstrate the effectiveness of these variants, with notable improvements observed over the original implementation. Furthermore, by leveraging our best ensemble method, we have demonstrated the robustness and versatility of our FND-CLIP-IT variants across both Task 1 and Task 2, surpassing existing approaches in terms of performance and effectiveness.</p>
      <p>In the future, we plan to continue refining and optimizing the proposed variants to further enhance their performance. Further investigating the models' decision-making process will be an interesting direction for future research.</p>
    </sec>
    <sec id="sec-2-8">
      <title>Acknowledgments</title>
      <p>This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PNRR M4C2, INVESTIMENTO 1.3 D.D. 1555 11/10/2022, PE00000013). This study was carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from the European Union Next-GenerationEU (PNRR M4C2, INVESTIMENTO 1.3 D.D. 1551.11-10-2022, PE00000004). This manuscript reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them. The research leading to these results has been partly funded by the SmartData@PoliTO center for Big Data and Machine Learning technologies. Computational resources were provided by HPC@POLITO (https://www.hpc.polito.it/), a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Aral,</surname>
          </string-name>
          <article-title>The spread of true and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Table 2 false news online</article-title>
          ,
          <source>Science</source>
          (
          <year>2018</year>
          ).
          <article-title>Oficial MULTI-Fake-DetectiVE results</article-title>
          . For the oficial base- [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          proach. at evalita 2023:
          <article-title>Overview of the multimodal fake</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>By leveraging our best ensemble method, we have Language Processing and Speech Tools for Italian. demonstrated the robustness and</article-title>
          versatility of FoNurD- Final
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>CLIP-IT variants across both Task 1 and Task 2</article-title>
          ,
          <fpage>surpass</fpage>
          - Italy,
          <year>2023</year>
          .
          <article-title>ing existing approaches in terms of performance</article-title>
          and [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          , R. Sprug-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          efectiveness. noli, G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          6. Conclusion and
          <article-title>Future of the Eighth Evaluation Campaign of Natural Lan-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>In this study, we introduced the FND-CLIP-IT architec- 2023. ture exploring several variants for fake news detection</article-title>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Iwendi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohan</surname>
          </string-name>
          , S. khan, E. Ibeke,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ahmadian,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>in a multimodal setting. Our findings demonstrate the ef- T. Ciano, Covid-19 fake news sentiment analysis,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>fectiveness of these variants, with notable improvements Computers and Electrical Engineering (</article-title>
          <year>2022</year>
          ).
          <article-title>observed over the original implementation</article-title>
          . Furthermore,[5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>In the future, we plan to continue refining and optimiz- news detection</article-title>
          ,
          <source>Applied Intelligence</source>
          (
          <year>2023</year>
          ).
          <article-title>ing the proposed variants to further enhance their per</article-title>
          [-6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Kang</surname>
          </string-name>
          , H. Lim, exbake:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>decision-making process will be an interesting direction ers (bert</article-title>
          ),
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
          ).
          <article-title>for future research</article-title>
          . [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alonso-Bartolome</surname>
          </string-name>
          , Multi-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>modal fake news detection</article-title>
          ,
          <source>Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          ). Acknowledgments [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Palani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elango</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Viswanathan</surname>
          </string-name>
          <string-name>
            <surname>K</surname>
          </string-name>
          , Cb-fake: A
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          and bert,
          <source>Multimedia Tools and Applications</source>
          (
          <year>2022</year>
          ). [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ramesh,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <year>2021</year>
          . [13]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          , J. Vi-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ). [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Multi-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (
          <year>2023</year>
          ). [15]
          <string-name>
            <surname>M. La Quatra</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cagliero</surname>
          </string-name>
          , Bart-it: An eficient
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>marization</surname>
          </string-name>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ). [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis-
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Trans-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>formers for image recognition at scale</article-title>
          , in: 9th Inter-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>ICLR</source>
          <year>2021</year>
          ,
          <year>2021</year>
          . [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep residual learn-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2016</year>
          . [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , Very deep convolu-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>in: Proc. of the 3rd International Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Learning</given-names>
            <surname>Representations</surname>
          </string-name>
          ,
          <source>ICLR</source>
          <year>2015</year>
          , San Diego,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>CA</surname>
          </string-name>
          , USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          ,
          <year>2015</year>
          . [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          , Lever-
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          tion, in
          <source>: Proc. of the 37th ACM/SIGAPP Symposium</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>on Applied Computing</source>
          ,
          <year>2022</year>
          . [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          , Under-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>2018 Conference on Empirical Methods in Natu-</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <year>2018</year>
          . [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , P. Garza, PoliTo at SemEval-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>2023 task 1: Clip-based visual-word sense disam-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>17th International Workshop on Semantic Evalua-</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>tion (SemEval-2023),</surname>
            <given-names>ACL</given-names>
          </string-name>
          , Toronto, Canada,
          <year>2023</year>
          . [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          , Data augmentation
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>in the urdu language</article-title>
          ,
          <source>in: Proc. of the 12th lan-</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>guage resources and evaluation conference</source>
          ,
          <year>2020</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          pp.
          <fpage>2537</fpage>
          -
          <lpage>2542</lpage>
          . [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <surname>Squeeze-</surname>
          </string-name>
          and
          <string-name>
            <surname>-excitation</surname>
          </string-name>
          net-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>puter vision and pattern recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>