<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AIMH at MULTI-Fake-DetectIVE: System Report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Puccetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI • Area della Ricerca CNR</institution>
          ,
          <addr-line>via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report describes our contribution to the EVALITA 2023 shared task MULTI-Fake-DetectIVE, which involves the classification of news including textual and visual components. To experiment on this task we focus on textual data augmentation, extending the Italian text and the images available in the training set using machine translation models and image captioning ones. To train using different sets of input features, we use a different transformer encoder for each variant of text (Italian, English) and modality (Image). For Task 1, among the models we test, we find that using the Italian text together with its translation improves the model performance, while the captions do not provide any improvement. We test the same architecture on Task 2 as well, although in this case we achieve less satisfactory results.</p>
      </abstract>
      <kwd-group>
        <kwd>MULTI-Fake-DetectIVE</kwd>
        <kwd>Fake News</kwd>
        <kwd>Multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Misinformation, intentional or not, is a ubiquitous phenomenon in social media. Whether due to malicious intent or to scarce review, the number of outlets producing incorrect information is growing over time [<xref ref-type="bibr" rid="ref1">1</xref>]. While the only true means of protecting oneself from misinformation is the careful review of trustworthy sources, the development of sound quantitative approaches to fake news detection is a worthy endeavour.</p>
      <p>In this context there are works providing benchmark datasets for the very task of fake news detection on Twitter [<xref ref-type="bibr" rid="ref2">2</xref>]; however, the task is generally tackled in a unimodal setting where textual information is the only one examined. The MULTI-Fake-DetectIVE task [<xref ref-type="bibr" rid="ref3">3</xref>], part of the EVALITA 2023 campaign [<xref ref-type="bibr" rid="ref4">4</xref>], proposes to add multimodality by challenging participants to classify fake news using both textual and visual features.</p>
      <p>The task consists in classifying tweets reporting news about the war in Ukraine, with both textual and visual content, according to whether the reported news is true or fake. The task is subdivided into two subtasks:
• the first subtask is about detecting fake news by assigning a label among Certainly False, Probably False, Probably True, Certainly True;
• the second subtask is focused on detecting the agreement between text and image by assigning a label among Misleading, Non Misleading, Unrelated, which respectively indicate whether the contents of text and image support different interpretations, support the same interpretation, or are unrelated.</p>
      <p>To perform the task we focus on exploring the effectiveness of augmenting the dataset by adding variants of the input, extrapolated both from the existing text in Italian and from the images, leveraging the knowledge available in pre-trained models.</p>
      <p>The idea of exploiting the knowledge implicitly encoded in large pretrained models is used in several contexts with different goals, ranging from Neural Databases [<xref ref-type="bibr" rid="ref5">5</xref>] to synthetic text detection [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>The rest of the report is structured as follows: Section 2 reports relevant literature, Section 3 covers details of the dataset we found while preparing the models, Section 4 describes the system, Section 5 outlines the results we obtained, and finally in Section 6 we draw the conclusions of this work.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. State of the Art</title>
      <p>Recently, multimodal classification has been tackled with visual language models such as OSCAR [<xref ref-type="bibr" rid="ref7">7</xref>] and VinVL [<xref ref-type="bibr" rid="ref8">8</xref>], or with separate text and image encoding networks [<xref ref-type="bibr" rid="ref9">9</xref>]. Built upon the idea of creating a shared representation space between text and images, developed in CLIP [<xref ref-type="bibr" rid="ref10">10</xref>], several image captioning models have also been developed, such as CoCa [<xref ref-type="bibr" rid="ref11">11</xref>]; we experiment with these architectures for data augmentation. We could also use Multimodal Large Language Models for this same goal, i.e. augmenting data; some of the best performing ones are BLIP-2 [<xref ref-type="bibr" rid="ref12">12</xref>] and LLaVA [<xref ref-type="bibr" rid="ref13">13</xref>], but these are too computationally costly and we avoid using them. Instead, to perform data augmentation across languages, we employ Italian to English Neural Machine Translation models [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
      <p>We perform an analysis of the dataset meant to understand whether there is any task-specific preprocessing we have to apply to the data.</p>
      <p>Figure 1 shows the distribution of labels in both tasks; we notice that both have heavily unbalanced distributions. For Task 1, Figure 1 shows how (likely) True news form the majority of samples: indeed, while ubiquitous in our everyday experience on the web, (likely) Fake news are still a minority of the total information shared.</p>
      <p>Accordingly, for Task 2, Figure 1 shows that instances where image and text are heavily non-aligned are also a minority.</p>
      <p>While inspecting the Task 1 training dataset, we observe non-negligible data duplication: more specifically, 13.6% of the training samples are duplicates, which we remove. On the contrary, the dataset for Task 2 does not show any repetition.</p>
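      <p>As a minimal sketch of this deduplication step, assuming the training set is loaded as a pandas DataFrame and that duplicates are exact repetitions of a row (the file name and format here are hypothetical, not those of the official release):</p>
      <preformat>
import pandas as pd

# Load the Task 1 training set (hypothetical file name).
train = pd.read_csv("task1_train.csv")

n_before = len(train)
# Drop rows that are exact duplicates across all columns.
train = train.drop_duplicates().reset_index(drop=True)
n_after = len(train)

print(f"removed {n_before - n_after} duplicates "
      f"({(n_before - n_after) / n_before:.1%} of the training set)")
      </preformat>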
    </sec>
    <sec id="sec-3">
      <title>4. Description of the System</title>
      <p>In this section we describe the methodology we developed to tackle the MULTI-Fake-DetectIVE task. We report the choices made and the steps that led us to them. In particular, we focus on data augmentation, for which we mainly adopt two systems, working either on text or on images. Our architecture follows the one proposed by Gallo et al. [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>We focus on data augmentation because the dataset is composed of Italian texts and, since there are not many models pre-trained specifically on this language, we explore how well translating to English works. From here on, by sample we refer to the set of texts and images composing a single piece of news. Similarly, by features we indicate both texts and images.</p>
      <p>To explore several data augmentation possibilities we build a single pipeline that allows us to plug in multiple pretrained models and to process different input feature schemes, based on different sets of texts and images.</p>
      <p>Figure 2 outlines our architecture: for each input sequence or image in a sample we use a pretrained model to embed it; we then add a linear layer that maps all embeddings to the same dimension; finally, we sum all such embeddings (entry-wise) to create a shared hidden state and pass this vector through a linear layer that maps it to a vector with length equal to the number of classes, 4 for Task 1 and 3 for Task 2. During training we optimize all parameters, including those of the pretrained models.</p>
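      <p>A minimal PyTorch sketch of this fusion scheme follows. It is illustrative rather than our exact training code: the encoder checkpoints are passed in by name, the per-feature coefficients anticipate Section 4.2, and the use of the first-token embedding as the pooled representation of each encoder is our assumption.</p>
      <preformat>
import torch.nn as nn
from transformers import AutoModel

class LateFusionClassifier(nn.Module):
    """Embed each feature with its own pretrained encoder, project the
    embeddings to a shared size, sum them entry-wise with per-feature
    coefficients, and classify with a single linear layer."""

    def __init__(self, encoder_names, coefficients, shared_dim=512, n_classes=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [AutoModel.from_pretrained(name) for name in encoder_names]
        )
        self.projections = nn.ModuleList(
            [nn.Linear(enc.config.hidden_size, shared_dim) for enc in self.encoders]
        )
        self.coefficients = coefficients  # fixed scalars, one per feature
        self.head = nn.Linear(shared_dim, n_classes)

    def forward(self, batches):
        # batches[i] holds the tokenized/processed inputs for encoder i.
        hidden = 0.0
        for enc, proj, coef, batch in zip(
            self.encoders, self.projections, self.coefficients, batches
        ):
            pooled = enc(**batch).last_hidden_state[:, 0]  # first-token embedding
            hidden = hidden + coef * proj(pooled)
        return self.head(hidden)  # logits, trained with nn.CrossEntropyLoss
      </preformat>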
      <sec id="sec-3-1">
        <title>4.1. Data Augmentation</title>
          <p>The architecture we use allows us to seamlessly use as input any number of texts and images for each sample, in particular by adding extra features. We add features in two ways (see the sketch at the end of this subsection):
• we translate the textual documents to English using an open-source machine translation model [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>], in particular an Italian to English model (www.huggingface.co/Helsinki-NLP/);
• we caption the images using the image captioning model CoCa [<xref ref-type="bibr" rid="ref11">11</xref>], fine-tuned on MSCOCO [<xref ref-type="bibr" rid="ref16">16</xref>], of which we use an open source version (https://laion.ai/blog/coca/).</p>
          <p>Adding these extra inputs gives us the possibility to compose samples with different sets of features among Italian Text, English Text, English Caption, Image. We evaluate three sets of features:
• English Text, Image;
• English Text, Italian Text, Image;
• English Text, English Caption, Image.</p>
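          <p>A minimal sketch of the two augmentation steps, assuming the Helsinki-NLP Italian-to-English OPUS-MT checkpoint on Hugging Face and the open_clip release of CoCa fine-tuned on MSCOCO (the exact checkpoint tags are our assumption):</p>
          <preformat>
import torch
import open_clip
from PIL import Image
from transformers import pipeline

# 1) Translate the Italian text to English with an OPUS-MT model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-it-en")
english_text = translator("Testo della notizia in italiano.")[0]["translation_text"]

# 2) Caption the image with CoCa fine-tuned on MSCOCO (open_clip weights).
model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
image = transform(Image.open("news_image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    tokens = model.generate(image)
caption = open_clip.decode(tokens[0])  # still contains start/end special tokens
          </preformat>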
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Small Scale Ablation Study</title>
          <p>All the models we test share the same high-level architecture, shown in Figure 2: as mentioned above, we use different pretrained transformer encoders to embed the different modalities, sum all the embeddings entry-wise after mapping them to the same dimension through a linear layer, and finally, with another linear layer, we map the result to a vector whose length is the number of labels; we then compute the usual Cross Entropy loss for classification.</p>
          <p>When summing the encodings of the separate features, we multiply each of them by a coefficient, call it α (e.g., α<sub>en-text</sub> is the coefficient multiplying the embedding of the English translation), that modulates the relative importance of each feature. Similarly, each feature has its own pre-trained encoder; we use the following ones:
• ViT [<xref ref-type="bibr" rid="ref17">17</xref>], in particular the vit-large-patch32-384 version (https://huggingface.co/google/vit-large-patch32-384), to encode images;
• RoBERTa large [<xref ref-type="bibr" rid="ref18">18</xref>] to encode text in English, either the translated texts or the generated captions;
• a version of BERT-base pretrained on Italian (https://github.com/dbmdz/berts) to encode all the Italian text we use.</p>
          <p>We perform all our validation tests by splitting the training dataset into 80% training and 20% validation. The main architecture choices we make concern the shared size to which we map the embeddings output by each encoder, the α that multiplies each of the embeddings before summing them, the shape of the classification head, and the pretrained models we use. Let us list how we chose each of them:
• For the vector size, we experiment with 512 and 1024; seeing that performance does not change between these two settings, we use the smaller value, 512, in all our experiments.
• Concerning the α of each modality, we notice that α<sub>en-text</sub> is the most relevant one and, after some tests, we choose the parameters as follows: α<sub>en-text</sub> = 1.0 and all the others equal to 0.1.
• The final ablation we performed concerns the classification head, which we eventually choose to be a single linear layer with input size 512 and output size equal to the number of labels, 4 for Task 1 and 3 for Task 2. Initially, we tried a different version with two linear layers, with a tanh activation function in between and a hidden size of 2048, but this leads to lower (although comparable) performance in all our experiments.
• Similarly, while choosing the architecture, we experimented with smaller versions of each transformer encoder, namely: (a) ViT with patch16-224 instead of patch32-384; (b) roberta-base instead of roberta-large; (c) bert-base pretrained on English instead of Italian. However, while faster to train, switching any pretrained model to its smaller version reduced performance, and therefore we opt for the larger ones when performing the grid search to choose our best model.</p>
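          <p>For instance, with the choices above, the LateFusionClassifier sketched earlier in this section would be instantiated roughly as follows (the Italian BERT checkpoint tag is our assumption, based on the dbmdz release):</p>
          <preformat>
# Encoders for English text (translations and captions), Italian text, images.
model = LateFusionClassifier(
    encoder_names=[
        "roberta-large",                  # English translations and captions
        "dbmdz/bert-base-italian-cased",  # Italian text (assumed checkpoint tag)
        "google/vit-large-patch32-384",   # images
    ],
    coefficients=[1.0, 0.1, 0.1],  # alpha_en-text = 1.0, all others 0.1
    shared_dim=512,                # the selected shared embedding size
    n_classes=4,                   # Task 1; use 3 for Task 2
)
          </preformat>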
        <sec id="sec-3-2-1">
          <title>Certainly Fake</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>Probably Fake</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>Probably Real Certainly Real weighted avg support</title>
          <p>precision
recall
f1-score
16
using English Text and Images only, however this did not
seem to afect performance 6.</p>
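        <p>A minimal sketch of the grid search, where train_and_evaluate is a hypothetical helper standing in for our training and validation loop (it is not part of the report's code):</p>
        <preformat>
from itertools import product

grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "max_epochs": [3, 4, 5, 10, 20],
    "warmup_steps": [0, 100],
    "batch_size": [4, 8],
}

def train_and_evaluate(learning_rate, max_epochs, warmup_steps, batch_size):
    """Hypothetical stub: train the fusion model with these settings and
    return the score on the 20% validation split."""
    return 0.0  # replace with actual training and evaluation

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

# The report selects: learning rate 1e-5, 4 epochs, warmup 0, batch size 8.
print(best_config)
        </preformat>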
        <p>Comparing the results obtained on our validation set when using the different groups of features, we eventually choose to use only the translated text together with the images, as adding the Italian text did not appear to provide significant improvements.</p>
        <p>We tackle Task 2 keeping everything as we did for Task 1, switching only the training set.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Task 1</title>
        <p>Table 1 shows the performance of our approach on the first task, reporting per-class precision, recall, f1-score and support for Certainly Fake, Probably Fake, Probably Real and Certainly Real, together with the weighted average. In bold we report the metric that has been used to evaluate our model. The table shows how the class balance in the training set is reflected in the per-class performance on the official test set (measured with the official evaluation script). Indeed, the Certainly Real class is the most numerous in this case too, as well as the one where our model is best performing. It is interesting to notice how the model performs better on the Certainly False class than on the Certainly Real one despite the second being more populated; we speculate this is due to the similarity with the Probably Real class.</p>
        <p>Although we chose a different method for our submission to Task 1, we show that including the Italian text leads to promising results on the official test set. Table 3 shows how this approach performs on the official test set, and indeed it would improve over our submission.</p>
        <p>Unlike adding the Italian text, using the captions does not result in performance improvements. Table 4 shows the performance obtained when adding the captions of the images, generated by CoCa [<xref ref-type="bibr" rid="ref11">11</xref>] and processed with a separate roberta-large model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Task 2</title>
        <p>For Task 2, we chose to keep all parameters as in Task 1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>We have tackled the MULTI-Fake-DetectIVE task, trying to improve performance with textual data augmentation techniques.</p>
      <p>We show that our approach does provide some improvements. This is relevant, as text-based data augmentation is a novel way, made possible only recently by large pretrained models, to exploit the knowledge they encode, and it has several application settings [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
      <p>Moreover, in this report we show how using both Italian and English data at once, even though the English text is a translation of the Italian one, provides significant improvements in Task 1.</p>
      <p>On the contrary, the lower performance of the model in Task 2 underlines how the relations between text and images are not well captured by our model, and this offers the opportunity for further improvements.</p>
      <p>A structural limitation of our approach is that, although we know that the dataset is composed of both tweets and articles, and that the second document type is generally much longer than tweets, we have not experimented with ways to use this longer context.</p>
      <p>This too offers a promising future step: using longer-context transformers when embedding text, while keeping our overall scheme of translating to English, might give further improvements.</p>
      <p>Indeed, given the scarcity of longer-context transformers trained on Italian, the English translation might be useful in this case as well.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the European Union under the
scheme HORIZON-INFRA-2021-DEV-02-01 –
Preparatory phase of new ESFRI research infrastructure projects,
Grant Agreement n.101079043, “SoBigData RI PPP:
SoBigData RI Preparatory Phase Project”</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>A survey of fake news: Fundamental theories, detection methods, and opportunities</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          (
          <year>2020</year>
          ). URL: https:// doi.org/10.1145/3395046. doi:
          <volume>10</volume>
          .1145/3395046.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gambini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          , Tweepfake:
          <article-title>About detecting deepfake tweets</article-title>
          ,
          <source>PLOS ONE 16</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . URL: https://doi. org/10.1371/journal.pone.0251415. doi:
          <volume>10</volume>
          .1371/ journal.pone.
          <volume>0251415</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabbatini</surname>
          </string-name>
          ,
          <article-title>Multi-fake-detective at EVALITA 2023: Overview of the multimodal fake news detection and verification task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <article-title>From natural language processing to neural databases</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1033</fpage>
          -
          <lpage>1039</lpage>
          . URL: https://doi.org/10. 14778/3447689.3447706. doi:
          <volume>10</volume>
          .14778/3447689. 3447706.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khazatsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          , Detectgpt:
          <article-title>Zero-shot machine-generated text detection using probability curvature</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2301</volume>
          .
          <fpage>11305</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Oscar:
          <article-title>Object-semantics aligned pre-training for vision-language tasks</article-title>
          ,
          <source>ECCV</source>
          <year>2020</year>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Making visual representations matter in vision-language models, CVPR 2021 (2021).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] I. Gallo, A. Calefati, S. Nawaz, M. K. Janjua, Image and encoded text fusion for multi-modal classification, Digital Image Computing: Techniques and Applications (DICTA) (2018) 1-7.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748-8763. URL: https://proceedings.mlr.press/v139/radford21a.html.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seyedhosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Coca:
          <article-title>Contrastive captioners are image-text foundation models</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <fpage>2205</fpage>
          .
          <year>01917</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          , BLIP-2
          <article-title>: bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>CoRR abs/2301</source>
          .12597 (
          <year>2023</year>
          ). URL: https://doi.org/ 10.48550/arXiv.2301.12597. doi:
          <volume>10</volume>
          .48550/arXiv. 2301.12597. arXiv:
          <volume>2301</volume>
          .
          <fpage>12597</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Llava-med: Training a large language-andvision assistant for biomedicine in one day</article-title>
          ,
          <source>CoRR abs/2306</source>
          .00890 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.00890. doi:
          <volume>10</volume>
          .48550/arXiv.2306.00890. arXiv:
          <volume>2306</volume>
          .
          <fpage>00890</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <article-title>Parallel data, tools and interfaces in OPUS</article-title>
          ,
          <source>in: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Istanbul, Turkey,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , S. Thottingal,
          <article-title>OPUS-MT - Building open translation services for the World</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conferenec of the European Association for Machine Translation (EAMT)</source>
          , Lisbon, Portugal,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          , C. L. Zitnick, Microsoft COCO
          :
          <article-title>common objects in context</article-title>
          ,
          <source>CoRR abs/1405</source>
          .0312 (
          <year>2014</year>
          ). URL: http: //arxiv.org/abs/1405.0312. arXiv:
          <volume>1405</volume>
          .
          <fpage>0312</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2020. URL: https://openreview.net/forum?id=SyxS0T4tvS.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Mumuni, F. Mumuni, Data augmentation: A comprehensive survey of modern approaches, Array 16 (2022) 100258. URL: https://www.sciencedirect.com/science/article/pii/S2590005622000911. doi:10.1016/j.array.2022.100258.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>