Multimodal Attention is all you need

Marco Saioni¹,*, Cristina Giannone¹,²
¹ University G. Marconi, Rome, IT
² Almawave S.p.A., Via di Casal Boccone, 188-190, 00137 Rome, IT

Abstract
In this paper, we present a multimodal model for classifying fake news. The main peculiarity of the proposed model is its cross-attention mechanism. Cross-attention is an evolution of the attention mechanism that allows the model to examine intermodal relationships to better understand information from different modalities, enabling it to simultaneously focus on the relevant parts of the data extracted from each. We tested the model using the MULTI-Fake-DetectiVE data from EVALITA 2023. The presented model is particularly effective in both classifying fake news and evaluating the intermodal relationship.

Keywords
Transformer, fake news classification, multimodal classification, cross attention

1. Introduction

The Internet has facilitated communication by enabling rapid, immersive information exchanges. However, it is also increasingly used to convey falsehoods, so today, more than ever, the rapid spread of fake news can have severe consequences, from inciting hatred to influencing financial markets or the course of political elections, to endangering world security. For this reason, mitigating the growing spread of fake news on the web has become a significant challenge.

Fake news manifests itself on the internet through text, images, video, audio or, in general, a combination of these modalities, i.e. in a multimodal way. In this article, we consider the two components of a news item, text and image, as it might be proposed, for instance, on a social network. In this work we propose an approach to automatically and promptly identify fake news. We use the dataset of the MULTI-Fake-DetectiVE¹ competition, proposed at EVALITA 2023². The competition aims to evaluate the truthfulness of news that combines text and images, an aim expressed through two tasks: the first carries out the identification of fake news (Multimodal Fake News Detection); the second seeks relationships between the two modalities, text and image, by observing the presence or absence of correlation or mutual implication (Cross-modal relations in Fake and Real News).

Our approach proposes a Transformer-based model that focuses on relating the textual and visual embeddings of the input samples (i.e., the vector representations of the text and images it receives as input). The aim was to find a way to reconcile the two different representation embeddings, which are learned separately from two different corpora (text and images), trying to capture their mutual relationships through some interaction between the respective semantic spaces.

The remainder of the paper is structured as follows: section 2 presents a brief overview of related work, and section 3 describes the architecture of the proposed model. Section 4 gives an overview of our experiments. Sections 5 and 6 present the final results and our conclusions, respectively.

2. Related Works

The Italian MULTI-Fake-DetectiVE competition [2] adds to the various datasets and challenges on multimodal fake news developed recently, for instance Factify [3] and Fakeddit [4]. The creation of these competitions shows the interest in this task. The first task of the Italian challenge saw three completely different systems placed on the podium. The first system, POLITO [5], was based on the FND-CLIP multimodal architecture [6], proposing some ad hoc extensions of CLIP [7] including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. The ExtremITA system [8], which placed second, exploited the capabilities of LLMs, focusing only on the textual component of each news item.
They fine-tuned the open-source LLM Camoscio [9] on the textual part of the dataset. The impressive results show how the textual component plays a primary role in identifying fake news. Despite the significant contribution of the textual component to the task, more and more multimodal approaches are taking hold. In [10] a CNN architecture combining texts and images to classify fake news was proposed. In the same direction, approaches such as CB-Fake [11] incorporate the encoder representations of the BERT model to extract the textual features and combine them with a model that extracts the image features. These features are combined to obtain a richer data representation that helps to determine whether the news is fake or real. Vision-language models have, in general, gained a lot of interest in recent years, in the "large models era", and have been proposed with surprising results in many visual-language interaction tasks [12], [13].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author: marco.saioni@gmail.com (M. Saioni); c.giannone@unimarconi.it (C. Giannone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://sites.google.com/unipi.it/multi-fake-detective
² https://www.evalita.it [1]

3. The proposed Model

The objective was to "engage" specialist models for natural language processing and artificial vision, making them discover and learn bimodal features from text and images collaboratively and harmoniously by applying the teachings of Vaswani et al. [14]: we decided to follow the path indicated by "Attention is all you need", Vaswani et al.'s
very famous paper, following up on the intuition that the Attention mechanism could provide important added value to a multimodal model for the identification of fake news, becoming a Multimodal Attention (hence the title of this article), i.e. an attention mechanism applied between the two modes of a news item, the textual and the visual. In fact, while Attention or Self-Attention (as described in Vaswani et al.'s paper) takes as input the embeddings of a single modality and transforms them into more informative, contextualized embeddings, Multimodal Attention takes as input the embeddings of the two different modalities, combining them and transforming them into a single embedding capable of capturing any existing relationships between the two input modes.

3.1. Architecture

Multimodal Attention is the heart that supports the proposed model, making it capable of exploring the hidden aspects of multimodal communication. As shown at a high level in Figure 1, the architecture of the proposed model consists of a hierarchical structure of three layers preceded by a pre-processing step. In order, there are: a pre-processing step, an input layer, a cross-modal layer and a fusion layer.

[Figure 1: Proposed model architecture.]

We decided to propose a network that models the consistent information between the two modalities, textual and visual, starting from state-of-the-art pre-trained neural networks. In particular, we use a pre-trained BERT [15] model to learn word embeddings from the textual component of the news and a pre-trained ResNet [16] model to learn visual embeddings from the visual component. The two embeddings, belonging to two spaces with different dimensions, are first projected into a uniform, reduced-dimensional space, then related to each other with the strategy of mutual cross-attention to obtain two embeddings that are subsequently concatenated to provide the input of the final dense classification layer.

3.1.1. Pre-processing step

As a first step it is necessary to process the data made available by the organizers of the MULTI-Fake-DetectiVE competition to produce inputs that are compatible with those expected by the pre-trained models. The choices made for the pre-processing of the dataset and the data "personalization" strategy can be summarized in the following three points:

• resolution/explosion of the 1:N relationships between text and images into N separate 1:1 relationships;
• data augmentation, with the creation of an additional image to support the original one already present in each example;
• management of the textual component, which is truncated by BERT, or rather by the relevant tokenizer, to a fixed maximum length of tokens.

Following this processing of the visual and textual components, for each single sample we move from the original pairs <t, v+>, where v+ indicates the 1:N ratio between text in natural language and images in JPEG format, to triples appropriately translated into numbers:

<t_trunc, v, v_aug>

where t_trunc indicates, for each sample, a first-order tensor with 128 values (tokens), while v and v_aug denote third-order tensors with 224 × 224 × 3 values (pixels). The first-order tensor is the representation of the text in numerical form according to the default strategy of the BERT tokenizer, while the third-order tensors are the representation of the images in numerical form according to the RGB coding expected by ResNet.

3.1.2. Input layer

This layer receives as input the previously processed dataset, i.e. the text and the images represented in numerical form, and passes it to the pre-trained BERT and ResNet models to obtain the respective embeddings, which are subsequently projected into a space with small, common dimensions to make them comparable and to allow them to collaborate in the subsequent cross-modal layer.

BERT Encoder. Each sample, pre-processed and represented in numerical form by the tokenizer, is passed as input to the pre-trained BERT model, which returns several output tensors. For the purposes of the classification task of this study, we consider the pooled_output, a compact representation of the whole token sequence given as input to the BERT model, obtained via the special token [CLS]. It is therefore a summary of the information extracted from the entire input, whose dimensions depend on the number of hidden units of BERT. Since each text supplied as input to BERT corresponds to a tensor with 768 real values, using vector notation we have:

e_t = BERT(t_trunc)[pooled_output]

where e_t ∈ R^h is the word embedding vector, t_trunc ∈ R^(N_max) is the token input vector and h = 768 is the BERT hidden size. The equation refers to a single sample but extends to the entire batch of N examples processed by BERT; indicating this batch with T_trunc ∈ R^(N×N_max), we have:

E_t = BERT(T_trunc)[pooled_output]

where E_t ∈ R^(N×h) is the text embedding matrix learned by the BERT model.

ResNet Encoder. The two images of each sample, previously represented in numerical form, are passed as input to the pre-trained ResNet model, which returns for each example a visual embedding of size h_r representing, in a compact and semantic form, the features extracted through convolutions and pooling within the ResNet network. To obtain visual embeddings from a pre-trained neural network like ResNet, we usually take the output of the penultimate layer, i.e. the global pooling. In the proposed model, ResNet50V2 was chosen, whose global pooling reduces the spatial dimensions of the output tensor to 2048 values; each input image therefore corresponds in output to a vector with h_r = 2048 values, which represents the visual embedding extracted from the network for that specific image. Using the same formalism as for the text encoder, we have:

e_v = ResNet(v)[global_pooling]

where e_v ∈ R^(h_r) is the visual embedding vector and v ∈ R^(L×H×C) the input third-order tensor. The equation refers to a single sample but extends to the entire batch of N examples; indicating the batch with V ∈ R^(N×L×H×C), we have:

E_v = ResNet(V)[global_pooling]

where E_v ∈ R^(N×h_r) is the visual embedding matrix learned by the ResNet model. A similar equation holds at batch level for the second (augmented) image:

E_vaug = ResNet(V_aug)[global_pooling]

where E_vaug ∈ R^(N×h_r). By concatenating the two embeddings, i.e. by joining, for each sample, the embedding of the original image and that of its augmented copy, we obtain a single output tensor of size 2 × h_r = 4096:

E_v ⊕ E_vaug = E_concat(v,vaug) ∈ R^(N×2h_r)

From this point on, for simplicity of notation, E_v will refer to E_concat(v,vaug), keeping in mind that this embedding is actually the concatenation of the embedding of an image and that of the copy obtained through random transformations.

Projection. The pre-trained models provide embeddings with different sizes. It is therefore necessary to transform them into a space with the same dimensionality to obtain comparable representations. The projection function carries out this task; it is introduced both to reduce the dimensions of the two embeddings and to reduce the computational load, improving the performance of the multimodal model and allowing it to learn more complex patterns.
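The encoder equations above can be sketched shape-wise in a few lines of NumPy. This is an illustration of the tensor dimensions only: `bert_pooled_output` and `resnet_global_pooling` are hypothetical stand-ins (random values and a fake feature map), not the actual pre-trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4            # batch size
N_MAX = 128      # truncated token length
H = 768          # BERT hidden size (pooled_output)
H_R = 2048       # ResNet50V2 global-pooling size

def bert_pooled_output(token_batch):
    """Stand-in for BERT(T_trunc)[pooled_output]: one h-dimensional
    vector per sample (shapes only, random values)."""
    return rng.standard_normal((token_batch.shape[0], H))

def resnet_global_pooling(image_batch):
    """Stand-in for ResNet(V)[global_pooling]: a fake 7x7x2048
    feature map per image, averaged over the spatial dimensions."""
    n = image_batch.shape[0]
    feature_map = rng.standard_normal((n, 7, 7, H_R))
    return feature_map.mean(axis=(1, 2))          # (n, 2048)

T_trunc = rng.integers(0, 30000, (N, N_MAX))      # tokenized texts
V = rng.random((N, 224, 224, 3))                  # original images
V_aug = rng.random((N, 224, 224, 3))              # augmented images

E_t = bert_pooled_output(T_trunc)                 # (N, 768)
E_v = resnet_global_pooling(V)                    # (N, 2048)
E_vaug = resnet_global_pooling(V_aug)             # (N, 2048)
E_concat = np.concatenate([E_v, E_vaug], axis=1)  # (N, 4096)
```

In the real model the two stand-ins are the pre-trained BERT and ResNet50V2 networks; only the output shapes shown in the comments matter for the layers that follow.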
The projection of embeddings is particularly useful when one wants to compare the semantic representations of two objects, ensuring that both are aligned in the same reduced semantic space, making them comparable in terms of similarity or distance and facilitating the analysis of their relationships. For this model, we selected d_prj = 128 as the projection size, reducing the sizes of both input embeddings.

3.1.3. Cross-modal layer

This layer is the heart of the model and is developed taking inspiration from the behavior of human beings when faced with news made up of text and images. Intuitively, we try to read in the image what is written in the text and to recognize in the text what is shown by the image: it can be said that cross-modal attention relations exist between image and text. This is why, to simulate this human process in a neural model, we relied on cross-attention between the two modalities, a variant of the standard multi-head attention component capable of capturing global dependencies between text and images.

In the proposed model, two blocks of crossed attention are activated, in the text-image and image-text perspectives. In the first case, we consider the textual embeddings as the queries for the multi-head attention block and the visual ones as keys and values. This should allow the characteristics of the text to guide the model to focus on regions of the image semantically coherent with the text: if the textual embeddings are the queries and the visual ones the keys and values, then attention is applied to the images based on their compatibility with the text, which is therefore considered the context against which the relevance of an image is evaluated. In this way, attention is focused on the images with respect to how relevant they are to the text, i.e. we try to give importance to the visual features in relation to the context provided by the text. Conversely, in the second case the visual embeddings are the queries, while the keys and values are the textual embeddings; this should allow the visual features to make the model pay attention to those parts of the text consistent with the images. The same as in the previous case applies, but with the roles of text and image reversed.

Formalizing the bidirectional cross-attention between the embeddings of the text E_t-projected and those of the images E_v-projected, we can write:

E_cross-tv = Attention(E_t-projected, E_v-projected)
E_cross-vt = Attention(E_v-projected, E_t-projected)

where E_cross-tv represents the attention embeddings of the image information with respect to the text, and E_cross-vt the attention embeddings of the text information with respect to the images. In this layer the dimensions of the embeddings are not modified in any way, so we remain in R^(N×128).

3.1.4. Fusion layer

Once the textual and visual embeddings learned unimodally in the network and the cross-attention embeddings learned intermodally are available, it is necessary to implement a fusion strategy that best balances their respective contributions in the multimodal classification task. Although the architecture of the model would seem to suggest a late fusion strategy, note that the cross-attention of the cross-modal layer is already a fusion strategy adopted in the network during learning, before the one explicitly implemented in the subsequent fusion layer: this allows the model to learn shared features during training while maintaining suitable flexibility between the multimodal components, i.e. without excessively influencing the learning process of each modality separately.

The concatenation preserves each modality's distinctive features, allowing the model to exploit them during learning, unlike the sum, which could lead to a loss of information due to values cancelling each other out, reducing the model's descriptive capacity. For these reasons, the fusion takes into consideration all four embeddings learned by the model, E_t-projected, E_v-projected, E_cross-tv and E_cross-vt, where the first two provide distinctive unimodal features, while the other two provide correlated, mutually "attentioned" cross-modal features. The hybrid fusion strategy then completes the recipe, providing the pinch of flexibility necessary to balance the multimodal classifier. Formally, we have the following equation, which aims to make the most of both the information provided by the individual modalities as such and that provided jointly:

E_global = (E_t-projected ⊕ E_v-projected) ⊕ E_cross-tv ⊕ E_cross-vt

where E_global ∈ R^(N×4·d_prj), N is the size of the batch of examples given as input to the network and d_prj = 128.

The final output of the multimodal model is obtained by applying a densely connected layer with C = 4 units and a softmax activation function that returns the probabilities of the four classes. Formally:

Y = E_global · W + b
O = softmax(Y)

with W ∈ R^(4·d_prj×C) and b ∈ R^(1×C), so that O ∈ R^(N×C) is a matrix in which each row is a vector of C = 4 values representing the estimated conditional probability of each class for the relevant sample.

4. Experimental Setup

4.1. Splitting the dataset into training and validation

To guarantee that the proportions of classes and sources are maintained uniformly in the two sets, the 1034 samples of the dataset are randomly divided following an 80%-20% proportion between training and validation, stratified both with respect to the labels, as also happens in the baseline model of the MULTI-Fake-DetectiVE competition, and with respect to the type of source of the news.

4.2. Training and validation

For our experiment, the model was trained for up to 80 epochs with early stopping, using the focal loss [17] function. Focal loss is a dynamically scaled cross-entropy loss in which the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and quickly focuses the model on difficult examples. As optimizer we chose AdamW, given that the models used to analyze text and images were originally pre-trained with this algorithm; it applies weight regularization directly to the model parameters during weight updating, helping to improve the stability and generalization of the model.
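The training objective just described can be sketched as a minimal NumPy implementation of the multi-class focal loss of Lin et al. [17]. The focusing parameter γ = 2 used here is the common default from that paper, not a value restated in this work, so treat it as an assumption.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Multi-class focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t),
    averaged over the batch. `probs` is the (N, C) softmax output,
    `targets` the (N,) integer class labels. gamma=2.0 is the common
    default from Lin et al. (assumed, not stated in this paper)."""
    p_t = probs[np.arange(len(targets)), targets]  # prob. of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

# With gamma = 0 the focal loss reduces to plain cross-entropy:
probs = np.array([[0.90, 0.05, 0.03, 0.02],    # easy, confident sample
                  [0.25, 0.25, 0.25, 0.25]])   # hard, uncertain sample
targets = np.array([0, 1])
ce = focal_loss(probs, targets, gamma=0.0)     # ordinary cross-entropy
fl = focal_loss(probs, targets, gamma=2.0)     # easy sample down-weighted
```

Note how the confident first sample (p_t = 0.9) contributes almost nothing when γ = 2, while the uncertain second one dominates: exactly the "focus on hard examples" behavior described above.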
5. Results

5.1. Official baseline models

In the notebook provided by the MULTI-Fake-DetectiVE organizers there is an evaluation strategy on the official dataset, developed by comparing the performance of unimodal pre-trained models with a multimodal model:

• Text-only model: a model trained only on textual features, extracted with a pre-trained BERT network.
• Image-only model: a model trained only on the visual features of images, extracted with a pre-trained ResNet18 network.
• Multi-modal model: a model trained on the concatenation of text and image features, extracted separately with the two previous unimodal models.

The F1-weighted scores of the three baseline models are shown in Table 1. The textual model is the most effective of the three baseline models in classifying fake news, and the visual one performs worse than the textual model. The multimodal model obtained an F1-weighted score lower than that of the unimodal textual model but higher than that of the unimodal visual model, indicating that the integration of visual and textual information led to an improvement over the visual model, though not enough to outperform the text model. This suggests that there may be room for additional optimizations or modality-integration strategies to achieve better performance from the multimodal model.

Table 1
Model        Accuracy  F1-weighted
Text-only    0.498     0.462
Multi-modal  0.480     0.442
Image-only   0.438     0.371
Summary and comparison of the main metrics for the three baseline models on the official dataset.

5.2. Proposed model

To evaluate the proposed model on the Multimodal Fake News Detection task, we chose to follow the approach used by the organizers in the notebook of the baseline models, i.e. we performed an ablation study on the proposed model: first a unimodal textual model was trained, then a unimodal visual one, then a multimodal one without cross-bi-attention, and finally a multimodal one with cross-bi-attention. Table 2 reports the respective accuracy and F1-weighted values.

Table 2
Model                   Accuracy  F1-weighted
Proposed Multi-modal ⊗  0.541     0.537
Proposed Text-only      0.472     0.469
Proposed Multi-modal ⊕  0.460     0.445
Proposed Image-only     0.418     0.422
Ablation study on the proposed model: accuracy and F1-weighted. The ⊗ symbol indicates cross-bi-attention enabled, while ⊕ indicates cross-bi-attention disabled (i.e. simple concatenation).

The results for the unimodal models and for the multimodal model without cross-bi-attention are in line with those of the corresponding baseline models. What catches the eye, however, are the accuracy and F1-weighted values of the multimodal model with cross-bi-attention. In particular, its F1-weighted score is almost seven percentage points higher than that of the proposed textual unimodal model, more than eleven points higher than that of the visual unimodal model and more than nine points higher than that of the multimodal model without cross-bi-attention.

Let us now compare the accuracy and F1-weighted values of the proposed multimodal model with cross-bi-attention against the finalist models. Its F1-weighted score is two and a half points higher than that of the winning model of the MULTI-Fake-DetectiVE competition, as is evident from Table 3.

Table 3
Model                   Accuracy  F1-weighted
Proposed Multi-modal    0.541     0.537
PoliTo - FND-CLIP-ITA   -         0.512
ExtremITA - Suede_LoRA  -         0.507
Baseline Multi-modal    0.480     0.442
Final comparison between all the analyzed models and the proposed model.

While the data preparation strategy in the pre-processing step provides the model with more information to learn from, the real strength can be identified in the cross-modal layer.
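F1-weighted, the comparison metric in the tables above, averages the per-class F1 scores weighted by class support. A minimal self-contained sketch (equivalent in spirit to scikit-learn's `f1_score(..., average='weighted')`; that the official notebook uses that routine is an assumption):

```python
import numpy as np

def f1_weighted(y_true, y_pred, n_classes=4):
    """Per-class F1 averaged with weights proportional to the number
    of true samples of each class (the class support)."""
    f1s, support = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        support.append(np.sum(y_true == c))
    return float(np.average(f1s, weights=support))

# Toy example with two of the four classes present:
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
score = f1_weighted(y_true, y_pred)  # (2*(2/3) + 2*0.8) / 4 = 11/15
```

Because the MULTI-Fake-DetectiVE classes are unbalanced, this support weighting is what makes F1-weighted a fairer summary than plain accuracy in the comparisons above.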
As supposed and hoped, the mechanism of crossed attention, seen from the two text-image and image-text perspectives and enriched by the skip connection provided by the simple concatenation of the two different embeddings, gives the model the extra edge that allows it to dig into the background of the relationships between textual and visual features. By combining bilateral cross-attention and a residual connection, the tasks of the cross-modal layer and the fusion layer respectively, significant semantic and semiotic interrelations are obtained in favor of the performance of the classifier, which becomes more precise and sensitive.

In fact, if on the one hand the cross-modal layer allows the model to learn multimodal semantics between text and images, on the other the fusion layer enhances it, improving its stability, capacity and performance thanks to the skip connection, which gives the gradient a direct path along which to flow during backpropagation without tending to zero, bringing significant additional information into each layer of the network.

All the results described up to this point were obtained by measuring the model on the Multimodal Fake News Detection task of the competition covered by this work. As mentioned, the organizers also proposed a second task, Cross-modal relations in Fake and Real News, which we used to verify the robustness of the model to a change of task without any human intervention. Table 4 shows the accuracy and F1-weighted values for the proposed model on the Cross-modal relations task, together with the baseline and winner models of the MULTI-Fake-DetectiVE competition.

Table 4
Model                  Accuracy  F1-weighted
Proposed Multi-modal   0.529     0.527
PoliTo - FND-CLIP-ITA  -         0.517
Baseline Multi-modal   -         0.442
Result summary on Task 2.

The results show a clear improvement in performance in solving the task, even compared to the winning model of the competition. This is a very important result, because it demonstrates the network's ability to adapt to changes in tasks and in training data, which is not at all a given.

6. Conclusions

The Internet has facilitated the multimodality of communication by enabling rapid information exchanges that are increasingly immersive but also increasingly used to convey falsehoods. In this study, a multimodal model for identifying fake news was proposed, based on the mechanism of cross attention between the representations of the features learned by the network on the textual component of the news and those learned on the associated visual component.

Many multimodal models are based on the concatenation of features learned from distinct modalities which, despite achieving good performance, limits the potential of the interaction between the features themselves. In the experiments carried out, the use of cross-attention demonstrated significant improvements in the performance of the model proposed in this work compared to the first two models classified in the MULTI-Fake-DetectiVE competition, for both tasks requested by the organizers, despite the dataset available for training being very small and unbalanced both with respect to the categories to be predicted and with respect to the source of the news. Despite the intrinsic complexity of the two tasks, the cross-modal layer of the proposed model manages to express the representations learned from the text and images of a news story in a harmonious, collaborative and synergistic way, balancing their contributions and preventing one from taking over the other.

Future developments concern the components of the model, which could use a Vision Transformer [18] instead of the ResNet, so that the textual and visual embeddings to be related are both generated by training a Transformer network.

References

[1] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473.
[2] A. Bondielli, P. Dell'Oglio, A. Lenci, F. Marcelloni, L. C. Passaro, M. Sabbatini, MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the multimodal fake news detection and verification task, CEUR Workshop Proceedings 3473 (2023). URL: https://ceur-ws.org/Vol-3473/paper32.pdf.
[3] S. Suryavardan, S. Mishra, P. Patwa, M. Chakraborty, A. Rani, A. N. Reganti, A. Chadha, A. Das, A. P. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Factify 2: A multimodal fake news and satire news dataset, in: A. Das, A. P. Sheth, A. Ekbal (Eds.), DE-FACTIFY@AAAI, volume 3555 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: http://dblp.uni-trier.de/db/conf/defactify/defactify2023.html#SuryavardanMPCR23.
[4] K. Nakamura, S. Levy, W. Y. Wang, Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6149-6157. URL: https://aclanthology.org/2020.lrec-1.755.
[5] L. D'Amico, D. Napolitano, L. Vaiani, L. Cagliero, PoliTo at Multi-Fake-DetectiVE: Improving FND-CLIP for multimodal Italian fake news detection, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper35.pdf.
[6] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang, Multimodal fake news detection via CLIP-guided learning, 2022. URL: https://arxiv.org/abs/2205.14304. arXiv:2205.14304.
[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.
[8] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[9] A. Santilli, E. Rodolà, Camoscio: An Italian instruction-tuned LLaMA, 2023. URL: https://arxiv.org/abs/2307.16456. arXiv:2307.16456.
[10] I. Segura-Bedmar, S. Alonso-Bartolome, Multimodal fake news detection, Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/6/284.
[11] B. Palani, S. Elango, V. K., CB-Fake: A multimodal deep learning framework for automatic fake news detection using capsule neural network and BERT, Multimedia Tools and Applications 81 (2022). doi:10.1007/s11042-021-11782-3.
[12] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, J. Xu, B. Xu, J. Li, Y. Dong, M. Ding, J. Tang, CogVLM: Visual expert for pretrained language models, 2024. URL: https://arxiv.org/abs/2311.03079. arXiv:2311.03079.
[13] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, 2024. URL: https://arxiv.org/abs/2310.03744. arXiv:2310.03744.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, 2018. arXiv:1708.02002.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021. arXiv:2010.11929.