<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multimodal Attention is all you need</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marco</forename><surname>Saioni</surname></persName>
							<email>marco.saioni@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University G. Marconi</orgName>
								<address>
									<settlement>Rome</settlement>
									<country>IT</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cristina</forename><surname>Giannone</surname></persName>
							<email>c.giannone@unimarconi.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University G. Marconi</orgName>
								<address>
									<settlement>Rome</settlement>
									<country>IT</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Almawave S.p.A</orgName>
								<address>
									<addrLine>Via di Casal Boccone, 188-190 00137</addrLine>
									<settlement>Rome</settlement>
									<country>IT</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multimodal Attention is all you need</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BC620F3150DE36EE2D74FDB8B87AC47C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Transformer</term>
					<term>fake news classification</term>
					<term>multimodal classification</term>
					<term>cross attention</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we present a multimodal model for classifying fake news. The main peculiarity of the proposed model is its cross-attention mechanism. Cross-attention is an evolution of the attention mechanism that allows the model to examine intermodal relationships to better understand information from different modalities, enabling it to focus simultaneously on the relevant parts of the data extracted from each. We tested the model on the MULTI-Fake-DetectiVE data from EVALITA 2023. The presented model is particularly effective in both the task of classifying fake news and that of evaluating the intermodal relationship.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The Internet has facilitated communication by enabling rapid, immersive information exchanges. However, it is also increasingly used to convey falsehoods, so today, more than ever, the rapid spread of fake news can have severe consequences, from inciting hatred to influencing financial markets or the course of political elections to endangering world security. For this reason, mitigating the growing spread of fake news on the web has become a significant challenge.</p><p>Fake news manifests itself on the internet through text, images, video, audio, or, in general, a combination of these modalities, that is, in a multimodal way. In this article, we consider the two components of a news item, text and image, as it is presented, for instance, on a social network, and we propose an approach to identify fake news automatically and promptly. We use the dataset of the MULTI-Fake-DetectiVE competition (https://sites.google.com/unipi.it/multi-fake-detective), proposed at EVALITA 2023 (https://www.evalita.it) <ref type="bibr" target="#b0">[1]</ref>. The competition aims to evaluate the truthfulness of news that combines text and images, an aim expressed through two tasks: the first identifies fake news (Multimodal Fake News Detection); the second seeks relationships between the text and image modalities by observing the presence or absence of correlation or mutual implication (Cross-modal relations in Fake and Real News).</p><p>Our approach proposes a Transformer-based model that focuses on relating the textual and visual embeddings of the input samples, i.e., the vector representations of the text and images it receives as input.</p><p>The aim was to find a way to reconcile the two representation embeddings, which are learned separately from two different corpora (text and images), trying to capture their mutual relationships through some interaction between the respective semantic spaces.</p><p>The remainder of the paper is structured as follows: section 2 presents a brief overview of related work, and section 3 describes the architecture of the proposed model. Section 4 gives an overview of our experiments. Sections 5 and 6 present the final results and our conclusions, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>The Italian MULTI-Fake-DetectiVE competition <ref type="bibr" target="#b1">[2]</ref> joins the various datasets and challenges on multimodal fake news developed recently, such as Factify <ref type="bibr" target="#b2">[3]</ref> and Fakeddit <ref type="bibr" target="#b4">[4]</ref>. The creation of these competitions shows the interest in this task. The first task of the Italian challenge saw three completely different systems on the podium. The first-placed system, POLITO <ref type="bibr" target="#b5">[5]</ref>, is based on the FND-CLIP multimodal architecture <ref type="bibr" target="#b6">[6]</ref> and proposes some ad hoc extensions of CLIP <ref type="bibr" target="#b7">[7]</ref>, including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. The ExtremITA system <ref type="bibr" target="#b8">[8]</ref>, which placed second, exploited LLM capabilities, focusing only on the textual component of each news item: the authors fine-tuned the open-source LLM Camoscio <ref type="bibr" target="#b9">[9]</ref> on the textual part of the dataset. Its impressive results show how the textual component plays a primary role in identifying fake news. Despite the significant contribution of the textual component to the task, more and more multimodal approaches are taking hold. In <ref type="bibr" target="#b10">[10]</ref>, a CNN architecture combining texts and images was proposed to classify fake news. In that direction, approaches such as CB-FAKE <ref type="bibr" target="#b11">[11]</ref> incorporate the encoder representations from the BERT model to extract the textual features and combine them with a model that extracts the image features. These features are combined to obtain a richer data representation that helps to determine whether the news is fake or real. Vision-language models in general have also gained a lot of interest in recent years, in the "large models era", with surprising results in many visual-language interaction tasks <ref type="bibr" target="#b12">[12]</ref>, <ref type="bibr" target="#b13">[13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The proposed Model</head><p>The objective was to "engage" specialist models for natural language processing and artificial vision, making them discover and learn bimodal features from text and images collaboratively and harmoniously by applying the teachings of Vaswani et al. <ref type="bibr" target="#b14">[14]</ref>. We decided to follow the path indicated by the famous paper "Attention is all you need", building on the intuition that the attention mechanism could provide important added value to a multimodal model for identifying fake news, becoming a Multimodal Attention (hence the title of this article), i.e., an attention mechanism applied between the textual and visual modes of a news item. While attention, or self-attention, as described by Vaswani et al., takes as input the embeddings of a single modality and transforms them into more informative, contextualized embeddings, Multimodal Attention takes as input the embeddings of the two different modalities, combines them, and transforms them into a single embedding capable of capturing any existing relationships between the two input modes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Architecture</head><p>Multimodal Attention is the heart of the proposed model, making it capable of exploring the hidden aspects of multimodal communication. As shown at a high level in Figure <ref type="figure" target="#fig_0">1</ref>, the architecture consists of a hierarchical structure with three layers preceded by a pre-processing step: in order, a pre-processing step, an input layer, a cross-modal layer, and a fusion layer. We decided to propose a network that models the consistent information between the textual and visual modalities starting from state-of-the-art pre-trained neural networks. In particular, we use a pre-trained BERT <ref type="bibr" target="#b15">[15]</ref> model to learn word embeddings from the textual component of the news and a pre-trained ResNet <ref type="bibr" target="#b16">[16]</ref> model to learn visual embeddings from the visual component. The two embeddings, belonging to two spaces with different dimensions, are first projected into a uniform, reduced-dimensional space, then related to each other through mutual cross-attention to obtain two embeddings that are subsequently concatenated to provide the input of the final dense classification layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Pre-processing step</head><p>As a first step, it is necessary to process the data made available by the organizers of the MULTI-Fake-DetectiVE competition to produce inputs that are compatible with those expected by the pre-trained models. The choices made for this pre-processing of the dataset and the data "personalization" strategy are described in the following three points:</p><p>• resolution/explosion of 1 : 𝑁 relationships between text and images into 𝑁 separate 1 : 1 relationships; • data augmentation with the creation of an additional image to support the original one already present in each example; • management of the textual component, truncated by the BERT tokenizer to a fixed maximum length of tokens.</p><p>Following this processing of the visual and textual components, for each single sample we move from the original pairs &lt; 𝑡, 𝑣 + &gt;, where 𝑣 + indicates the 1 : 𝑁 ratio between text in natural language and images in JPEG format, to triples appropriately translated into numbers</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&lt; 𝑡𝑡𝑟𝑢𝑛𝑐, 𝑣, 𝑣𝑎𝑢𝑔 &gt;</head><p>where 𝑡𝑡𝑟𝑢𝑛𝑐 indicates, for each sample, a first-order tensor with 128 values (tokens), while 𝑣 and 𝑣𝑎𝑢𝑔 denote third-order tensors with (224 × 224 × 3) values (pixels).</p><p>The first-order tensor is the representation of the text in numerical form according to the default strategy of the BERT tokenizer, while the third-order tensors are the representations of the images in numerical form according to the RGB coding for ResNet.</p></div>
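The pre-processing described above (1:N explosion plus an augmented copy of each image) can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, and a horizontal flip stands in for whatever random transformations the authors actually apply.

```python
import numpy as np

def explode_pairs(samples):
    """Expand each 1:N (text, [images]) pair into N separate 1:1 (text, image) pairs."""
    exploded = []
    for text, images in samples:
        for img in images:
            exploded.append((text, img))
    return exploded

def augment(image):
    """Create a second, randomly transformed copy of the image
    (a horizontal flip here, as a stand-in for random transforms)."""
    return image[:, ::-1, :]

# Toy example: one news item in a 1:2 relationship with two 224x224x3 images
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
samples = [("breaking news text", [img, img])]

# Each exploded pair becomes a <t_trunc, v, v_aug> triple
triples = [(t, v, augment(v)) for t, v in explode_pairs(samples)]
print(len(triples))  # one triple per original (text, image) pair
```

Tokenization of the text to a fixed 128-token tensor would then be handled by the BERT tokenizer, as described above.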
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Input layer</head><p>This layer receives as input the previously processed dataset, i.e. the text and the images represented in numerical form, and passes it to the pre-trained BERT and ResNet models to obtain the respective embeddings, which are subsequently projected into a space with small, common dimensions to make them comparable and to allow them to collaborate in the subsequent cross-modal layer.</p><p>BERT Encoder Each sample, pre-processed and represented in numerical form by the tokenizer, is passed as input to the pre-trained BERT model, which returns several output tensors. For the purposes of the classification task studied here, we consider the pooled_output, a compact representation of the whole token sequence given as input to the BERT model, obtained via the special token [CLS]. It is therefore a summary of the information extracted from the entire input sequence, whose dimensions depend on the number of hidden units of BERT. Each text supplied as input to BERT will correspond to a tensor with 768 real values, so using vector notation we have:</p><formula xml:id="formula_0">et = BERT(ttrunc)[𝑝𝑜𝑜𝑙𝑒𝑑_𝑜𝑢𝑡𝑝𝑢𝑡]</formula><p>where et ∈ R ℎ is the word embeddings vector, ttrunc ∈ R 𝑁𝑚𝑎𝑥 is the token input vector and ℎ = 768 is the BERT hidden size. The equation shown refers to a single sample but can be extended to the entire batch of 𝑁 examples processed by BERT. Indicating this batch with Ttrunc ∈ R 𝑁 ×𝑁𝑚𝑎𝑥 , we will have:</p><formula xml:id="formula_1">Et = BERT(Ttrunc)[𝑝𝑜𝑜𝑙𝑒𝑑_𝑜𝑢𝑡𝑝𝑢𝑡]</formula><p>where Et ∈ R 𝑁 ×ℎ is the text embedding matrix learned by the BERT model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ResNet Encoder</head><p>The two images of each sample, previously represented in numerical form, are passed as input to the pre-trained ResNet model, which returns for each example a visual embedding of size ℎ𝑟 representing, in a compact and semantic form, the features extracted through convolutions and pooling within the ResNet network. To obtain visual embeddings from a pre-trained neural network like ResNet, we usually take the output of the penultimate layer, i.e. the global pooling. In the proposed model, ResNet50V2 was chosen, whose global pooling reduces the spatial dimensions of the output tensor to 2048 values; therefore each input image corresponds in output to a vector with ℎ𝑟 = 2048 values, which represents the visual embedding extracted from the network for that specific image. After obtaining the embeddings for each of the two images, they are concatenated to obtain a single output tensor of size 2 × ℎ𝑟 = 4096. Using the same formalism as for the text encoder, we have:</p><formula xml:id="formula_2">ev = ResNet(v)[𝑔𝑙𝑜𝑏𝑎𝑙_𝑝𝑜𝑜𝑙𝑖𝑛𝑔]</formula><p>where ev ∈ R ℎ𝑟 is the visual embedding vector and v ∈ R 𝐿×𝐻×𝐶 the input third-order tensor. The indicated equation refers to a single sample but can be extended to the entire batch of 𝑁 examples; indicating the batch with V ∈ R 𝑁 ×𝐿×𝐻×𝐶 , we will have:</p><formula xml:id="formula_3">Ev = ResNet(V)[𝑔𝑙𝑜𝑏𝑎𝑙_𝑝𝑜𝑜𝑙𝑖𝑛𝑔]</formula><p>where Ev ∈ R 𝑁 ×ℎ𝑟 is the visual embedding matrix learned by the ResNet model. An analogous relation holds for the second image, for which, at batch level:</p><formula xml:id="formula_4">Ev aug = ResNet(Vaug)[𝑔𝑙𝑜𝑏𝑎𝑙_𝑝𝑜𝑜𝑙𝑖𝑛𝑔]</formula><p>where Ev aug ∈ R 𝑁 ×ℎ𝑟 . By concatenating the two embeddings, we obtain:</p><formula xml:id="formula_5">Ev ⊕ Ev aug = E concat(v,vaug) ∈ R 𝑁 ×2ℎ𝑟 .</formula><p>From this point on, for simplicity of notation, Ev will refer to E concat(v,vaug) , keeping in mind that this embedding is actually the concatenation of the embeddings of an image and of the one obtained through random transformations.</p><p>Projection The pre-trained models provide embeddings with different sizes. It is, therefore, necessary to transform them into a space with the same dimensionality to obtain comparable representations. This task is carried out by the projection function, introduced both to make the two embeddings comparable and to reduce their dimensions and hence the computational load, improving the performance of the multimodal model and allowing it to learn more complex patterns. The projection of embeddings is particularly useful when one wants to compare the semantic representations of two objects: it ensures that both are aligned in the same reduced semantic space, making them comparable in terms of similarity or distance and facilitating the analysis of their relationships.</p><p>For this model, we selected 𝑑𝑝𝑟𝑗 = 128 as the projection size, reducing the embedding sizes of both input components.</p></div>
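The projection step can be illustrated with a minimal NumPy sketch. Randomly initialized matrices stand in for the learned linear projection layers; the shapes follow the text (h = 768 for BERT, 2 × 2048 = 4096 for the concatenated ResNet embeddings, d_prj = 128).

```python
import numpy as np

rng = np.random.default_rng(0)
N, h, hr2, d_prj = 8, 768, 4096, 128  # batch, BERT size, 2*ResNet size, projection size

Et = rng.standard_normal((N, h))      # pooled BERT text embeddings
Ev = rng.standard_normal((N, hr2))    # concatenated ResNet visual embeddings

# Learnable projection matrices (randomly initialized here for illustration)
Wt = rng.standard_normal((h, d_prj)) / np.sqrt(h)
Wv = rng.standard_normal((hr2, d_prj)) / np.sqrt(hr2)

# Both modalities land in the same reduced 128-dimensional space
Et_proj = Et @ Wt  # (N, 128)
Ev_proj = Ev @ Wv  # (N, 128)
```

After this step the two representations are directly comparable and can feed the cross-modal layer described next.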
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3.">Cross-modal layer</head><p>This layer is the heart of the model and was developed taking inspiration from the behavior of human beings when faced with news made up of text and images. Intuitively, we try to read in the image what is written in the text and to recognize in the text what is shown by the image. It can be said that cross-modal attention relations exist between image and text. This is why, to simulate this human process in a neural model, we relied on cross attention between the two modalities, a variant of the standard multi-head attention component capable of capturing global dependencies between text and images.</p><p>In the proposed model, two blocks of crossed attention are activated, in the text-image and image-text perspectives. In the first case, we consider the textual embeddings as queries for the multi-head attention block and the visual ones as keys and values. This should allow the characteristics of the text to guide the model to focus on regions of the image semantically coherent with the text: if the textual embeddings are the queries and the visual ones the keys and values, then attention is applied to the images based on their compatibility with the text, which is therefore considered the context against which to evaluate the relevance of an image. In this way, attention is focused on the images according to how relevant they are to the text, i.e. we try to give importance to the visual features in relation to the context provided by the text. Conversely, in the second case the visual embeddings are the queries, while the keys and values are the textual embeddings; this should allow the visual features to make the model pay attention to those parts of the text consistent with the images. That is, the same mechanism as in the previous case applies, but with the roles of text and image reversed.</p><p>Formalizing the bidirectional cross-attention between the embeddings of the text E t−projected and those of the images E v−projected , we can write:</p><formula xml:id="formula_6">Ecross−tv = Attention(E t−projected , E v−projected ) Ecross−vt = Attention(E v−projected , E t−projected )</formula><p>where Ecross−tv represents the attention embeddings of the image information with respect to the text and Ecross−vt the attention embeddings of the text information with respect to the images.</p><p>In this layer the dimensions of the embeddings are not modified in any way, so we remain in R 𝑁 ×128 .</p></div>
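The bidirectional cross-attention can be sketched with single-head scaled dot-product attention in NumPy. This is a simplified illustration, not the authors' implementation: the model uses multi-head attention over the projected embeddings, and the sequence lengths below are hypothetical, chosen only to show that queries come from one modality while keys and values come from the other.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention: queries from one
    modality, keys and values from the other."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)  # (N, Lq, Lk) compatibility scores
    return softmax(scores) @ V                     # (N, Lq, d) attended embeddings

rng = np.random.default_rng(0)
N, Lt, Lv, d = 8, 4, 2, 128
Et_proj = rng.standard_normal((N, Lt, d))  # projected text embeddings
Ev_proj = rng.standard_normal((N, Lv, d))  # projected visual embeddings

# Text queries attend over visual keys/values, and vice versa
E_cross_tv = cross_attention(Et_proj, Ev_proj, Ev_proj)  # text-guided attention on images
E_cross_vt = cross_attention(Ev_proj, Et_proj, Et_proj)  # image-guided attention on text
```

Note that the output dimension equals the query dimension d = 128, consistent with the statement that this layer does not change the embedding size.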
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.4.">Fusion layer</head><p>Once the embeddings (textual and visual) learned unimodally in the network and the cross-attention embeddings learned intermodally are available, it is necessary to implement a fusion strategy that best balances their respective contributions to the multimodal classification task. Although the architecture of the model would seem to suggest a late fusion strategy, it should be observed that the cross-attention of the cross-modal layer is already a fusion strategy adopted in the network during learning, before the one explicitly implemented in the subsequent fusion layer: this allows the model to learn shared features during training while maintaining suitable flexibility between the multimodal components, i.e. without excessively influencing the learning process of each modality separately.</p><p>Concatenation preserves each modality's distinctive features, allowing the model to exploit them during learning, unlike summation, which could lose information when values cancel each other out, reducing the model's descriptive capacity. For these reasons, the fusion takes into consideration all four embeddings learned by the model, E t−projected , E v−projected , Ecross−tv, Ecross−vt, where the first two provide distinctive unimodal features, while the other two provide correlated and mutually "attentioned" cross-modal features. The hybrid fusion strategy then completes the recipe, providing the pinch of flexibility necessary to give balance to the multimodal classifier. Formally, we have the following equation, which aims to make the most of both the information provided by the individual modalities as such and that provided jointly:</p><formula xml:id="formula_7">E global = (E t−projected ⊕ E v−projected )⊕ Ecross−tv ⊕ Ecross−vt</formula><p>where E global ∈ R 𝑁 ×4𝑑 𝑝𝑟𝑗 , 𝑁 is the size of the batch of examples given as input to the network and 𝑑𝑝𝑟𝑗 = 128.</p><p>The final output of the multimodal model is obtained by applying a densely connected layer with 𝐶 = 4 units and a softmax activation function that returns the probabilities of the four classes. Formally:</p><formula xml:id="formula_8">Y = (E global W + b) O = softmax(Y) with W ∈ R 4𝑑 𝑝𝑟𝑗 ×𝐶 , b ∈ R 1×𝐶 and therefore O ∈ R 𝑁 ×𝐶</formula><p>is a matrix in which each row is a vector with 𝐶 = 4 values representing the conditional (estimated) probability of each class for the relevant sample.</p></div>
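The fusion and classification equations above can be sketched in NumPy. Random values stand in for the learned embeddings and weights; the shapes follow the formulas (four 128-dimensional embeddings concatenated into a 512-dimensional E_global, then a dense layer with C = 4 units and softmax).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d_prj, C = 8, 128, 4

# Stand-ins for the four embeddings learned by the model
Et_proj    = rng.standard_normal((N, d_prj))
Ev_proj    = rng.standard_normal((N, d_prj))
E_cross_tv = rng.standard_normal((N, d_prj))
E_cross_vt = rng.standard_normal((N, d_prj))

# Hybrid fusion: concatenate unimodal and cross-modal embeddings
E_global = np.concatenate([Et_proj, Ev_proj, E_cross_tv, E_cross_vt], axis=-1)

# Dense classification head returning C = 4 class probabilities per sample
W = rng.standard_normal((4 * d_prj, C)) / np.sqrt(4 * d_prj)
b = np.zeros((1, C))
O = softmax(E_global @ W + b)
print(E_global.shape, O.shape)  # (8, 512) (8, 4)
```

Each row of O sums to 1, as expected for a softmax over the four classes.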
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Split dataset into training and validation</head><p>To guarantee that the proportions of classes and sources are maintained uniformly in the two sets, the 1034 samples of the dataset are randomly divided following an 80%-20% proportion between training and validation, stratified both with respect to the labels, as also happens in the baseline model of the MULTI-Fake-DetectiVE competition, and with respect to the type of source of the news.</p></div>
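A stratified 80%-20% split over a composite (label, source) key can be sketched in pure Python. The toy data and the key function are illustrative assumptions; the idea is simply to split each (label, source) group separately so both proportions are preserved in the two sets.

```python
import random
from collections import defaultdict

def stratified_split(samples, key, train_frac=0.8, seed=0):
    """Split samples 80/20 while preserving the distribution of `key`
    (e.g. a (label, source) pair) in both sets."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    train, valid = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

# Toy dataset of (id, label, source) tuples
data = [(i, i % 2, "tweet" if i % 3 else "article") for i in range(100)]
train, valid = stratified_split(data, key=lambda s: (s[1], s[2]))
print(len(train), len(valid))  # → 80 20
```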
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Training and validation</head><p>For our experiment, the model was trained for up to 80 epochs with early stopping, using the focal loss <ref type="bibr" target="#b17">[17]</ref> function. This is a dynamically scaled cross-entropy loss, in which the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and quickly focuses the model on difficult examples. For the optimizer we chose AdamW, given that the models used to analyze text and images were originally pre-trained with this algorithm, which applies weight regularization directly to the model parameters during the weight update, helping to improve the stability and generalization of the model.</p></div>
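The focal loss described above can be sketched in NumPy. The gamma and alpha values below are the common defaults from Lin et al., not necessarily the values used in this work; the example shows how a confident correct prediction contributes far less to the loss than an uncertain one.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-9):
    """Focal loss: cross entropy scaled by (1 - p_t)^gamma, so the
    contribution of well-classified (easy) examples decays to zero."""
    p_t = np.clip(probs[np.arange(len(targets)), targets], eps, 1.0)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))

# One easy example (correct class at p=0.95) vs one hard example (p=0.30)
probs = np.array([[0.95, 0.02, 0.02, 0.01],
                  [0.30, 0.40, 0.20, 0.10]])
targets = np.array([0, 0])
print(focal_loss(probs, targets))
```

With gamma = 0 and alpha = 1 the formula reduces to the ordinary cross entropy, which makes the "dynamic scaling" interpretation explicit.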
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Official baseline models</head><p>The notebook provided by the MULTI-Fake-DetectiVE organizers contains an evaluation strategy on the official dataset, developed by comparing the performance of the unimodal pre-trained models with a multimodal model (see Table 1).</p><p>The multimodal baseline obtained an F1-weighted score lower than that of the unimodal textual model but higher than that of the unimodal visual model, indicating that the integration of visual and textual information led to an improvement in performance compared to the visual model, but not enough to outperform the text model. This suggests that additional optimizations or modality integration strategies could achieve better performance from the multimodal model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Proposed model</head><p>To evaluate the proposed model on the Multimodal Fake News Detection task, we chose to follow the approach used by the organizers in the notebook of the baseline models, i.e. we performed an ablation study on the proposed model: first a unimodal textual model was trained, then a unimodal visual one, then a multimodal one without cross-bi-attention, and finally a multimodal one with cross-bi-attention. The results for the unimodal and multimodal models without cross-bi-attention are in perfect harmony with those of the corresponding baseline models.</p><p>What stands out, however, are the accuracy and F1-weighted values of the multimodal model with cross-bi-attention. In particular, its F1-weighted score is almost seven percentage points higher than that of the proposed textual unimodal model, more than eleven points higher than the visual unimodal model, and more than nine points higher than the multimodal one without cross-bi-attention. Comparing the accuracy and F1-weighted values of the proposed multimodal model with cross-bi-attention against the finalist models, its F1-weighted score is two and a half points higher than that of the winning model of the MULTI-Fake-DetectiVE competition, as evident from Table <ref type="table" target="#tab_2">3</ref>. If, on the one hand, the cross-modal layer allows the model to learn multimodal semantics between text and images, the fusion layer enhances it by improving its stability, capacity and performance thanks to the skip connection, which provides the gradient with a useful direct path along which to flow during backpropagation without tending to zero, bringing significant additional information into each layer of the network.</p><p>All the results described up to this point are obtained by measuring the model on the Multimodal Fake News Detection task of the competition covered by this work. As mentioned, the organizers also proposed a second task, Cross-modal relations in Fake and Real News, aimed at verifying the robustness of the model to a change of task without any human intervention. The results show a clear improvement in performance in solving this task as well, even compared to the winning model of the competition. This is a very important result, because it demonstrates the network's ability to adapt to changes in tasks and in training data, which is not at all a given.</p><p>While the data preparation strategy in the pre-processing step provides the model with more information to learn from, the real strength can be identified in the cross-modal layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>The Internet has facilitated the multimodality of communication by enabling rapid information exchanges that are increasingly immersive but also increasingly used to convey falsehoods. In this study, a multimodal model for identifying fake news was proposed, based on the mechanism of cross attention between the representations of the features learned by the network on the textual component of the news and those learned on the associated visual component.</p><p>Many multimodal models are based on the concatenation of features learned from distinct modalities which, despite achieving good performance, limits the potential of the interaction between the features themselves.</p><p>In the experiments carried out, the use of cross-attention brought significant improvements in the performance of the model proposed in this work compared to the first two classified models in the MULTI-Fake-DetectiVE competition, for both tasks requested by the organizers, even though the dataset available for training is very small and unbalanced both with respect to the categories to be predicted and with respect to the source of the news. Despite the intrinsic complexity of the two tasks, the cross-modal layer of the proposed model manages to express the representations learned from the text and images of a news story in a harmonious, collaborative and synergistic way, balancing their contributions and preventing one from taking over the other.</p><p>Future developments concern the components of the model, which could use a Vision Transformer <ref type="bibr" target="#b18">[18]</ref> instead of the ResNet, in order to relate textual and visual embeddings both generated by training a Transformer network.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed model architecture.</figDesc><graphic coords="2,333.12,84.19,142.36,247.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Summary and comparison of the main metrics for the three baseline models on the official dataset.</figDesc><table><row><cell>• Text-only model: model trained only on textual features, extracted with a pre-trained BERT network.</cell></row><row><cell>• Image-only model: model trained only on the visual features of images, extracted with a pre-trained ResNet18 network.</cell></row><row><cell>• Multi-modal model: model trained on the concatenation of text and image features, extracted separately with the two previous unimodal models.</cell></row><row><cell>The F1-weighted score values of the three baseline models are shown in Table 1. The textual model is therefore the most effective among the three baseline models in classifying fake news, while the visual one has lower performance than the textual model. The multimodal model obtained an F1-weighted score lower than that obtained by the unimodal textual model.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Ablation study on the proposed model: accuracy and F1-weighted values. The ⊗ symbol indicates cross-bi-attention enabled, while ⊕ indicates cross-bi-attention disabled (i.e. concatenation enabled).</figDesc><table><row><cell>Model</cell><cell cols="2">Accuracy F1-weighted</cell></row><row><cell>Proposed Multi-modal ⊗</cell><cell>0.541</cell><cell>0.537</cell></row><row><cell>Proposed Text-only</cell><cell>0.472</cell><cell>0.469</cell></row><row><cell>Proposed Multi-modal ⊕</cell><cell>0.460</cell><cell>0.445</cell></row><row><cell>Proposed Image-only</cell><cell>0.418</cell><cell>0.422</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Final comparison between all the analyzed models and the proposed model. As supposed and hoped, the mechanism of crossed attention, seen from the two text-image and image-text perspectives and enriched by the skip connection provided by the simple concatenation of the two different embeddings, provides the model with the extra edge that allows it to dig deep into the relationships between textual and visual features. By combining bilateral cross-attention and residual connection, tasks of the cross-modal layer and the fusion layer respectively, significant semantic and semiotic interrelations are obtained in favor of the performance of the classifier, which becomes more precise and sensitive.</figDesc><table><row><cell>Model</cell><cell cols="2">Accuracy F1-weighted</cell></row><row><cell>Proposed Multi-modal</cell><cell>0.541</cell><cell>0.537</cell></row><row><cell>PoliTo -FND-CLIP-ITA</cell><cell>-</cell><cell>0.512</cell></row><row><cell>ExtremITA -Suede_LoRA</cell><cell>-</cell><cell>0.507</cell></row><row><cell>Baseline Multi-modal</cell><cell>0.480</cell><cell>0.442</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>shows the accuracy and F1-weighted values obtained by the proposed model on the Cross-modal relations task, together with the baseline and the winning models of the MULTI-Fake-DetectiVE competition.</figDesc><table><row><cell>Model</cell><cell cols="2">Accuracy F1-weighted</cell></row><row><cell>Proposed Multi-modal</cell><cell>0.529</cell><cell>0.527</cell></row><row><cell>PoliTo -FND-CLIP-ITA</cell><cell>-</cell><cell>0.517</cell></row><row><cell>Baseline Multi-modal</cell><cell>-</cell><cell>0.442</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Result summary on Task 2.</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<ptr target="https://ceur-ws.org/Vol-3473" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Lai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Russo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Sprugnoli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Venturi</surname></persName>
		</editor>
		<meeting>the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)<address><addrLine>Parma, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">September 7th-8th, 2023</date>
			<biblScope unit="volume">3473</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the multimodal fake news detection and verification task</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dell'Oglio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Marcelloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sabbatini</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3473/paper32.pdf" />
	</analytic>
	<monogr>
		<title level="j">CEUR WORKSHOP PROCEEDINGS</title>
		<imprint>
			<biblScope unit="volume">3473</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Factify 2: A multimodal fake news and satire news dataset</title>
		<author>
			<persName><forename type="first">S</forename><surname>Suryavardan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Patwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Reganti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chadha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chinnakotla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ekbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/conf/defactify/defactify2023.html#SuryavardanMPCR23" />
	</analytic>
	<monogr>
		<title level="m">DE-FACTIFY@AAAI</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Sheth</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ekbal</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">3555</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Nakamura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.lrec-1.755" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Béchet</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Blache</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Goggi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Isahara</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Mazo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Twelfth Language Resources and Evaluation Conference, European Language Resources Association<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6149" to="6157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Polito at multi-fake-detective: Improving FND-CLIP for multimodal italian fake news detection</title>
		<author>
			<persName><forename type="first">L</forename><surname>D'Amico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Napolitano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vaiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cagliero</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3473/paper35.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Lai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Russo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Sprugnoli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Venturi</surname></persName>
		</editor>
		<meeting>the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)<address><addrLine>Parma, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">September 7th-8th, 2023</date>
			<biblScope unit="volume">3473</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Multimodal fake news detection via clip-guided learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Ying</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2205.14304" />
		<idno type="arXiv">arXiv:2205.14304</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.00020</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3473/paper13.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Lai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Russo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Sprugnoli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Venturi</surname></persName>
		</editor>
		<meeting>the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)<address><addrLine>Parma, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">September 7th-8th, 2023</date>
			<biblScope unit="volume">3473</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Camoscio: an italian instruction-tuned llama</title>
		<author>
			<persName><forename type="first">A</forename><surname>Santilli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rodolà</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2307.16456" />
		<idno type="arXiv">arXiv:2307.16456</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Multimodal fake news detection</title>
		<author>
			<persName><forename type="first">I</forename><surname>Segura-Bedmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alonso-Bartolome</surname></persName>
		</author>
		<ptr target="https://www.mdpi.com/2078-2489/13/6/284" />
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">CB-Fake: A multimodal deep learning framework for automatic fake news detection using capsule neural network and BERT</title>
		<author>
			<persName><forename type="first">B</forename><surname>Palani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Elango</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename></persName>
		</author>
		<idno type="DOI">10.1007/s11042-021-11782-3</idno>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">81</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">CogVLM: Visual expert for pretrained language models</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2311.03079" />
		<idno type="arXiv">arXiv:2311.03079</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Improved baselines with visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2310.03744" />
		<idno type="arXiv">arXiv:2310.03744</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<title level="m">Attention is all you need</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.90</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Focal loss for dense object detection</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.02002</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
