=Paper=
{{Paper
|id=Vol-3910/aics2024_p67
|storemode=property
|title=The Impact of CLIP Encoders and CNN Architectures on Model Efficiency: A Case Study on Hate Speech Detection
|pdfUrl=https://ceur-ws.org/Vol-3910/aics2024_p67.pdf
|volume=Vol-3910
|authors=Christof Kälin,Ellen Rushe,Stefan Kull,Andrew McCarren
|dblpUrl=https://dblp.org/rec/conf/aics/KalinRKM24
}}
==The Impact of CLIP Encoders and CNN Architectures on Model Efficiency: A Case Study on Hate Speech Detection==
Christof Kälin1,2,*, Ellen Rushe1,*, Stefan Kull2 and Andrew McCarren1
1 School of Computing, Dublin City University, Dublin 9, Ireland
2 Lucerne School of Computer Science and Information Technology, Lucerne University of Applied Sciences and Arts, Rotkreuz, Switzerland
Abstract
In this paper, the impact of model architecture and model size on the overall efficiency of deep vision-language models is explored, focusing on the trade-off between performance and resource utilisation. Specifically, the effectiveness of Contrastive Language-Image Pre-training (CLIP) encoders combined with Convolutional Neural Networks (CNNs) is evaluated for this task. To this end, a study is performed using the Facebook Hateful Memes Dataset (HMD) to train and evaluate the Hate-CLIPper architecture and its variations, alongside smaller-scale extensions using
convolutional layers. Though the evaluation demonstrates that Hate-CLIPper shows the strongest performance,
the reduced versions of Hate-CLIPper’s cross-fusion mechanism see a large decrease in parameters without a
comparably large decrease in performance. This suggests that the addition of parameters to these models may
ultimately lead to diminishing returns in terms of performance. We also show that downscaling a smaller version of
HateCLIPper leads to a larger reduction in performance than its larger-scale counterparts, suggesting a non-linear
relationship between the number of parameters and performance gains. The question of overparameterisation
in Hate-CLIPper is therefore raised, highlighting the importance of balancing model complexity with training
efficiency. Through this case study, this research contributes to the development of more efficient methods for
automatic hateful meme detection which, in extension, can improve content moderation practices and reduce the
spread of online hate speech.
Keywords
Memes and Hate speech detection, Multimodal Data, Contrastive Learning, Convolutional Neural Networks,
Intermediate fusion, Facebook Hateful Memes Dataset
1. Introduction
Large-scale deep models have become ubiquitous in vision and language technologies, with models
typically containing hundreds of millions – if not billions – of parameters. These models tend to be
extremely computationally expensive, impacting both their portability and usability when computational
resources are comparatively scarce. The performance gains achieved by novel architectures tend to be
valued highly in the literature, while the growing size of models comes at a cost. These models have a
significant environmental impact due to the enormous amount of computational resources and energy
needed to train them [1, 2]. Additionally, the increasing trend towards large-scale model development [1]
limits the number of users to those with the most resources, contributing to the de-democratisation
of machine learning development [3]. It is critical to determine the actual gains provided by these
increasingly large models in order to perform a cost-benefit analysis on their use. To this end, in this
paper, we present a case study on a task where large-scale models form the current state-of-the-art:
multi-modal hate speech detection. More specifically, we aim to analyse a large-scale CLIP-based vision-
language model to determine the impact of different size architectures on the training and inference
time, memory requirements and performance for a common hate-speech detection benchmark. We
perform an evaluation across increasingly lightweight architectures on a fixed computational set-up
and report the effects on performance.
AICS’24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 09–10, 2024, Dublin, Ireland
* Corresponding author.
$ christof.kalin2@mail.dcu.ie (C. Kälin); ellen.rushe@dcu.ie (E. Rushe); stefan.kull@hslu.ch (S. Kull);
andrew.mccarren@dcu.ie (A. McCarren)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The remainder of the paper is organised as follows: Section 2 will outline the task of Hate Speech
Detection and existing approaches to address it. Next, we outline the approach we take to evaluating
the effect of model size and architecture on a competitive model applied to this task in Section 3. We
describe the specific details of our experimental procedure in Section 4. Our findings are outlined in
Section 5, and we finish with a discussion of our findings in Section 6.
2. Hateful Meme Detection
We first begin by providing an introduction to hate speech detection and then follow this discussion
with a description of recent approaches to this task. We also motivate the architecture we choose to
replicate and study in our experiments.
2.1. Task Overview
The traditional internet meme typically features a multimodal format: an image with an overlay of
text. This combination often conveys a humorous or positive message, but it can also spread hateful
content [4]. Such hateful messages can be disseminated widely via the internet and social media, leading
to social issues [5]. Manual moderation of this content is not only slow and costly but has also been
reported to be linked to post-traumatic stress disorder-like symptoms among content moderators [6, 5].
Therefore, automating the process of content moderation has been adopted and commercialised with
companies offering API-based solutions supporting various modalities [7, 8]. The complete automation of content moderation has been criticised, however, and a hybrid approach, in which automated systems handle the majority of cases while measures are also implemented to support human moderators, is preferred [9]. Nonetheless, these automated systems currently play a central role in content moderation
due to the vast scale of material uploaded to internet platforms each day.
Figure 1: An example of a meme intended to insult the reader (Image sourced from [10, 11]).
Identifying implicit offensive information in memes presents a unique challenge, as the text and image that comprise the meme may not appear hateful when each is presented in isolation [12]. Figure 1 illustrates
the subtlety of this type of material. Additionally, offensive content can come in a wide variety of forms,
and memes can be deemed offensive based on multiple factors, such as personal attacks, racial abuse, or
attacks on minorities [4]. The following section will describe the different approaches that have been
developed for this complex task.
2.2. Approaches to Hateful Meme Detection
Due to the multi-modal nature of the data and the complex interactions between features of each
modality, approaches have primarily consisted of multiple model components or subtasks. For instance,
Hermida & Santos [4] argued that the text and the image of a meme alone would not suffice as an
input and that data augmentation is necessary, either through object detectors, caption generators or
extracting regions from images to create new ones. Furthermore, they argue that, as architectures are
evolving, advanced techniques, such as ensembles of Transformer-based models, are expected to yield
the best results.
These premises have been evidenced in the winning architectures of the Facebook Hateful Memes
Challenge1, conducted in 2020 using the competition’s Hateful Meme Dataset (HMD) [11], with the
two highest-ranked methods combining external data and subtasks. The first approach [13] enriches
the memes with Google Vision’s Web Entity Detection to capture the images’ context. It includes
race and gender labels created from the FairFace Dataset [14]. After this, it combined VL-BERT [15],
UNITER [16], VILLA [17] and ERNIE-ViL [18] in one ensemble architecture, achieving an Area Under
the Receiver Operating Characteristic (AUROC) of 84.5 [13]. The second-placed approach combined
ERNIE-Vil, UNITER, OSCAR [19] and VisualBert [20], resulting in an AUROC of 82.52 [21].
Further research was conducted after the competition without ensemble approaches, such as two-way feature extraction using images alone as inputs. The extraction is followed by captioning the images with an encoder-decoder model and then feeding this, along with Optical Character Recognition (OCR)-extracted text, into a sentiment analysis model. However, this resulted in a lower AUROC of 56.83
and reported accuracy of 64% [22]. In comparison, employing an approach using the VinVl [23] object
detection model with OSCAR+ and combined with a Random Forest (RF) classifier led to an AUROC of
76.8 and an accuracy of 68.4% [24].
Kumar and Nandakumar [25] took a different approach, applying CLIP’s [26] pre-trained image
and text encoders to the input. Specifically, encoded inputs are sent through projection layers and fine-tuned for the task of hateful meme detection. The projection layers are followed by intermediate fusion with either cross fusion or align fusion. Elements are then flattened and used to train a shallow neural
network (so-called pre-output layers), before using the softmax function to retrieve predictions. The
authors reported an AUROC of 85.8 on the HMD [25]. A disadvantage of the Hate-CLIPper architecture,
however, is that the embeddings of memes have a high similarity despite opposite meanings. In order
to address this, Retrieval-Guided Contrastive Learning (RGCL) was proposed by Mei et al. [27], who
investigated semantically similar memes that are then separated depending on whether both of them
are in the same class (so-called pseudo-gold positive examples) or opposite class (hard negative examples).
These pairs are identified primarily by examining misclassifications in the dataset or instances in which
confounders are present (an example of such confounders can be seen in Figure 2). RGCL raised the
performance to an AUROC of 87.0.
For detecting offensive content, CNNs have also been used as both auto-feature extraction-based
methods and as image and text classifiers [4]. In a unimodal context, Suryawanshi et al. [28] used the
pre-trained CNN VGG-16 [29] to classify memes based on the image only. However, a text classifier,
CNNText, was also trained separately using three convolutional blocks. In a multi-modal context, the
researchers used early fusion. For both image and text feature extraction, long short-term memory
networks (LSTMs), VGG-16 and their CNNText were employed [28]. Another approach proposes VGCN-
BERT [30] to extract textual features and three different CNNs to extract image features: ResNet-50 [31],
ResNet-152 and VGG-16. This was followed by creating multimodal models performing early fusion.
Ultimately, VGCN-BERT + ResNet50 yielded the best results with an AUROC of 81.69 on a dataset
consisting of memes related to Italian political affairs [32].
For the baseline model in the Memotion [33] dataset, GloVe [34] word embeddings were used as an
input to a CNN with 64 filters of size 1x5 to extract textual features [35]. In order to maximize F1 scores
on the Memotion dataset, Alzu’bi et al. [36] used ResNet-152 and ResNet-50 to encode the images and
then used pooling to produce distinct image embeddings.
By prompting the large language model RoBERTa [37] to classify memes, an experimental AUROC of 90.96 (and an accuracy of 84.47%) has been reported. However, the evaluation is based on different datasets, in which the HMD is not included. This approach indirectly incorporates CLIP encoders [38] by using ClipCap [39] to generate captions for the images [40].
1 https://hatefulmemeschallenge.com/
An instruction tuning framework was proposed by Hu et al. [41] to enhance the performance of
visual reasoning tasks in vision-language models (VLMs), achieving an AUROC of 89.2 and an accuracy
of 80.8% on HMD. The proposed framework works in two steps: It generates a program that solves the
query, resulting in a chain-of-thought reasoning process. This process is then fed into fine-tuned VLMs
(e.g., PaLI-X-VPD [42]) along with the visual input and the textual query. In addition to its performance,
another advantage is the capability of explaining why a meme is considered hateful.
Given the complex set of features and interactions necessary for models to extract from hateful memes, the above architectures have proven to be multi-faceted and often large, in particular those using ensembles. In this paper, we seek to determine the extent to which the size of architectures affects the performance of the resulting models. To this end, of the approaches discussed in this section, we have chosen the HateCLIPper architecture as our base model, due to its high performance (with a reported AUROC of 85.8 [25] for their best performing model), open-source codebase2, and end-to-end architecture.
2 https://github.com/gokulkarthik/hateclipper
3. Methodology
In this section, we will describe the base architecture that our proposed approaches build on, and the
specific techniques used to reduce the parameters of the chosen architecture.
3.1. Base model: Hate-CLIPper
Hate-CLIPper [25] is based on CLIP [26], a zero-shot classifier published by OpenAI based on a pre-
training task to predict the captions belonging to images. In order to accomplish this task, the authors
trained image and text encoders on 400 million image-text pairs. CLIP matches the performance of
ResNet-50 on the ImageNet challenge despite not using any of the original training samples. The pre-trained image and text encoders enable the prediction of which caption in the dataset is paired with a given image. They are then used to create a zero-shot classifier, where the candidate classes are converted into captions and CLIP predicts the output class.
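To make this concrete, the following minimal sketch shows zero-shot classification with CLIP via HuggingFace's transformers library; the checkpoint name, placeholder image and candidate captions are illustrative assumptions rather than part of the original CLIP evaluation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                  # placeholder for a real input image
captions = ["a photo of a dog", "a photo of a cat"]   # classes rewritten as captions

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```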
HateCLIPper builds on CLIP by extracting text and image encodings from CLIP’s encoders and
applying intermediate fusion where features are fused in the intermediate layers of the network [25].
Kumar and Nandakumar [25] argue that this approach is the most appropriate as this allows interaction
between the features of each modality, which is important given that intermediate features learned
from text are likely to be related to those learned for images. HateCLIPper implements intermediate
fusion by taking the features output from CLIP’s encoder and modelling the interactions between
the resulting image and text features using a feature interaction matrix (FIM). The authors add this
component to measure the associations between image and text features directly, motivated by the
idea that the similarity between the features of text and image pairs that were associated with each
other in CLIP may show different associations in hateful memes. Specifically, HateCLIPper projects
CLIP features using a set of trainable layers to two vectors p_i and p_t, each of the same dimension n (the underlying CLIP encoders are not fine-tuned). The feature interaction matrix consists of the outer product of these two vectors, p_i ⊗ p_t, leading to an n × n matrix. The use of the outer product operation
here is referred to as cross-fusion. The FIM is then flattened and followed by a number of fully connected
layers to make the final classification, i.e. hateful or non-hateful.
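The cross-fusion step can be sketched as follows in PyTorch; the layer sizes, hidden dimensions and module names are our own illustrative choices rather than the authors' exact implementation (the paper uses n = 1024 with clip-vit-large-patch14, whereas a smaller projection is used here to keep the sketch lightweight).

```python
import torch
import torch.nn as nn

class CrossFusionHead(nn.Module):
    """Sketch of a HateCLIPper-style cross-fusion head: project CLIP features,
    build the feature interaction matrix (FIM) via an outer product, flatten it,
    and classify with fully connected layers."""
    def __init__(self, clip_dim: int = 768, proj_dim: int = 128, hidden_dim: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(clip_dim, proj_dim)    # trainable projection layers
        self.txt_proj = nn.Linear(clip_dim, proj_dim)    # (the CLIP encoders stay frozen)
        self.classifier = nn.Sequential(
            nn.Linear(proj_dim * proj_dim, hidden_dim),  # flattened n x n FIM
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),                    # hateful vs. non-hateful logits
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        p_i = self.img_proj(img_feat)                    # (batch, n)
        p_t = self.txt_proj(txt_feat)                    # (batch, n)
        fim = torch.einsum("bi,bj->bij", p_i, p_t)       # outer product: (batch, n, n)
        return self.classifier(fim.flatten(start_dim=1))

# Example with random CLIP-like features for a batch of 4 memes.
head = CrossFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

The quadratic size of the flattened FIM is what drives the parameter count of the first fully connected layer, which is the motivation for the reductions discussed below.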
One drawback of the FIM is that it increases the size of the model substantially. In terms of com-
putational requirements, powerful hardware is required to train this model, with [25] reporting that
Hate-CLIPper was originally trained on an NVIDIA Tesla A100 GPU with 40 GB GPU RAM, with training
taking approximately 30 minutes. Access to this level of hardware is both expensive and scarce. In
more accessible environments, such as Google Colab, these types of resources are not guaranteed [43].
Due to the increased parameters introduced by the use of cross-fusion, the authors create a variation of Hate-CLIPper that instead uses align-fusion. This variant takes the diagonal of the FIM and builds the classification model on this alone. This reduces the dimensionality from a FIM of size n² to a vector of size n. The authors report that align-fusion produced the best performance in terms of AUROC, while cross-fusion performed best in terms of micro-F1 score.
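For reference, align-fusion can be expressed compactly, since the diagonal of the FIM is simply the element-wise product of the two projected vectors; the snippet below only illustrates this identity.

```python
import torch

n = 8
p_i, p_t = torch.randn(n), torch.randn(n)

fim = torch.outer(p_i, p_t)               # n x n feature interaction matrix (cross-fusion)
align = torch.diagonal(fim)               # align-fusion keeps only the diagonal ...
assert torch.allclose(align, p_i * p_t)   # ... which equals the element-wise product
```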
3.2. Reducing HateCLIPper
In this paper we seek to determine the effects of reducing the size of HateCLIPper. This can be done
using a number of strategies, align-fusion being the strategy proposed by [25]. We evaluate a number of
alternative strategies using both cross-fusion and align-fusion, and compare their training time and
performance to that of the original models. The two primary strategies are (i) dimensionality reduction
of the projection layer by using a reduced CLIP base model and (ii) using shared parameters through
the use of convolutional layers.
(i) Reduced CLIP model: The most obvious first approach is to simply reduce the size of the base
CLIP model used to encode image and text features, leading to a lower dimensional projection layer. In
our experiments, this change reduces the number of parameters by roughly half.
(ii) Convolutional layers: The next approach to parameter reduction is the use of convolutional layers rather than fully connected layers after the FIM is formed. One of the core benefits of CNNs
is parameter sharing, which reduces the number of trainable parameters and allows the addition of
pooling layers after the convolutional layers in order to shrink feature maps [44]. Here we use the same
architecture as HateCLIPper up to and including FIM, however we apply convolutional layers to the FIM
rather than fully connected layers. We experiment with both cross-fusion and align-fusion. Specifically,
when using align-fusion, the convolutional block will apply 1D convolutions to the diagonal of the FIM
(which is a vector of size n). In the cross-fusion variant, a block using 2D convolutions can be applied directly to the n × n FIM, substantially reducing the number of parameters necessary to train.
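A possible sketch of such a convolutional pre-output block applied to the FIM is shown below; the kernel size, channel count and pooling step are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ConvFIMHead(nn.Module):
    """Sketch: treat the n x n FIM as a single-channel image and replace the
    fully connected pre-output layers with a 2D convolution plus pooling."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),  # shared weights across the FIM
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),                           # shrink the feature map to 8 x 8
        )
        self.out = nn.Linear(channels * 8 * 8, 2)              # small linear output layer

    def forward(self, p_i: torch.Tensor, p_t: torch.Tensor) -> torch.Tensor:
        fim = torch.einsum("bi,bj->bij", p_i, p_t).unsqueeze(1)  # (batch, 1, n, n)
        return self.out(self.conv(fim).flatten(start_dim=1))

# Unlike a fully connected head on the flattened FIM, the parameter count here
# does not grow with n (roughly 1,100 parameters in this sketch).
head = ConvFIMHead()
print(sum(p.numel() for p in head.parameters()))
```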
4. Experimental Setup
This section will describe the dataset used for experiments, the specific implementation details and
hyperparameters of all models, and the evaluation strategy employed.
4.1. Dataset
The Facebook Hateful Memes Dataset (HMD) is a widely used benchmark for hateful meme detection
due to the controlled data collection process. During the creation of the HMD, a strict definition of hate speech was employed and annotators were trained for four hours with three pilot runs [11]. This set was then used as part of a competition hosted by Facebook AI (now Meta AI)3. The dataset comes with training, development, and test splits, totalling 10,000 memes. Additionally, so-called “benign confounders” were included: each hateful meme is paired with an alternative meme that is rendered non-hateful by changing either the text or the image (Figure 2).
For all experiments, the HMD was used, downloaded directly from the authors’ original source4, and PyTorch’s Dataset5 class was used to create the datasets for training and inference. When creating the PyTorch Dataset, tensors were generated using the CLIPProcessor to retrieve the pixel values for the images and the CLIPTokenizer to tokenize the text using max-length padding.
3 https://ai.meta.com/blog/hateful-memes-challenge-and-data-set/
4 https://hatefulmemeschallenge.com/
5 https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
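A minimal sketch of such a dataset wrapper is shown below; the jsonl field names follow the HMD format, while the class name, paths and exact preprocessing arguments are illustrative assumptions rather than our exact code.

```python
import json
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import CLIPProcessor, CLIPTokenizer

class HatefulMemesDataset(Dataset):
    """Sketch of a PyTorch Dataset producing CLIP-ready tensors for the HMD."""
    def __init__(self, jsonl_path: str, image_root: str,
                 model_name: str = "openai/clip-vit-large-patch14"):
        with open(jsonl_path) as f:
            self.samples = [json.loads(line) for line in f]
        self.image_root = image_root
        self.processor = CLIPProcessor.from_pretrained(model_name)   # pixel values
        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)   # text tokens

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        sample = self.samples[idx]
        image = Image.open(f"{self.image_root}/{sample['img']}").convert("RGB")
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values[0]
        tokens = self.tokenizer(sample["text"], padding="max_length",
                                truncation=True, return_tensors="pt")
        return {
            "pixel_values": pixel_values,
            "input_ids": tokens.input_ids[0],
            "attention_mask": tokens.attention_mask[0],
            "label": torch.tensor(sample["label"], dtype=torch.float),
        }
```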
Figure 2: Examples of benign confounders turning the meme from Figure 1 into non-hateful ones: (a) benign text confounder, (b) benign image confounder (images sourced from [10, 11]).
4.2. Models
Depending on the model used for the training (large vs. base), the underlying pre-trained CLIP model
was exchanged when instantiating the processor and tokenizer. All pre-trained models were loaded
using HuggingFace’s transformers6 library.
4.2.1. Hate-CLIPper
The Hate-CLIPper re-implementation is based on the hyperparameters outlined in its original paper as well as the code provided by the authors7, with the exception that PyTorch was used instead of PyTorch Lightning initially. Table 1 lists only parameters that differ by model; the following settings are therefore shared by all models: the learning rate and weight decay are both 0.0001, the optimizer is AdamW [45], the loss function is binary cross entropy and the models were trained over 20 epochs. The pre-output layer consists of one fully connected layer. The following dropout [46] is used: 0.2 for the projection layers, 0.4 for the pre-output layer and 0.1 after the ReLU activation in the pre-output layer. The dimensionalities used in Hate-CLIPper are of size n = 1024 for the projected image encodings p_i and text encodings p_t, due to the encoders provided by clip-vit-large-patch14. Using the best-performing cross-fusion variant results in a flattened vector of size n² [25].
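The shared training configuration can be sketched as follows; a toy linear model and random data stand in for the actual architectures and the HMD, so the snippet only illustrates the optimiser, loss and epoch settings listed above.

```python
import torch
from torch import nn

# Shared settings described above: AdamW with a learning rate and weight decay
# of 0.0001, binary cross entropy loss and 20 training epochs.
# A toy model and random data stand in for the real architectures and the HMD.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()   # binary cross entropy on the output logit

features = torch.randn(64, 16)
labels = torch.randint(0, 2, (64,)).float()

for epoch in range(20):
    logits = model(features).squeeze(-1)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```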
4.2.2. Hate-CLIPper reduced
Based on the re-implementation, a reduced model using OpenAI’s smaller base model was trained in
an attempt to reduce the number of parameters. clip-vit-base-patch32 encoders can be used, resulting
in smaller dimensionalities of n = 768. Hate-CLIPper reduced refers to the same architecture as Hate-
CLIPper but using CLIP’s base model and smaller dimensionalities.
To determine the optimal hardware on which to train the models, this architecture was trained on a
TPU v2 and the batch size was increased to 1,024. The training time was approximately 90 minutes.
Training Hate-CLIPper reduced for approximately 5 minutes on an A100 GPU uses 0.98 compute units (1/12 × 11.77 compute units/hour), as opposed to 2.64 compute units for the roughly 90-minute run on the TPU v2 (3/2 × 1.76 compute units/hour). Therefore, all following experiments were conducted on an A100 GPU and using a TPU was deemed unviable for this task.
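The compute-unit figures follow directly from Colab's quoted per-hour rates and the respective training times:

```python
# Colab compute-unit cost at the quoted per-hour rates.
a100_units_per_hour = 11.77
tpu_v2_units_per_hour = 1.76

a100_cost = (5 / 60) * a100_units_per_hour     # ~5 minutes on the A100
tpu_cost = (90 / 60) * tpu_v2_units_per_hour   # ~90 minutes on the TPU v2

print(f"A100:   {a100_cost:.2f} compute units")  # 0.98
print(f"TPU v2: {tpu_cost:.2f} compute units")   # 2.64
```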
6 https://huggingface.co/docs/transformers/main/en/model_doc/clip
7 https://github.com/gokulkarthik/hateclipper
4.2.3. CNN models
Three different variations of CNN-based models were trained, referred to as CNN V1–V3 in Table 1, all of which use CLIP’s smaller base model to encode the inputs, with projections of dimension 768.
1. V1 uses align fusion, resulting in a vector of length 768. This is sent through three one-dimensional convolutional layers, doubling the number of channels each time, each followed by a ReLU activation (a sketch of this block is given after this list). After the convolutional block, the outputs are flattened and sent through a linear output layer to retrieve the logits for the binary classification. No dropout was used aside from 0.2 on the projection layers, as in Hate-CLIPper.
2. The convolutional architecture of V2 is similar to V1. The only difference is the change of the fusion method from align fusion to cross fusion, leading to an input matrix of dimension 768². Therefore, two-dimensional convolutions were used given the matrix input. Due to the squared input size, a reduced batch size of 32 was used.
3. V3 was designed to more closely align with the Hate-CLIPper architecture by reducing the
number of convolutional layers from three to one and reintroducing dropout. When Hate-CLIPper
was trained by the original authors, experiments were conducted with both one and three pre-
output layers, with the one-layer architecture performing slightly better (+0.46% on the AUROC [25]).
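A sketch of the CNN V1 pre-output block described in point 1 above is given below; the kernel sizes and the initial channel count are illustrative assumptions, while the doubling of channels per layer follows the description.

```python
import torch
import torch.nn as nn

class AlignConvHead(nn.Module):
    """Sketch of the CNN V1 pre-output block: three 1D convolutions over the
    align-fused vector, doubling the channels each time with ReLU activations,
    then flattening and a linear output layer for the two-class logits."""
    def __init__(self, n: int = 768, base_channels: int = 4):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(3):
            out_ch = base_channels * (2 ** i)   # e.g. 4 -> 8 -> 16 channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.out = nn.Linear(in_ch * n, 2)

    def forward(self, p_i: torch.Tensor, p_t: torch.Tensor) -> torch.Tensor:
        align = (p_i * p_t).unsqueeze(1)        # align fusion: (batch, 1, n)
        return self.out(self.conv(align).flatten(start_dim=1))

# Random projected features of dimension 768, as produced with the base CLIP encoders.
head = AlignConvHead()
print(head(torch.randn(4, 768), torch.randn(4, 768)).shape)  # torch.Size([4, 2])
```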
4.3. Evaluation
Based on the HMD, human performance was established at an accuracy of 84.7% [11]. AUROC is considered the main metric for the challenge. Because accuracy is easier to interpret and the test set is balanced, it is recommended that accuracy be reported as well [11]. All metrics are calculated with the TorchMetrics8 library using AUROC, Accuracy and BinaryF1Score. The same splits as provided by the HMD were used. For validation during training, the unseen development set (540 memes) was used, while the unseen test set (2,000 memes) was used to evaluate performance. The reported inference time was measured in a CPU-only environment.
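The metric computation can be sketched as follows with TorchMetrics; random scores and labels stand in for real model outputs on the test set.

```python
import torch
from torchmetrics.classification import BinaryAUROC, BinaryAccuracy, BinaryF1Score

# Random scores and labels stand in for model predictions on the 2,000-meme test set.
scores = torch.rand(2000)              # predicted probability of the hateful class
labels = torch.randint(0, 2, (2000,))  # ground truth (0 = non-hateful, 1 = hateful)

auroc = BinaryAUROC()(scores, labels).item()
accuracy = BinaryAccuracy()(scores, labels).item()  # thresholds the scores at 0.5
f1 = BinaryF1Score()(scores, labels).item()

print(f"AUROC: {auroc:.4f}, Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
```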
5. Results
The evaluation of all models is shown in Table 1, while Figure 3 shows the Receiver Operating Characteristic (ROC) curves. We note that the objective of these experiments is not to determine the best model for the task in terms of performance, but to analyse the difference in performance relative to
the parameter reduction. We start by discussing models using cross-fusion and then move on to those
using align-fusion.
5.1. Cross-fusion Models
Hate-CLIPper with Cross-fusion: While the replicated Hate-CLIPper using cross-fusion reached neither the human-level accuracy nor the AUROC of 85.12 reported by Hate-CLIPper’s authors [25] using
the configurations described in Section 4, this model still achieves the highest performance compared to
all other methods applied in this paper. This is expected given the extremely large number of parameters
in HateCLIPper’s cross-fusion model. In the next sections we will describe the effects of parameter
reduction using the mechanisms outlined in Section 3.
Hate-CLIPper Reduced: Using the reduced base CLIP model decreased both the performance (6.20% decrease in terms of AUROC) and the training time (85.51% decrease). An analysis of the confusion matrices for both the seen and unseen test sets shows that most misclassifications are false negatives. To measure how well the positive class 1 (hateful) is detected, the sensitivity (TP / (TP + FN)) is calculated.
8 https://lightning.ai/docs/torchmetrics/stable/
Table 1: Evaluation of the trained models on the HMD

| | Hate-CLIPper | Reduced | CNN V1 | CNN V2 | CNN V3 |
|---|---|---|---|---|---|
| Trained on | A100 GPU | A100 GPU | A100 GPU | A100 GPU | A100 GPU |
| Fusion method | Cross | Cross | Align | Cross | Cross |
| P.o. layer | 1x linear + dropout | 1x linear + dropout | 3x 1D conv | 3x 2D conv | 1x 2D conv + dropout |
| P.o. input dim | 1024² | 768² | 768 | 768² | 768² |
| Parameters | 1'075'580'929 | 453'970'945 | 990'861 | 5'703'561 | 2'164'245 |
| Batch size | 64 | 64 | 64 | 32 | 64 |
| Training time | 39min 0s | 5min 39s | 3min 42s | 7min 23s | 4min 37s |
| Inference time | 37min 57s | 2min 42s | 3min 22s | 3min 10s | 2min 29s |
| AUROC | 82.13 | 77.04 | 65.81 | 72.14 | 72.47 |
| Accuracy | 75.70 | 71.45 | 64.40 | 69.00 | 68.90 |
| F1 | 64.88 | 55.49 | 38.40 | 50.00 | 52.59 |
| Size | 5.6 GB | 2.25 GB | 578.6 MB | 596.5 MB | 255.4 MB |
Figure 3: ROC curves of all models trained as part of this paper.
For Hate-CLIPper it is 59.97 while for Hate-CLIPper Reduced it is 47.47. This highlights that an enormous
reduction in parameters will not necessarily result in a similarly large reduction in performance.
CNN models: The largest CNN-based model, CNN V2, performs comparably to CNN V3, which is just under half the size of V2. Additionally, CNN V3 only sees a 5.93% decrease in AUROC compared to Hate-CLIPper Reduced despite being 99.52% smaller.
The F1 score of CNN V3 was the best of all the CNN-based models. The sensitivity was measured at 46.0, and the confusion matrix is shown in Figure 4. Taken together with the results for HateCLIPper Reduced, these results appear to suggest that there is not a linear relationship between the number of parameters and the performance of the model in terms of AUROC, with there being diminishing returns in terms of performance as the scale of the parameters increases.
Figure 4: Confusion matrix of the CNN V3 model, showing high specificity but low sensitivity.
5.2. Align-fusion Models
Despite the comparably small reduction in performance that the proposed cross-fusion based models provide relative to the reduction in parameters, we note that the authors of Hate-CLIPper suggested an alternative to cross-fusion in order to reduce model parameters: align-fusion, as discussed in Section 3. Remarkably, despite align-fusion reducing the parameters from 1.1 billion (when using cross-fusion) to 5 million for the best performing models, the authors report comparable performance for the best align-fusion model. Taken together with our above results using
cross-fusion, it appears that the mechanism by which the parameter reduction is achieved holds more
weight than the parameter reduction itself. Given the success of align-fusion, it may be a more obvious choice for practitioners using this model than the cross-fusion variants. With this in mind, as discussed in Section 3, we performed an additional experiment in which we applied 1D convolutions to the features resulting from align-fusion. Though the original align-fusion variant was not re-implemented for this paper, its authors report an AUROC of 85.8 and approximately 5 million parameters. CNN V1, using align fusion with just under a million parameters, provides an AUROC of 65.81. This indicates that the performance reductions may align more
closely to the decrease in parameters as the overall scale of the models decreases. This suggests that
there may be a cut-off under which additional parameter reductions have a large effect on performance
for this task.
6. Conclusion
In this paper we have analysed the effects of different mechanisms of parameter reduction for the
hate speech detection model, Hate-CLIPper. We have found that large parameter reductions are not
associated with comparably large reductions in performance for larger scale models. This suggests
that a larger set of parameters for this task may provide diminishing returns. On smaller scale models,
we see a larger effect on performance when reducing the number of parameters. Additionally, when
comparing against the performance of HateCLIPper using align-fusion as described by [25], we see that their model
achieves a higher level of performance compared to the reduced cross-fusion based models proposed
here, suggesting that the mechanism of parameter reduction may influence the performance of the
model more than the parameter reduction itself. This indicates that a more careful analysis of feature
fusion and subsequent operations may be an effective means of predictably downscaling large-scale
models.
References
[1] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury,
M. Zhang, Efficient large language models: A survey, Transactions on Machine Learning Research
(2024). URL: https://openreview.net/forum?id=bsCCJHbO8A, survey Certification.
[2] J.-W. Chung, Y. Gu, I. Jang, L. Meng, N. Bansal, M. Chowdhury, Perseus: Removing energy bloat
from large model training, arXiv preprint arXiv:2312.06902 (2023).
[3] N. Ahmed, M. Wahed, The de-democratization of ai: Deep learning and the compute divide in
artificial intelligence research, arXiv preprint arXiv:2010.15581 (2020).
[4] P. C. d. Q. Hermida, E. M. d. Santos, Detecting hate speech in memes: a review, The Artificial
intelligence review 56 (2023) 12833–12851.
[5] Y. Chen, F. Pan, Multimodal detection of hateful memes by applying a vision-language pre-training
model, PLOS ONE 17 (2022) 1–12. doi:10.1371/journal.pone.0274300.
[6] C. Newton, The trauma floor: The secret lives of Facebook moderators in America, https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona, 2019. Accessed: 2024-10-08.
[7] Checkstep, Checkstep: AI Content Moderation Services Tool Platform, 2024. URL: https://www.checkstep.com/.
[8] Membrace, Membrace - AI Content Moderation and Improvement Tools, 2024. URL: https://membrace.ai/.
[9] T. Gillespie, Content moderation, ai, and the question of scale, Big Data & Society 7 (2020).
doi:10.1177/2053951720943234.
[10] Hateful Memes Challenge and dataset for research on harmful multimodal content, https://ai.meta.com/blog/hateful-memes-challenge-and-data-set/, 2020. Accessed: 2024-10-08.
[11] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful
memes challenge: Detecting hate speech in multimodal memes, Advances in neural information
processing systems 33 (2020) 2611–2624.
[12] L. Shang, C. Youn, Y. Zha, Y. Zhang, D. Wang, Knowmeme: A knowledge-enriched graph neural
network solution to offensive meme detection, in: 2021 IEEE 17th International Conference on
eScience (eScience), 2021, pp. 186–195. doi:10.1109/eScience51609.2021.00029.
[13] R. Zhu, Enhance multimodal transformer with external label and in-domain pretrain: Hateful meme
challenge winning solution, CoRR abs/2012.08290 (2020). URL: https://arxiv.org/abs/2012.08290.
arXiv:2012.08290.
[14] K. Karkkainen, J. Joo, Fairface: Face attribute dataset for balanced race, gender, and age for bias
measurement and mitigation, in: Proceedings of the IEEE/CVF winter conference on applications
of computer vision, 2021, pp. 1548–1558.
[15] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, Vl-bert: Pre-training of generic visual-linguistic
representations, in: International Conference on Learning Representations, 2020. URL: https:
//openreview.net/forum?id=SygXPaEYvH.
[16] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Universal image-text
representation learning, in: European conference on computer vision, Springer, 2020, pp. 104–120.
[17] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, J. Liu, Large-scale adversarial training for vision-and-
language representation learning, Advances in Neural Information Processing Systems 33 (2020)
6616–6628.
[18] F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, H. Wang, Ernie-vil: Knowledge enhanced vision-
language representations through scene graphs, in: Proceedings of the AAAI conference on
artificial intelligence, volume 35, 2021, pp. 3208–3216.
[19] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar:
Object-semantics aligned pre-training for vision-language tasks, in: the 16th European Conference
on Computer Vision, Springer, 2020, pp. 121–137.
[20] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline
for vision and language, arXiv preprint arXiv:1908.03557 (2019).
[21] N. Muennighoff, Vilio: State-of-the-art visio-linguistic models applied to hateful memes, CoRR
abs/2012.07788 (2020). arXiv:2012.07788.
[22] A. Aggarwal, V. Sharma, A. Trivedi, M. Yadav, C. Agrawal, D. Singh, V. Mishra, H. Gritli, Two-way
feature extraction using sequential and multimodal approach for hateful meme classification,
Complexity (New York, N.Y.) 2021 (2021) 1–7.
[23] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, Vinvl: Revisiting visual
representations in vision-language models, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2021, pp. 5579–5588.
[24] Y. Chen, F. Pan, Multimodal detection of hateful memes by applying a vision-language pre-training
model, PloS one 17 (2022) e0274300–e0274300.
[25] G. K. Kumar, K. Nandakumar, Hate-CLIPper: Multimodal hateful meme classification based on
cross-modal interaction of CLIP features, in: L. Biester, D. Demszky, Z. Jin, M. Sachan, J. Tetreault,
S. Wilson, L. Xiao, J. Zhao (Eds.), Proceedings of the Second Workshop on NLP for Positive Impact
(NLP4PI), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
2022, pp. 171–183.
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[27] J. Mei, J. Chen, W. Lin, B. Byrne, M. Tomalin, Improving hateful meme detection through
retrieval-guided contrastive learning, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceed-
ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 5333–5347.
doi:10.18653/v1/2024.acl-long.291.
[28] S. Suryawanshi, B. R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF)
for identifying offensive content in image and text, in: R. Kumar, A. K. Ojha, B. Lahiri, M. Zampieri,
S. Malmasi, V. Murdock, D. Kadar (Eds.), Proceedings of the Second Workshop on Trolling, Ag-
gression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France,
2020, pp. 32–41.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
arXiv preprint arXiv:1409.1556 (2014).
[30] Z. Lu, P. Du, J.-Y. Nie, Vgcn-bert: augmenting bert with graph embedding for text classification,
in: Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020,
Lisbon, Portugal, April 14–17, 2020, Proceedings, Part I 42, Springer, 2020, pp. 369–382.
[31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[32] G.-A. Vlad, G.-E. Zaharia, D.-C. Cercel, M. Dascalu, Upb @ dankmemes: Italian memes analysis
- employing visual models and graph convolutional networks for meme identification and hate
speech detection, in: Proceedings of the Seventh Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian Final Workshop, Accademia University Press, 2020.
[33] C. Sharma, D. Bhageria, W. Scott, S. PYKL, A. Das, T. Chakraborty, V. Pulabaigari, B. Gambäck,
SemEval-2020 task 8: Memotion analysis- the visuo-lingual metaphor!, in: A. Herbelot, X. Zhu,
A. Palmer, N. Schneider, J. May, E. Shutova (Eds.), Proceedings of the Fourteenth Workshop on
Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online),
2020, pp. 759–773. doi:10.18653/v1/2020.semeval-1.99.
[34] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in:
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),
2014, pp. 1532–1543.
[35] C. Sharma, D. Bhageria, W. Scott, S. PYKL, A. Das, T. Chakraborty, V. Pulabaigari, B. Gambäck,
SemEval-2020 task 8: Memotion analysis- the visuo-lingual metaphor!, in: A. Herbelot, X. Zhu,
A. Palmer, N. Schneider, J. May, E. Shutova (Eds.), Proceedings of the Fourteenth Workshop on
Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online),
2020, pp. 759–773. doi:10.18653/v1/2020.semeval-1.99.
[36] A. Alzu’bi, L. Bani Younis, A. Abuarqoub, M. Hammoudeh, Multimodal deep learning with
discriminant descriptors for offensive memes detection, J. Data and Information Quality 15 (2023).
doi:10.1145/3597308.
[37] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[38] R. Cao, R. K.-W. Lee, W.-H. Chong, J. Jiang, Prompting for multimodal hateful meme classification,
in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates, 2022, pp. 321–332. doi:10.18653/v1/2022.emnlp-main.22.
[39] R. Mokady, A. Hertz, A. H. Bermano, Clipcap: Clip prefix for image captioning, arXiv preprint
arXiv:2111.09734 (2021).
[40] R. Mokady, A. Hertz, A. H. Bermano, Clipcap: CLIP prefix for image captioning, CoRR
abs/2111.09734 (2021). arXiv:2111.09734.
[41] Y. Hu, O. Stretcu, C.-T. Lu, K. Viswanathan, K. Hata, E. Luo, R. Krishna, A. Fuxman, Visual
program distillation: Distilling tools and programmatic reasoning into vision-language models, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp.
9590–9601.
[42] Y. Hu, O. Stretcu, C.-T. Lu, K. Viswanathan, K. Hata, E. Luo, R. Krishna, A. Fuxman, Visual
program distillation: Distilling tools and programmatic reasoning into vision-language models, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp.
9590–9601.
[43] Google Colaboratory, Google Colab, 2020. URL: http://colaboratory.google.com.
[44] L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. Santamaría, M. A.
Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: concepts, cnn architectures, challenges,
applications, future directions, Journal of big data 8 (2021) 53–53.
[45] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way
to prevent neural networks from overfitting, The journal of machine learning research 15 (2014)
1929–1958.