=Paper=
{{Paper
|id=Vol-3740/paper-108
|storemode=property
|title=Language-based Mixture of Transformers for EXIST2024
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-108.pdf
|volume=Vol-3740
|authors=Alexandru Petrescu,Ciprian-Octavian Truică,Elena-Simona Apostol
|dblpUrl=https://dblp.org/rec/conf/clef/PetrescuTA24
}}
==Language-based Mixture of Transformers for EXIST2024==
Language-based Mixture of Transformers for EXIST2024
Notebook for the EXIST Lab at CLEF 2024
Alexandru Petrescu1 , Ciprian-Octavian Truică1,* and Elena-Simona Apostol1,2
1
National University of Science and Technology Politehnica University Bucharest, Splaiul Independent, ei 313, Bucures, ti 060042,
Romania
2
Academy of Romanian Scientists, 3 Ilfov, Bucharest, Romania
Abstract
In this paper, we propose o novel method that leverages a Mixture of Transformers (MoT) based on the language
performance of each model. We employ simple, yet effective, preprocessing modules that are connected to
the state-of-the-art Transformer and we compare the performance of general-purpose, task-specific, and data
source-specific flavors of English and multi-language models. This novel approach manages to obtain good results
for all tasks, with the best performance in soft-label evaluations rather than hard-label evaluations. We propose 3
types of mixtures that performed best on training data and we notice that they behave well against unseen data.
The proposed architecture is easily up-gradable, has low resource costs, and provides good overall results in the
EXIST 2024 competition.
Keywords
Mixture of Transformers, Text Classification, Learning with Disagreements, Sexism detection
1. Introduction
The following document serves as the working notes for our submission to EXIST 2024, described in [1]
[2] and, representing the efforts of team Awakened. EXIST is a renowned series of scientific events and
shared tasks focused on the identification of sexism in social networks. The objective of EXIST is to
encompass sexism in its entirety, ranging from overt misogyny to more subtle manifestations involving
implicit sexist behaviors.
For this particular event, we tackled three out of six tasks, focusing on the Natural Language
Processing (NLP) challenges as the other three are similar tasks, but for Computer Vision:
• TASK 1: Sexism Identification - binary classification
• TASK 2: Source Intention - multi-class (4) classification technique, leveraging the outcome of
TASK 1, identifying the nature
• TASK 3: Sexism Categorization - multi-label classification, showing the probability of each
possible outcome
As most of the NLP endeavors focus on the Generative AI domain, we propose an architecture that is
similar to the Mixture of Experts, but instead of using Large Language Models (LLMs) for the purpose of
language generation, we use Language Models (LMs), namely Transformers, with the purpose of solving
the three classification tasks. Transformers have emerged as the leading methodology for text-related
operations, particularly classification problems. We take advantage of the remarkable capabilities
of transformers, making use of industry-trained models facilitated by the Huggingface platform [3].
Furthermore, we fine-tune these models to optimize their performance for our particular task.
Starting with our previous work for last year’s competition [4], in this article, we plan on employing
a mixture of:
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
*
Corresponding author.
$ alex.petrescu@upb.ro (A. Petrescu); ciprian.truica@upb.ro (C. Truică); elena.apostol@upb.ro (E. Apostol)
https://alexpetrescu.net/ (A. Petrescu); https://sites.google.com/view/ciprian-octavian-truica (C. Truică);
https://sites.google.com/view/elena-simona-apostol (E. Apostol)
0000-0002-7731-2403 (A. Petrescu); 0000-0001-7292-4462 (C. Truică); 0000-0001-6397-4951 (E. Apostol)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
• general-purpose transformers
• task-specific: harmful speech detection
• data-source specific: trained on Tweets
This article is structured as follows. In Section 2, we present the current state-of-the-art methods
for harmful contentment detection. In Section 3, we analyze the dataset and present the experimental
setup. In Section 4, we present and discuss our results. Finally, Section 5, we draw the main conclusions
of this work.
2. Related Work
The tasks proposed for this lab aim at mitigating harmful speech, more specifically sexist and offensive
language, from social networks. With our work, we plan to improve the proposed approach in the
previous edition of our team [4], but leverage the idea of the latest AI trend, Generative AI, namely
Mixture of Experts (MoE) [5]. MoE proposes training separately a multitude of models, reducing the
required resources of training a model that combines everything.
The idea of leveraging multiple simple models is not new and has been previously used for this
task successfully [6], both for English and non-English tweets. Another approach that successfully
uses multi-lingual transformers proposes some data augmentation techniques in the preprocessing
and training pipeline [7]. An important hint that English-only embeddings might have good results in
non-English tasks is provided in another working note from the previous edition [8].
Other works focus on language-independent models by training word embeddings [9], transformer
embeddings [4, 10], sentence transformers [11] or document embeddings [12] for detecting online
harmful content. Furthermore, in the current literature novel architectures for detecting harmful
content have been proposed. These novel architectures focus on stacked deep neural networks [13] or
integrating network information into their deep neural architectures [14].
Finally, the current literature also focuses on how harmful content is spread online [15, 16, 17] and
how its effects can be mitigated on social platforms [18, 19, 20].
3. Experiments
3.1. Exploratory Data Analysis
To better understand the task at hand we propose a simple Exploratory Data Analysis (EDA), as we
want to use a mixture of models, based on the language of the tweets. In Table 1, we observe that
the proposed split of train-test 79% − 21% has the same distribution across the languages. With
the balanced distribution of tweets by language 53% − 47%, a mixture involving multi-lingual and
English-only models makes sense and the comparisons of models will provide interesting results.
Table 1
Distribution of Tweets by Language for the Train/Test Split
DatasetSplit Language NumberOfItems Percentage
Train en 3749 47%
Train es 4209 53%
Test en 978 47%
Test es 1098 53%
3.2. Experimental Setup
For our experiments, we propose a mixture of English-only and multi-lingual transformer-based models
(Table 2) as we want to showcase our mixture of models architecture based on the language of the
tweets (Figure 1).
The output module leverages 3 types of mixtures, in terms of output weight, for the best English
and multi-language models. We consider the dominant model the English one, in case the language of
the input is English, otherwise the multi-lingual one. When we present the results we will highlight
the best English model like this and the best multi-lingual model like this . The leveraged mixtures
are:
1. Half-Half
2. Dominant 75%
3. Dominant
Figure 1: System Architecture
Since for this competition, Task 2 and 3 are defined to take advantage of Task 1’s output, our system
does the same and propagates the mixtures. This means that in Tasks 2 and Task 3, for each mixture
type, the corresponding mixture from Task 1 is used.
Table 2
The proposed models for our experiments
ModelName IsMultiLingual
twitter-roberta [21] [22] No
twitter-xlm-roberta-base-sentiment-multilingual [21] Yes
twitter-xlm-roberta-base-sentiment [23] Yes
bert-toxic-comment-classification [24] No
distilbert-uncased-english [25] No
distilbert-base-multilingual-cased-sentiments [26] Yes
MiniLM-L12-H384 [27] No
xlm-roberta [28] Yes
roberta-hate-speech-dynabench-r4 [29] No
For all the tasks, we use early stopping with 3 epochs of tolerance and the following hyper-parameters,
obtained while training the best model strategy:
• 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔_𝑟𝑎𝑡𝑒 = 2𝑒−5
• 𝑝𝑒𝑟_𝑑𝑒𝑣𝑖𝑐𝑒_𝑡𝑟𝑎𝑖𝑛_𝑏𝑎𝑡𝑐ℎ_𝑠𝑖𝑧𝑒 = 32
• 𝑝𝑒𝑟_𝑑𝑒𝑣𝑖𝑐𝑒_𝑒𝑣𝑎𝑙_𝑏𝑎𝑡𝑐ℎ_𝑠𝑖𝑧𝑒 = 32
• 𝑤𝑒𝑖𝑔ℎ𝑡_𝑑𝑒𝑐𝑎𝑦 = 0.01
• 𝑚𝑎𝑥_𝑒𝑝𝑜𝑐ℎ𝑠 = 50
The metrics used in the competition, for which the engine will be optimized, are ICM-Hard, ICM-Hard
Norm, F1, Cross Entropy, Majority class, Minority class, and Oracle most voted. To provide models that
perform well we are using F1 for Tasks 1 and 2 and for Task 3 we are using a custom Mean Squared
Error.
As for the hyper-parameter tuning, each model is optimized as it would be handling the task alone,
for the current implementation.
3.2.1. Task 1
The first task is a binary classification one. The system has to decide whether or not a given tweet
contains sexist expressions or behaviors. The dataset is annotated by multiple evaluators, each providing
their own label. To obtain only one label for each tweet, i.e., ‘YES’ or ‘NO’, we take the majority label.
In Table 3, we observe that the best-performing models for this task are the ones fine-tuned on Twitter
data. As mentioned in the previous section, we are going to use a mixture of the best English-based
model and the best multi-lingual model.
Table 3
General Metrics for the Proposed Models on Task 1
ModelName Epoch F1 Loss Train(s) Eval(s)
twitter-roberta 4 0.7789 0.4863 1463 5
twitter-xlm-roberta-base-sentiment-multilingual 3 0.7665 0.4670 7039 159
twitter-xlm-roberta-base-sentiment 3 0.7482 0.4902 6373 146
bert-toxic-comment-classification 4 0.7463 0.5211 6969 117
distilbert-uncased-english 4 0.7406 0.5348 2918 3
distilbert-base-multilingual-cased-sentiments 4 0.7379 0.5123 4407 3
MiniLM-L12-H384 5 0.7338 0.5059 296 3
xlm-roberta 4 0.7327 0.5520 1834 9
roberta-hate-speech-dynabench-r4 3 0.7126 0.5220 6080 154
We notice that the models are close when it comes to performance, all are in the range of 71% − 79%,
but when it comes to the resources used there is a meaningful difference. The least resources are used by
MiniLM [27], which is a super-pruned version of the regular LMs. The most resource-intensive models
are the ones that are trained in multiple extra iterations over the regular LMs, namely the multi-lingual
tranformers. The base for the multi-lingual transformers is XLM Roberta. Each is further trained on
platform-specific data, i.e., tweets from Twitter, and task-specific data, i.e., harmful speech.
3.2.2. Task 2
The second task is multi-class classification, namely “Source Intention”. Building on top of the first
one, it aims to categorize the message according to the intention of the author. This provides insights
into the role played by social networks in the emission and dissemination of sexist messages. To unify
the results, we use the same approach as for Task 1, a majority vote with equal weight for the labels.
Furthermore, we are augmenting the output by leveraging the output from Task 1, working as a ‘YES’
or ‘NO’ filter that tells us if the model needs to be run on the input or not.
For this problem, we observe that the range of results is wider (Table 4), i.e., 46% − 61%. The models
perform significantly worse than they previously did, but this is expected as the output of Task 1 is also
leveraged.
As expected, the specialized models are outperforming the others. Moreover, the English model,
which is solely focused on the task at hand rather than the data, has the overall best performance
by a small margin. Resource-wise, the behavior is not reflected on the macro level as it was for Task
1. As such, MiniLM, despite training for the most epochs among the smaller models, is not the least
demanding. However, it remains the most efficient model per iteration.
Table 4
General Metrics for the Proposed Models on Task 2
ModelName Epoch F1 Loss Train(s) Eval(s)
twitter-xlm-roberta-base-sentiment 5 0.6090 0.8623 760 4
xlm-roberta 7 0.6064 0.9265 1107 4
roberta-hate-speech-dynabench-r4 5 0.6063 0.8843 515 2
twitter-roberta 5 0.5822 0.8790 518 2
twitter-xlm-roberta-base-sentiment-multilingual 4 0.5525 0.8631 625 4
MiniLM-L12-H384 10 0.5395 0.9431 482 1
distilbert-uncased-english 5 0.5333 0.9226 115 1
bert-toxic-comment-classification 4 0.5021 0.9227 176 2
distilbert-base-multilingual-cased-sentiments 4 0.4657 0.9439 126 1
3.2.3. Task 3
Task 3 is a multi-label classification focusing on identifying different sexism categories for each tweet
that was labeled as sexist by Task 1. Unlike Task 1, tweets have multiple sexist labels. Thus, our
proposed approach computes a probability for each label considering the number of annotations, using
an equal weight for each annotation. As the metrics are custom, the loss is 1/𝑀 𝑒𝑡𝑟𝑖𝑐, and we did not
need to represent it in Table 5. The custom Mean Square Error (CustomMSE) is adapted to the way we
build the probabilities of each class.
We observe an almost perfect mirror of what happened before (Table 5), with the best performing
models being the data and task-specific ones. For the resources side, we notice that this time MiniLM
trained more than twice the number of epochs that the others.
Table 5
General Metrics for the Proposed Models on Task 3
ModelName Epoch CustomMSE Train(s) Eval(s)
twitter-xlm-roberta-base-sentiment-multilingual 7 32.6568 1154 5
twitter-xlm-roberta-base-sentiment 7 32.0696 1089 5
xlm-roberta 7 31.0823 1060 2
twitter-roberta 7 30.2483 664 2
distilbert-uncased-english 7 29.6718 170 2
roberta-hate-speech-dynabench-r4 8 29.5671 826 3
distilbert-base-multilingual-cased-sentiments 7 29.2608 233 2
bert-toxic-comment-classification 8 29.2145 396 3
MiniLM-L12-H384 18 27.8630 862 1
4. Results
Table 6 presents the official results from the leaderboard. For a more comprehensive analysis, please
refer to the Results chapter available on the official site. We are showcasing only the best ranking, as
most of the submissions are one after another in the rankings with a maximum drift of 3 places.
We notice that the best overall mixture is 2, 𝐷𝑂𝑀 𝐼𝑁 𝐴𝑁 𝑇 − 75%, and the least is mixture 1,
𝐻𝐴𝐿𝐹 − 𝐻𝐴𝐿𝐹 . The best-performing outputs are on the English tasks for the soft evaluation rather
Table 6
Ranking in EXIST2024 competition
Task EvalType BestMixture BestRank TotalSystems
1 Soft-Soft-ALL 2 10 40
1 Hard-Hard-ALL 3 20 70
1 Soft-Soft-ES 3 16 40
1 Hard-Hard-ES 3 21 66
1 Soft-Soft-EN 1 5 40
1 Hard-Hard-EN 3 12 68
2 Soft-Soft-ALL 2 9 35
2 Hard-Hard-ALL 2 12 46
2 Soft-Soft-ES 2 11 35
2 Hard-Hard-ES 2 14 46
2 Soft-Soft-EN 2 11 35
2 Hard-Hard-EN 2 7 46
3 Soft-Soft-ALL 2 9 33
3 Hard-Hard-ALL 2 6 34
3 Soft-Soft-ES 2 10 33
3 Hard-Hard-ES 2 9 34
3 Soft-Soft-EN 3 8 33
3 Hard-Hard-EN 2 6 34
than the hard one. One interesting aspect is that we obtained the lowest performance for Task 1, but
the other two that are leveraging its output behave better, with a slight margin. Another interesting
aspect is that Task 3, the one that leverages the custom metric, has the best results out of all tasks, with
consistent placement.
One thing to notice is the difference between the soft and the hard evaluation, for all language splits,
where for Task 1 the difference is quite significant and for the other 2 not that much, considering that
in most cases the outputs of each team were one after another and each team had 3 possible outputs
that can be submitted.
To conclude the Mixture of Transformers provides promising results with good resource requirements,
with the second proposed mixture, 𝐷𝑂𝑀 𝐼𝑁 𝐴𝑁 𝑇 − 75%, performing on average the best, with the
difference in performance between them being not that significant.
5. Conclusions and future directions
We notice a similar performance as we did in the previous iteration, namely in [4], which is slightly
fixed by the Mixture of Transformers:
1. The models yield better results for the soft evaluation, meaning we can adjust the tolerance to
better improve the hard evaluation.
2. The models behave better on the English data, which means that we have to either find better
models specialized in other languages or fine-tune the multi-language ones with more data.
One thing that we did not tackle, but we previously mentioned, is experimenting with the way we
weigh each label, based on the meta-data of the annotator, but the literature has mixed views on this.
Another interesting approach is to consider a dynamic number of Transformers, for each language,
based on performance, as we observe that sometimes the performance is close for multiple models.
Acknowledgment
This work is supported in part by
• The German Academic Exchange Service (DAAD) through the project “iTracing: Automatic
Misinformation Fact-Checking” (DAAD grant no. 91809005).
• The Academy of Romanian Scientists through the funding of project “SCAN-NEWS: Smart system
for deteCting And mitigatiNg misinformation and fake news in social media” (AOS, R-TEAMS-III).
References
[1] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
cation and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilin-
guality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of
the CLEF Association (CLEF 2024), 2024.
[2] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
cation and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli,
N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference
and Labs of the Evaluation Forum, 2024.
[3] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger,
M. Drame, Q. Lhoest, A. Rush, Huggingface transformers: State-of-the-art natural language
processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Association for Computational Linguistics, 2020, pp. 38–45.
doi:10.18653/v1/2020.emnlp-demos.6.
[4] A. Petrescu, Leveraging MiniLMv2 Pipelines for EXIST2023, in: Working Notes of the Conference
and Labs of the Evaluation Forum (CLEF 2023), volume 3497 of CEUR Workshop Proceedings,
CEUR-WS.org, 2023, pp. 1037–1043.
[5] O. Sanseviero, L. Tunstall, P. Schmid, S. Mangrulkar, Y. Belkada, P. Cuenca, Mixture of experts
explained, 2023. URL: https://huggingface.co/blog/moe.
[6] C. Jhakal, K. Singal, M. Suri, D. Chaudhary, B. Kumar, I. Gorton, Detection of sexism on social
media with multiple simple transformers, in: Working Notes of the Conference and Labs of the
Evaluation Forum (CLEF 2023), volume 3497 of CEUR Workshop Proceedings, 2023, pp. 959–966.
[7] H. Mohammadi, A. Giachanou, A. Bagheri, Towards robust online sexism detection: A multi-
model approach with bert, xlm-roberta, and distilbert for EXIST 2023 tasks, in: Working Notes of
the Conference and Labs of the Evaluation Forum (CLEF 2023), volume 3497 of CEUR Workshop
Proceedings, 2023, pp. 1000–1011.
[8] A. Sanchez-Urbina, H. Gómez-Adorno, G. Bel-Enguix, V. Rodríguez-Figueroa, A. Monge-Barrera,
Iimasgil_nlp@exist2023: Unveiling sexism on twitter with fine-tuned transformers, in: Working
Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), volume 3497 of CEUR
Workshop Proceedings, CEUR-WS.org, 2023, pp. 1067–1082.
[9] V.-I. Ilie, C.-O. Truică, E.-S. Apostol, A. Paschke, Context-Aware Misinformation Detection: A
Benchmark of Deep Learning Architectures Using Word Embeddings, IEEE Access 9 (2021)
162122–162146. doi:10.1109/access.2021.3132502.
[10] C.-O. Truică, E.-S. Apostol, MisRoBÆRTa: Transformers versus Misinformation, Mathematics 10
(2022) 1–25(569). doi:10.3390/math10040569.
[11] C.-O. Truică, E.-S. Apostol, A. Paschke, Awakened at CheckThat! 2022: fake news detection
using BiLSTM and sentence transformer, in: Working Notes of the Conference and Labs of the
Evaluation Forum (CLEF2022), 2022, pp. 749–757.
[12] C.-O. Truică, E.-S. Apostol, It’s All in the Embedding! Fake News Detection Using Document
Embeddings, Mathematics 11 (2023) 508. doi:10.3390/math11030508.
[13] E.-S. Apostol, C.-O. Truică, A. Paschke, Contcommrtd: A distributed content-based misinformation-
aware community detection system for real-time disaster reporting, IEEE Transactions on Knowl-
edge and Data Engineering (2024) 1–12. doi:10.1109/tkde.2024.3417232.
[14] C.-O. Truică, E.-S. Apostol, P. Karras, DANES: Deep Neural Network Ensemble Architecture for
Social and Textual Context-aware Fake News Detection, Knowledge-Based Systems 294 (2024)
1–13(111715). doi:10.1016/j.knosys.2024.111715.
[15] A. Petrescu, C.-O. Truică, E.-S. Apostol, Sentiment Analysis of Events in Social Media, in: 2019
IEEE 15th International Conference on Intelligent Computer Communication and Processing
(ICCP), IEEE, 2019, pp. 143–149. doi:10.1109/iccp48234.2019.8959677.
[16] A. Petrescu, C.-O. Truică, E.-S. Apostol, A. Paschke, EDSA-Ensemble: an Event Detection Sentiment
Analysis Ensemble Architecture, 2023. arXiv:2301.12805.
[17] C.-O. Truică, E.-S. Apostol, T. S, tefu, P. Karras, A Deep Learning Architecture for Audience Interest
Prediction of News Topic on Social Media, in: International Conference on Extending Database
Technology (EDBT2021), 2021, pp. 588–599. doi:10.5441/002/EDBT.2021.69.
[18] A. Petrescu, C.-O. Truică, E.-S. Apostol, P. Karras, Sparse Shield: Social Network Immunization vs.
Harmful Speech, in: ACM International Conference on Information and Knowledge Management
(CIKM2021), ACM, 2021, pp. 1426–1436. doi:10.1145/3459637.3482481.
[19] C.-O. Truică, E.-S. Apostol, R.-C. Nicolescu, P. Karras, MCWDST: a Minimum-Cost Weighted
Directed Spanning Tree Algorithm for Real-Time Fake News Mitigation in Social Media, IEEE
Access 11 (2023) 125861–25873. doi:10.1109/ACCESS.2023.3331220.
[20] E.-S. Apostol, Özgur Coban, C.-O. Truică, Contain: A community-based algorithm for network im-
munization, Engineering Science and Technology, an International Journal 55 (2024) 1–10(101728).
doi:10.1016/j.jestch.2024.101728.
[21] J. Camacho-Collados, K. Rezaee, T. Riahi, A. Ushio, D. Loureiro, D. Antypas, J. Boisson, L. Es-
pinosa Anke, F. Liu, E. Martínez Cámara, Tweetnlp: Cutting-edge natural language processing for
social media, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Association for Computational Linguistics, 2022, pp. 38–49.
doi:10.18653/v1/2022.emnlp-demos.5.
[22] D. Loureiro, F. Barbieri, L. Neves, L. Espinosa Anke, J. Camacho-collados, TimeLMs: Diachronic
language models from Twitter, in: Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, Association for Computational Linguistics,
Dublin, Ireland, 2022, pp. 251–260. doi:10.18653/v1/2022.acl-demo.25.
[23] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in
Twitter for sentiment analysis and beyond, in: Proceedings of the Thirteenth Language Resources
and Evaluation Conference, European Language Resources Association, Marseille, France, 2022,
pp. 258–266.
[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, 2019. doi:10.18653/v1/n19-1423.
[25] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, 2020. arXiv:1910.01108.
[26] M. Laurer, W. van Atteveldt, A. Casas, K. Welbers, Less annotating, more classifying: Addressing
the data scarcity issue of supervised machine learning with deep transfer learning and bert-nli,
Political Analysis 32 (2024) 84–100. doi:10.1017/pan.2023.20.
[27] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, Minilm: Deep self-attention distillation for
task-agnostic compression of pre-trained transformers, 2020. arXiv:2002.10957.
[28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL,
2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.
[29] B. Vidgen, T. Thrush, Z. Waseem, D. Kiela, Learning from the worst: Dynamically generated
datasets to improve online hate detection, in: Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), ACL, 2021, pp. 1667–1682. doi:10.18653/v1/
2021.acl-long.132.