Prompt Engineering for Identifying Sexism using GPT Mistral 7B
Notebook for the EXIST Lab at CLEF 2024

Marco Siino 1,*, Ilenia Tinnirello 2
1 University of Catania, Piazza Università 2, Catania, 95131, Italy
2 University of Palermo, Piazza Marina 61, Palermo, 90133, Italy

Abstract
EXIST is an ongoing series of scientific events and shared tasks dedicated to identifying sexism in social networks. The goal of EXIST - in this case hosted at CLEF 2024 - is to cover the full spectrum of sexist expressions, ranging from overt misogyny to more subtle forms of implicit sexist behaviour. The first task is a binary classification: systems must determine whether a particular tweet contains sexist statements or actions. In this paper, we discuss the application of a Mistral 7B model to address the task in the hard-labelling setup for English and Spanish. Our approach leverages a Mistral 7B model along with a few-shot learning strategy and prompt engineering. On the English test set, our best run achieved an F1 of 0.56; on the Spanish test set, it achieved an F1 of 0.51; in the global ranking, it obtained an F1 of 0.53. Our approach outperforms some of the baselines provided for the competition as well as other LLM-based approaches.

Keywords
GPT, sexism, Mistral 7B, LLM, prompt engineering

1. Introduction
In recent years, Natural Language Processing (NLP) has been reshaped by Generative Pre-trained Transformer (GPT) models [1, 2], which can handle text across a wide range of applications. EXIST is a series of scientific events and shared tasks dedicated to identifying sexism in social networks. It aims to address sexism comprehensively, encompassing explicit misogyny as well as more subtle expressions of implicit sexist behaviour (EXIST 2021, EXIST 2022, EXIST 2023). The fourth edition of the EXIST shared task takes place as a Lab hosted at CLEF 2024.
Social networks serve as major platforms for social complaints, activism, and movements such as #MeToo, #8M, and #Time'sUp, which have rapidly gained traction. Thanks to these platforms, many women throughout the world have been able to report sexist incidents in real life, including violence and discrimination. Social media platforms, however, also aid in the propagation of sexism and other rude, abusive, and offensive behaviours. In this situation, automated methods can be quite helpful in identifying and raising awareness of sexist discourses and behaviours. These methods can also be used to determine the most prevalent types of sexism, assess the frequency of abusive and sexist situations on social media platforms, and understand the ways in which sexism manifests itself in these media.
This lab supports the creation of sexism detection applications. The activities in the latest edition are also centred on visuals, namely memes, whereas the previous three editions were solely concerned with identifying and categorizing sexist text messages. Memes, which are usually funny pictures that go viral on the Internet and social media, are now included to encompass a wider range of sexist expressions, particularly those that pass for comedy. Consequently, the development of automated multimodal technologies that can identify sexism in text and memes is imperative.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
marco.siino@unipa.it (M. Siino); ilenia.tinnirello@unipa.it (I. Tinnirello)
https://github.com/marco-siino (M. Siino)
ORCID: 0000-0002-4453-5352 (M. Siino); 0000-0002-1305-0248 (I. Tinnirello)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Detecting sexist content online is an increasingly complex challenge that requires the development of automated tools for data extraction and categorization. These tools are essential for addressing both established and emerging societal concerns. Recent advancements in machine learning and deep learning architectures in almost every field [3, 4, 5] have also driven a surge of interest in natural language processing (NLP) techniques. Capitalizing on this momentum in NLP research, numerous text classification strategies have been proposed in the literature to automate the identification and categorization of online textual content. Over the last fifteen years, some of the most successful strategies have been based on SVMs [6, 7], Convolutional Neural Networks (CNNs) [8, 9], Graph Neural Networks (GNNs) [10], ensemble models [11, 12] and, recently, Transformers [1, 13, 14, 15].
The many approaches recently presented at SemEval 2024 - usually held before the CLEF conference - have further pushed the growing use of large language model (LLM)-based architectures in academic research. At SemEval, LLM-based systems take on a variety of tasks and deliver noteworthy results. For example, T5 is applied to the problem of determining the inference relation between plain language statements and Clinical Trial Reports [16] in Task 2 [17]. In Task 10, Emotion Recognition in Conversation (ERC) is performed on Hindi-English code-mixed conversations using a Mistral 7B model [18]. Furthermore, a DistilBERT model is used in Task 8 [19] to recognize machine-generated text [20]. Inspired by the results of this last work, we decided to employ a Mistral 7B model to face the EXIST 2024 binary classification task (i.e., Task 1).
EXIST 2024 [21, 22] builds on the previous editions of the same series [23, 24]. The first task is binary classification: systems must determine whether a particular tweet includes sexist language or actions (that is, whether it is sexist in and of itself, portrays a sexist scenario, or disparages a sexist action). We propose a Transformer-based strategy that uses Mistral 7B to tackle the problem in both English and Spanish [25]. We applied the model in a specific few-shot manner that is described in the remainder of this paper. In particular, we provided samples from the English and Spanish training sets. We chose Mistral 7B because its comparative evaluation against two other state-of-the-art models - Llama 2 and Llama 1 - shows significant improvements on common natural language processing tasks. Mistral 7B consistently performs better than Llama 2, a well-known open 13B model, on several benchmark tests. Additionally, as reported in its introductory paper, Mistral 7B performs better than Llama 1, a state-of-the-art 34B model, not just equalling but surpassing its results in areas related to logic, maths, and coding.
The rest of the paper is organized as follows. In Section 2, we provide some background on Task 1 hosted at EXIST 2024. An explanation of the employed technique is provided in Section 3. In Section 4, we detail the experimental configuration used to replicate our findings.
Section 5 presents the official task results together with some discussion. In Section 6, we offer our conclusions and recommendations for further study. We make all the code publicly available and reusable on GitHub.

2. Background
This section provides background information on Task 1, held at EXIST 2024. The task challenges participants to develop models that can detect sexism in tweets from Twitter. It seeks the creation of both multilingual and monolingual models that address sexism detection as a binary classification problem: given a source content, these models need to identify whether some form of sexism is present within it. For our submission, we only addressed Task 1 (the binary classification task), where we were asked to detect whether or not a tweet contained sexist content. An example from the official task description is shown in Figure 1.
Finally, the task organizers requested the submission of a JSON file. In our case, we submitted two runs. One field of the JSON file reports the ID of the test sample considered; a second field reports the label (i.e., YES or NO), as illustrated in the sketch below.

Figure 1: A sample from the task description page. The output of the model for the task has to be either YES or NO.
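As an illustration, the following minimal Python sketch shows how such a submission file could be assembled. The field names ("id" and "value"), the sample IDs, and the run file name are our assumptions: the task description only specifies that one field holds the test-sample ID and the other the YES/NO label.

import json

# Hypothetical predictions keyed by test-sample ID (the IDs are made up here).
predictions = {"500001": "YES", "500002": "NO"}

# Field names "id" and "value" are assumed; the paper only states that the
# JSON file contains the sample ID in one field and the label in another.
run = [{"id": sid, "value": label} for sid, label in predictions.items()]

with open("mc-mistral_1.json", "w", encoding="utf-8") as f:
    json.dump(run, f, ensure_ascii=False, indent=2)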
3. System Overview
While it has been established that Transformers may not always be the optimal choice for text classification tasks [26], the efficacy of various strategies, such as domain-specific fine-tuning [27, 28] and data augmentation [29, 30], depends on the specific objectives. The increasing adoption of Transformer-based architectures in academic research has also been bolstered by various methodologies showcased at SemEval 2024. These methodologies tackle diverse tasks and yield noteworthy findings. For instance, in Task 2 [17], T5 is used to address the challenge of identifying the inference relation between a plain language statement and Clinical Trial Reports [16]; in Task 4 [31], a Mistral 7B model is employed to detect persuasion techniques in memes [32]; and in Task 8 [19], a DistilBERT model is utilized to identify machine-generated text [20].

Figure 2: The system overview of our proposed approach in the case of the Spanish dataset (e.g., "mujer estúpida" is translated to "stupid woman"). Given a set of Spanish samples from the train set, they are translated to English using Deep Translator. Then they are all provided as input - few-shot samples from the training set, together with a prompt question - to Mistral 7B. Following these few-shot samples and the question as input, there is one sample from the test set for which the prediction has to be provided.

We employ Mistral 7B in our few-shot approach [33]. With seven billion parameters, Mistral 7B is a language model designed to be highly performant. Mistral 7B outperforms the leading open 13B model (Llama 2) on every benchmark that has been evaluated. Moreover, it outperforms the leading 34B model (Llama 1) in tasks pertaining to logic, maths, and code generation. The model leverages sliding window attention (SWA) to efficiently analyze sequences of different lengths while reducing inference costs, in addition to grouped-query attention (GQA) to accelerate inference. Furthermore, Mistral 7B - Instruct, a refined version designed for following instructions, performs better than the Llama 2 13B chat model in both automatic and human assessments. Mistral 7B Instruct's release highlights how easily the base model can be adapted to provide considerable performance gains.
For our task, in the case of the Spanish language, before prompting the model with the current sample from the test set we made online, real-time use of Google Translator from the deep_translator library. We then randomly selected 10 samples from the provided labelled training set and formatted the samples in each set in the following way:

Tweet1 // NO
Tweet2 // YES
...
TweetN // NO

After merging the formatted samples from the training set, we fed the model by appending the current unlabelled sample from the official test set to the few-shot samples. At this point, the full text containing the few-shot samples plus the sample to be classified was provided as the prompt to Mistral. The question provided as a prompt to the model was: "[INST] Is the following TWEET sexist, in any form, or does it describe situations in which such discrimination occurs (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behaviour)? Reply only with YES or NO.". The context preceding this question consists of the few-shot samples described above. For all the samples from the test set, the model correctly produced one of the requested responses (i.e., YES or NO).
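To make the procedure concrete, here is a minimal Python sketch of the prompt construction just described. The function name, the exact placement of the test tweet, and the closing [/INST] tag are our assumptions; the translation call uses the real deep_translator API.

import random
from deep_translator import GoogleTranslator  # pip install deep-translator

def build_prompt(train_samples, test_tweet, source_lang="es"):
    # train_samples: list of (tweet_text, label) pairs, label in {"YES", "NO"}.
    # For English input, the translation step is skipped.
    def to_en(text):
        if source_lang == "en":
            return text
        return GoogleTranslator(source=source_lang, target="en").translate(text)

    # Few-shot block in the "Tweet // LABEL" format described above.
    shots = random.sample(train_samples, k=10)
    context = "\n".join(f"{to_en(tweet)} // {label}" for tweet, label in shots)

    question = ("[INST] Is the following TWEET sexist, in any form, or does it "
                "describe situations in which such discrimination occurs (i.e., "
                "it is sexist itself, describes a sexist situation or criticizes "
                "a sexist behaviour)? Reply only with YES or NO.")

    # Placement of the test tweet after the question is an assumption.
    return f"{context}\n{question}\nTWEET: {to_en(test_tweet)} [/INST]"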
According to a recent study [34], preprocessing usually has little effect on text classification tasks when employing Transformers. More precisely, when it comes to Transformers, the optimal set of preprocessing techniques rarely differs from applying none at all. For these reasons, and to preserve the speed and computational efficiency of our system, we did not preprocess the text in any way.

4. Experimental Setup
Our model implementation was executed on Google Colab, utilizing the Mistral 7B library from Hugging Face, specifically the Mistral-7B-Instruct-v0.2-GGUF version from TheBloke. Additionally, we utilized the deep_translator package with Google Translator (https://pypi.org/project/deep-translator/) for the translation task. The Mistral 7B version employed is an enhanced iteration of the Mistral-7B-Instruct-v0.1 model, geared towards instruction fine-tuning. To leverage instruction fine-tuning, prompts should be enclosed within [INST] and [/INST] tokens, with the initial instruction beginning with a begin-of-sentence identifier and subsequent instructions omitting it. The generation process is terminated by the end-of-sentence token ID. Furthermore, we imported the Llama library [35] from llama_cpp, with comprehensive details available on GitHub.
All datasets required for the various phases of the experiment are accessible on the official competition page. No additional fine-tuning was conducted on the model. The experiment was executed using a T4 GPU provided by Google. Upon generating the predictions, the results were exported in the format specified by the organizers. As previously mentioned, our complete codebase is accessible on GitHub.
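As a concrete illustration of this setup, the following sketch loads a quantized GGUF build of the model with llama_cpp and queries it. The specific quantization file (Q4_K_M) and the generation parameters are our assumptions; the paper only states that a quantized Mistral-7B-Instruct-v0.2-GGUF model was used.

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one quantized file from the GGUF repository; the Q4_K_M choice
# is an assumption, as the paper does not state which file was used.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=4096,        # room for ten few-shot examples plus the test tweet
    n_gpu_layers=-1,   # offload all layers to the Colab T4 GPU
)

def classify(prompt: str) -> str:
    # Greedy decoding; only a short YES/NO completion is needed.
    out = llm(prompt, max_tokens=4, temperature=0.0)
    text = out["choices"][0]["text"].upper()
    return "YES" if "YES" in text else "NO"

A prediction for one test tweet is then obtained as classify(build_prompt(train_samples, test_tweet)).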
5. Results
The official metric used to compile the final ranking was the normalized ICM; the ICM itself and the F1-score computed on the gold label YES were also reported. In Table 1, the global results obtained by the first three participants and by the last one are shown along with our submissions. While we do not know the details of other participants' implementations, we can notice that there is a relevant gap between the top teams and ours. The results of our two runs are comparable. As already stated, our approach is based on the application of prompt engineering using Mistral 7B. It is worth mentioning that the only difference between our two submissions is the position of the <s> tag used by Mistral.
In Table 2, the results for the English language obtained by the first three participants and by the last one are shown along with our submissions. In this case, we notice a greater gap with the last-ranked submission. However, our best result is slightly better than the one obtained in the global ranking. Finally, in Table 3, the results obtained by the first three participants and by the last one for the Spanish language are shown along with our submission. In this case, our approach obtained the worst results compared to the first positions.

Table 1
Performance of participant models for the global ranking in the hard setting. Results are sorted according to the ICM. Our two runs ranked 56 and 57, respectively.

Pos  Participant                   ICM-Hard  ICM-Hard Norm  F1
1    NYCU-NLP_1.json                0.597     0.800         0.794
2    ABCD Team_1.json               0.596     0.799         0.783
3    CIMAT-CS-NLP_2.json            0.593     0.798         0.790
56   mc-mistral_2.json              0.061     0.531         0.532
57   mc-mistral_1.json             -0.009     0.495         0.478
70   The-Three-Musketeers_3.json   -0.464     0.266         0.300

Table 2
Performance of participant models for the English language in the hard setting. Results are sorted according to the ICM. Our two runs ranked 59 and 61, respectively.

Pos  Participant                   ICM-Hard  ICM-Hard Norm  F1
1    EquityExplorers_2.json         0.618     0.815         0.761
2    EquityExplorers_1.json         0.595     0.804         0.749
3    I2C-UHU_2.json                 0.580     0.796         0.763
59   mc-mistral_2.json              0.142     0.572         0.563
61   mc-mistral_1.json              0.076     0.539         0.519
66   shm2024_2.json                -0.367     0.313         0.462

Table 3
Performance of participant models for the Spanish language in the hard setting. Results are sorted according to the ICM. Our two runs ranked 58 and 59, respectively.

Pos  Participant                   ICM-Hard  ICM-Hard Norm  F1
1    NYCU-NLP_1.json                0.621     0.811         0.824
2    ABCD Team_1.json               0.617     0.808         0.810
3    CIMAT-CS-NLP_2.json            0.610     0.805         0.815
58   mc-mistral_2.json             -0.012     0.494         0.507
59   mc-mistral_1.json             -0.087     0.456         0.444
66   UniLeon-UniBO_1.json          -0.511     0.245         0.607

Unfortunately, from our perspective it is not easy to explain the actual gap with the best-performing team. It is also worth noticing that our approach ranks better in the case of the English language, as shown in Table 2. Compared to the best-performing models, our simple approach exhibits some room for improvement, although it is able to outperform some of the baselines provided. However, it is worth noting that it required no further pre-training, and the computational cost to address the task is manageable with the free online resources offered by Google Colab. Furthermore, our approach made use of a quantized version of Mistral 7B available on Hugging Face and referenced in our code available on GitHub.
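While the ICM scores are computed by the organizers' evaluation tooling, the reported F1 can be reproduced from a run file and the gold labels with a short sketch such as the following. Treating YES as the positive class is our reading of the setup, and the file names and the gold-label file format are placeholders.

import json
from sklearn.metrics import f1_score

# Placeholder file names; both files are assumed to map sample IDs to YES/NO.
with open("mc-mistral_1.json", encoding="utf-8") as f:
    run = {item["id"]: item["value"] for item in json.load(f)}
with open("gold_labels.json", encoding="utf-8") as f:
    gold = {item["id"]: item["value"] for item in json.load(f)}

ids = sorted(gold)
y_true = [gold[i] for i in ids]
y_pred = [run.get(i, "NO") for i in ids]  # default to NO for missing IDs

# F1 with YES as the positive class, as reported in the official ranking.
print(f1_score(y_true, y_pred, pos_label="YES"))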
6. Conclusion
This paper presents the application of a Mistral 7B model to address Task 1 at EXIST 2024, hosted at CLEF 2024. In our submission, we opted to adopt a few-shot learning strategy, utilizing an in-domain pre-trained Transformer model without modifications. Through numerous experimental iterations, we discovered the efficacy of constructing a prompt comprising examples extracted from the training dataset. Subsequently, we presented few-shot samples alongside a test sample as the prompt. The model's objective was to discern whether sexist content is present within a tweet.
Undoubtedly, tackling this task presented considerable challenges, and despite our dedicated efforts, it is evident from the analysis of the final ranking that there is still considerable room for improvement. Potential alternative avenues for exploration include leveraging the zero-shot capabilities of alternative models such as GPT and T5, expanding the training dataset by incorporating additional data sources, or implementing a novel approach to integrate domain-specific ontology-based knowledge, departing from the methodology outlined in our current work. These strategies hold promise for advancing the effectiveness and robustness of our model; similar prompt-based approaches have already proved effective for related problems such as detecting hallucinations [36].
Additional enhancements could be achieved by fine-tuning the model and reframing the problem as a distinct text classification task. Fine-tuning would allow the model to adapt more closely to the specific characteristics of our dataset and the nuances of the sexism identification task. By approaching the problem from a different classification perspective, we may uncover alternative feature representations or model architectures better suited to capturing the subtle distinctions between sexist and non-sexist content. This strategic shift could potentially lead to more refined and accurate predictions, ultimately improving the overall performance of our system. Furthermore, given the interesting results recently obtained on a plethora of tasks, other few-shot learning [37, 38, 39, 40] or data augmentation strategies [41, 42, 43, 44] could also be employed to improve the results.
Upon reviewing the final ranking, it becomes apparent that our straightforward approach leaves areas where enhancements could be made. Nevertheless, it is noteworthy that our method necessitated no additional pre-training and remained computationally feasible using the resources provided by Google Colab. Furthermore, the proposed approach enabled us to surpass some baselines established by the task organizers. This achievement underscores the efficacy and accessibility of our methodology, despite its potential for further refinement.

Acknowledgments
We extend our gratitude to the anonymous reviewers for their insightful comments and valuable suggestions, which have significantly enhanced the clarity and presentation of this paper.

Authorship Contribution
Marco Siino: Conceptualization, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing. Ilenia Tinnirello: Writing - review & editing, Methodology.

References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[2] G. Yenduri, M. Ramalingam, G. C. Selvi, Y. Supriya, G. Srivastava, P. K. R. Maddikunta, G. D. Raj, R. H. Jhaveri, B. Prabadevi, W. Wang, et al., GPT (Generative Pre-trained Transformer) - a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions, IEEE Access (2024).
[3] A. Sahu, P. K. Das, S. Meher, Recent advancements in machine learning and deep learning-based breast cancer detection using mammograms, Physica Medica 114 (2023) 103138.
[4] K. Sharifani, M. Amini, Machine learning and deep learning: A review of methods and applications, World Information Technology and Engineering Journal 10 (2023) 3897–3904.
[5] A. Nicosia, N. Cancilla, M. Siino, M. Passerini, F. Sau, I. Tinnirello, A. Cipollina, Alarms Early Detection in Dialytic Therapies via Machine Learning Models, in: T. Jarm, R. Šmerc, S. Mahnič-Kalamiza (Eds.), 9th European Medical and Biological Engineering Conference, Springer Nature Switzerland, Cham, 2024, pp. 55–66.
[6] F. Colas, P. Brazdil, Comparison of SVM and some older classification algorithms in text classification tasks, in: IFIP International Conference on Artificial Intelligence in Theory and Practice, Springer, 2006, pp. 169–178.
[7] D. Croce, D. Garlisi, M. Siino, An SVM ensemble approach to detect irony and stereotype spreaders on Twitter, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2426–2432.
[8] Y. Kim, Convolutional neural networks for sentence classification, in: A. Moschitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2014, pp. 1746–1751. doi:10.3115/V1/D14-1181.
[9] M. Siino, E. Di Nuovo, I. Tinnirello, M. La Cascia, Detection of hate speech spreaders using convolutional neural networks, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2126–2136.
[10] F. Lomonaco, G. Donabauer, M. Siino, COURAGE at CheckThat!-2022: Harmful tweet detection using graph neural networks and ELECTRA, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 573–583.
[11] M. Miri, M. B. Dowlatshahi, A. Hashemi, M. K. Rafsanjani, B. B. Gupta, W. Alhalabi, Ensemble feature selection for multi-label text classification: An intelligent order statistics approach, International Journal of Intelligent Systems 37 (2022) 11319–11341.
[12] M. Siino, I. Tinnirello, M. La Cascia, T100: A modern classic ensemble to profile irony and stereotype spreaders, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2666–2674.
[13] M. Siino, M. La Cascia, I. Tinnirello, McRock at SemEval-2022 Task 4: Patronizing and condescending language detection using multi-channel CNN, hybrid LSTM, DistilBERT and XLNet, in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States, July 14-15, 2022, Association for Computational Linguistics, 2022, pp. 409–417. doi:10.18653/V1/2022.SEMEVAL-1.55.
[14] M. Siino, DeBERTa at SemEval-2024 Task 9: Using DeBERTa for Defying Common Sense, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 291–297.
[15] M. Siino, All-Mpnet at SemEval-2024 Task 1: Application of Mpnet for Evaluating Semantic Textual Relatedness, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 379–384.
[16] M. Siino, T5-medical at SemEval-2024 Task 2: Using T5 medical embeddings for natural language inference on clinical trial data, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 40–46.
[17] M. Jullien, M. Valentino, A. Freitas, SemEval-2024 Task 2: Safe biomedical natural language inference for clinical trials, in: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, 2024, pp. 1947–1962.
[18] M. Siino, TransMistral at SemEval-2024 Task 10: Using Mistral 7B for emotion discovery and reasoning its flip in conversation, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 298–304.
[19] Y. Wang, J. Mansurov, P. Ivanov, J. Su, A. Shelmanov, A. Tsvigun, C. Whitehouse, O. M. Afzal, T. Mahmoud, G. Puccetti, T. Arnold, A. F. Aji, N. Habash, I. Gurevych, P. Nakov, SemEval-2024 Task 8: Multigenerator, multidomain, and multilingual black-box machine-generated text detection, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico, Mexico, 2024, pp. 2057–2079.
[20] M. Siino, BadRock at SemEval-2024 Task 8: DistilBERT to detect multigenerator, multidomain and multilingual black-box machine-generated text, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 239–245.
[21] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[22] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[23] L. Plaza, J. Carrillo-de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023 - Learning with disagreement for sexism identification and characterization, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023, pp. 316–342.
[24] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón, M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2022: Sexism identification in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229–240.
[25] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
[26] M. Siino, E. Di Nuovo, I. Tinnirello, M. La Cascia, Fake news spreaders detection: Sometimes attention is not all you need, Information 13 (2022) 426. doi:10.3390/INFO13090426.
[27] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019, Proceedings 18, Springer, 2019, pp. 194–206.
[28] D. Van Thin, D. N. Hao, N. L.-T. Nguyen, Vietnamese sentiment analysis: An overview and comparative study of fine-tuning pretrained language models, ACM Transactions on Asian and Low-Resource Language Information Processing 22 (2023) 1–27.
[29] F. Lomonaco, M. Siino, M. Tesconi, Text enrichment with Japanese language to profile cryptocurrency influencers, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2708–2716.
[30] S. Mangione, M. Siino, G. Garbo, Improving irony and stereotype spreaders detection using data augmentation and convolutional neural network, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2585–2593.
[31] D. Dimitrov, F. Alam, M. Hasanain, A. Hasnat, F. Silvestri, P. Nakov, G. Da San Martino, SemEval-2024 Task 4: Multilingual detection of persuasion techniques in memes, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 2009–2026.
[32] M. Siino, McRock at SemEval-2024 Task 4: Mistral 7B for multilingual detection of persuasion techniques in memes, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 53–59.
[33] J. Littenberg-Tobias, G. R. Marvez, G. Hillaire, J. Reich, Comparing few-shot learning with GPT-3 to traditional machine learning approaches for classifying teacher simulation responses, in: AIED (2), volume 13356 of Lecture Notes in Computer Science, Springer, 2022, pp. 471–474.
[34] M. Siino, I. Tinnirello, M. La Cascia, Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers, Information Systems 121 (2024) 102342.
[35] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[36] M. Siino, BrainLlama at SemEval-2024 Task 6: Prompting Llama to detect hallucinations and related observable overgeneration mistakes, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 82–87.
[37] X. Wang, X. Wang, B. Jiang, B. Luo, Few-shot learning meets transformer: Unified query-support transformers for few-shot classification, IEEE Transactions on Circuits and Systems for Video Technology 33 (2023) 7789–7802. doi:10.1109/TCSVT.2023.3282777.
[38] B. M. S. Maia, M. C. F. Ribeiro de Assis, L. M. de Lima, M. B. Rocha, H. G. Calente, M. L. A. Correa, D. R. Camisasca, R. A. Krohling, Transformers, convolutional neural networks, and few-shot learning for classification of histopathological images of oral cancer, Expert Systems with Applications 241 (2024) 122418. doi:10.1016/j.eswa.2023.122418.
[39] M. Siino, M. Tesconi, I. Tinnirello, Profiling cryptocurrency influencers with few-shot learning using data augmentation and ELECTRA, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2772–2781.
[40] Z. Meng, Z. Zhang, Y. Guan, J. Li, L. Cao, M. Zhu, J. Fan, F. Fan, A hierarchical transformer-based adaptive metric and joint-learning network for few-shot rolling bearing fault diagnosis, Measurement Science and Technology 35 (2024). doi:10.1088/1361-6501/ad11e9.
[41] F. Muftie, M. Haris, IndoBERT based data augmentation for Indonesian text classification, in: 2023 International Conference on Information Technology Research and Innovation, ICITRI 2023, 2023, pp. 128–132. doi:10.1109/ICITRI59340.2023.10250061.
[42] M. Siino, F. Lomonaco, P. Rosso, Backtranslate what you are saying and I will tell who you are, Expert Systems (2024) e13568. doi:10.1111/exsy.13568.
[43] J. M. Tapia-Téllez, H. J. Escalante, Data augmentation with transformers for text classification, in: L. Martínez-Villaseñor, O. Herrera-Alcántara, H. Ponce, F. A. Castro-Espinoza (Eds.), Advances in Computational Intelligence, Springer International Publishing, Cham, 2020, pp. 247–259.
[44] M. Siino, I. Tinnirello, XLNet with Data Augmentation to Profile Cryptocurrency Influencers, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2763–2771.