=Paper=
{{Paper
|id=Vol-3740/paper-69
|storemode=property
|title=GPT Hallucination Detection Through Prompt Engineering
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-69.pdf
|volume=Vol-3740
|authors=Marco Siino,Ilenia Tinnirello
|dblpUrl=https://dblp.org/rec/conf/clef/SiinoT24
}}
==GPT Hallucination Detection Through Prompt Engineering==
GPT Hallucination Detection Through Prompt Engineering

Notebook for the ELOQUENT Lab at CLEF 2024

Marco Siino¹*, Ilenia Tinnirello²

¹ University of Catania, Piazza Università 2, Catania, 95131, Italy
² University of Palermo, Piazza Marina 61, Palermo, 90133, Italy
* Corresponding author: marco.siino@unipa.it (M. Siino)

Abstract

Detecting hallucinated or factually inaccurate information produced by GPT models can be challenging for humans. Consequently, it is crucial to thoroughly test Large Language Models (LLMs) for their accuracy before deployment. One potential method for identifying hallucinated content, explored at the ELOQUENT 2024 Lab hosted at CLEF 2024, is to use LLMs to assess the output of other LLMs. In this paper, we discuss the application of a Mistral 7B model to the hard-labelling setup of the task for English and Swedish. Our approach combines Mistral 7B with a few-shot learning strategy and prompt engineering. On the English test set our submission achieved an F1 of 0.72, and on the Swedish test set it achieved an F1 of 0.75. The selected approach outperforms some of the baselines provided for the competition, as well as several other LLM-based submissions.

Keywords: GPT, hallucination detection, Mistral 7B, LLM, prompt engineering

1. Introduction

In recent years, Natural Language Processing (NLP) has been reshaped by Generative Pre-trained Transformer (GPT) models [1, 2], which generate human-like text across a variety of applications. Despite their impressive capabilities, these models exhibit a phenomenon known as "hallucination" [3, 4, 5], in which they produce text that is plausible but factually incorrect or nonsensical. Understanding the hallucination phenomenon is crucial for improving the reliability and trustworthiness of these AI systems. Hallucination in the context of GPT models refers to the generation of content that is not grounded in the input data or in real-world knowledge. It can manifest in several ways: a) factual errors: the model generates incorrect information about well-known facts, such as stating that "Paris is the capital of Italy" instead of France; b) incoherent responses: the model produces text that does not logically follow from the input, resulting in gibberish or irrelevant content; c) invented details: the model creates details or events that did not occur, which is problematic in contexts requiring accuracy, such as news articles or scientific reports.

The detection of hallucinated content online presents a growing challenge, necessitating the development of automated tools for data extraction and categorization. These tools can address established and emerging societal concerns. Recent advancements in machine and deep learning architectures have fuelled a surge of interest in NLP techniques. Building upon this surge in NLP research, several text classification strategies have been proposed in the literature to automate the identification and categorization of online textual content [6, 7]. In the last fifteen years, some of the most successful strategies have been based on SVMs [8, 9], Convolutional Neural Networks (CNNs) [10, 11], Graph Neural Networks (GNNs) [12], ensemble models [13, 14] and, recently, Transformers [1, 15, 16, 17]. The adoption of LLM-based architectures in academic research has been further propelled by the diverse methodologies showcased at SemEval 2024, where Task 6 (SHROOM) provides the main foundation for the ELOQUENT task.
ELOQUENT and SHROOM were proposed independently around the same time, making them contemporaries. HalluciGen was not based on the SHROOM task; however, due to the timing of the two tasks, SHROOM's data was leveraged for the paraphrase scenario. It is important to note that SHROOM data was not used for the translation scenario. Additionally, SHROOM and ELOQUENT have different aims: ELOQUENT focuses on both generation and detection, with a specific interest in LLM systems, whereas SHROOM is solely a detection task and does not specifically target LLM systems. Also at SemEval, LLM applications address a range of tasks and yield notable outcomes. For instance, in Task 2 [18], T5 is used to address the challenge of identifying the inference relation between plain-language statements and Clinical Trial Reports [19]. In Task 10, a Mistral 7B model is employed to perform Emotion Recognition in Conversation (ERC) within Hindi-English code-mixed conversations [20]. Additionally, in Task 8 [21], a DistilBERT model is leveraged to identify machine-generated text [22].

Finally, for Task 2 at ELOQUENT 2024 (HalluciGen), the organizers aim to develop robust LLM-based detectors of hallucinated content. To facilitate a cross-model evaluation, the first objective focuses on creating evaluators that can both identify and generate hallucinations. The second objective involves testing these evaluators on challenging hallucination cases to reveal their strengths and weaknesses. To address the paraphrase scenario of the HalluciGen task, we describe a system submitted for English and Swedish, proposing a Transformer-based approach that makes use of Mistral 7B [23]. We used the model in a few-shot manner described in the rest of this paper. Specifically, we provided 16 samples from the English training set and 20 samples from the Swedish training set. We opted for Mistral 7B because the comparative analysis between Mistral 7B and other leading models, namely Llama 2 and Llama 1, reveals noteworthy advancements in common NLP tasks. Across multiple benchmark evaluations, Mistral 7B consistently outperforms Llama 2, a prominent open 13B model. Moreover, it surpasses Llama 1, a state-of-the-art 34B model, particularly in tasks pertaining to reasoning, mathematics, and code generation. Our choice is also supported by the final ranking, where several Llama-based approaches underperform compared to our Mistral 7B-based approach.

The remainder of the paper is organized as follows. We give some background information on Task 2 hosted at ELOQUENT 2024 in Section 2. Section 3 explains the methodology. We describe the experimental setup needed to reproduce our work in Section 4. The official task results and a discussion are given in Section 5. We provide our conclusions and suggestions for further research in Section 6.
We make all the code publicly available and reusable on GitHub.

2. Task Description

This section provides background information on Task 2, held at ELOQUENT 2024. The task challenges participants to develop models that can detect hallucinations in machine translation and paraphrase generation. The challenge seeks multilingual and monolingual models that can both detect and generate hallucinations in machine translation (evaluating two translations) and paraphrase generation (assessing a single paraphrase). Given a source sentence and potential outputs (hypotheses), these models need to identify nonsensical content (hallucinations) even without a reference translation or paraphrase for comparison. For our submission, we only addressed the detection step, where we were asked to select which one of the two provided hypotheses was a hallucination. An example from the official task description is shown in Figure 1. Finally, the task organizers requested the submission of a CSV file in the format shown in Figure 2. The first column reports the ID of the test sample, the second column reports which of the two hypotheses is the hallucinated one, and the last column is an optional field in which any explanation provided by the model can be reported.

Figure 1: A sample from the task description page. The output of the model has to be either hyp1 or hyp2. In this case, the hallucinated hypothesis is hyp1.

3. System Overview

Although it has already been empirically shown on a few tasks (e.g., text classification, author profiling, etc. [24, 25, 26, 27]) that Transformers alone are not necessarily the best option for text classification, depending on the goal, strategies such as domain-specific fine-tuning [28, 29] or data augmentation [30, 31] can be beneficial for several applications. As a starting point, we tried to use Mistral 7B in a zero-shot way. However, zero-shot prompting with Mistral 7B Instruct tended to produce hallucinations rather than the expected output labels (hyp1, hyp2), likely due to several factors: without additional context or examples the model relies on its prior knowledge, which may not align with the task's requirements; the prompts used may not have been clear or specific enough, resulting in open-ended responses instead of precise labels; and the task may involve nuances requiring a more guided or fine-tuned approach. Even state-of-the-art models like Mistral 7B Instruct have limitations in zero-shot scenarios, struggling without sufficient context and examples. Considering these preliminary findings, our approach is a few-shot one [32] and makes use of the above-mentioned Mistral 7B. Mistral 7B, specifically Mistral-7B-Instruct-v0.2 from Hugging Face, is a language model with 7 billion parameters designed to excel in both performance and efficiency. Compared to the leading open 13B model (Llama 2), Mistral 7B demonstrates superior performance across all evaluated benchmarks [23]. Moreover, it outperforms the best released 34B model (Llama 1) in tasks related to reasoning, mathematics, and code generation.
The model leverages grouped-query attention (GQA) to speed up inference, along with sliding-window attention (SWA) to efficiently process sequences of varying lengths while minimizing inference costs. Additionally, a fine-tuned variant, Mistral 7B Instruct, tailored for following instructions, surpasses the Llama 2 13B chat model on both human and automated benchmarks. The introduction of Mistral 7B Instruct underscores how easily the base model can be fine-tuned to achieve notable performance gains. The Mistral 7B Instruct variant requires a specific input format:

[INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST]

Instruction, together with the following Model answer, can be a single sample with its label or a set of sample/label pairs (realizing, in the latter case, a few-shot use of the model). Follow-up instruction is then the current sample for which the model has to provide a prediction. Specifically, we prepared a few-shot text string containing samples from the training set along with their respective labels. The full text containing the training samples plus the sample to be classified was then provided as a prompt to Mistral, followed by the question: "Which one of hyp1 and hyp2 is not supported by src?". To this request, the model replied with one option between hyp1 and hyp2.

Figure 2: The output format requested by the organizers for the detection task. One file is requested for each of the two languages (i.e., English and Swedish).

As an example from the test set, given the source sentence "Mr President, the approach adopted by the rapporteur to the Commission's 1999 annual economic report is comprehensive and also sensible.", hypothesis 1 "The approach taken by the rapporteur to the 1997 annual economic report is comprehensive and sensible." and hypothesis 2 "The approach taken by the rapporteur to the 1999 annual economic report is comprehensive and sensible.", the model replied to the prompt with hyp1. It is important to mention that we also tried to use the model in a zero-shot configuration, simply asking it to pick one of the two hypotheses as hallucinated content. Unfortunately, the model usually produced discussions as answers that, in most cases, did not identify the hallucinated hypothesis. Finally, we collected all the predictions provided on the test set into a CSV file in the required format and submitted it.

The approach just discussed was followed for the English test set. For the Swedish test set, we used the GoogleTranslator class from the deep_translator library to translate the samples into English before feeding them to Mistral 7B, since our preliminary experiments with the original Swedish samples did not provide relevant results. As noted in the recent study by [26], the contribution of preprocessing to text classification tasks is generally not impactful when using Transformers. More specifically, the best combination of preprocessing strategies is not significantly different from performing no preprocessing at all for the LLMs evaluated. For these reasons, and to keep our system fast and computationally light, we did not perform any preprocessing on the text. The low impact of the best preprocessing techniques, or combinations of techniques, with Transformers is attributed in that study to several factors, such as preserving the quantity and the quality of the original information available. A minimal sketch of the prompt construction and of the translation step is given below.
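The following sketch illustrates how a few-shot prompt in the Mistral Instruct format described above can be assembled, and how Swedish samples can be translated to English with deep_translator. It is a minimal illustration under stated assumptions, not our exact submission code: the variable names, the demonstration strings, and the sample fields (src, hyp1, hyp2, label) are hypothetical placeholders for readability; the prompts actually used are available in our GitHub repository.

```python
from deep_translator import GoogleTranslator

# Hypothetical few-shot demonstrations drawn from the training set (field names are assumptions).
train_shots = [
    {"src": "source sentence A", "hyp1": "paraphrase A1", "hyp2": "paraphrase A2", "label": "hyp1"},
    {"src": "source sentence B", "hyp1": "paraphrase B1", "hyp2": "paraphrase B2", "label": "hyp2"},
]

# The question appended after each sample, as described in the text.
QUESTION = "Which one of hyp1 and hyp2 is not supported by src?"


def translate_to_english(text: str, source_lang: str = "sv") -> str:
    """Translate a sample into English (used only for the Swedish test set)."""
    return GoogleTranslator(source=source_lang, target="en").translate(text)


def build_prompt(test_sample: dict, shots: list[dict]) -> str:
    """Assemble a few-shot prompt in the [INST] ... [/INST] Mistral Instruct format."""
    parts = []
    for shot in shots:
        demo = (f"src: {shot['src']}\nhyp1: {shot['hyp1']}\n"
                f"hyp2: {shot['hyp2']}\n{QUESTION}")
        # Each demonstration is an instruction followed by the expected model answer.
        parts.append(f"[INST] {demo} [/INST] {shot['label']}")
    query = (f"src: {test_sample['src']}\nhyp1: {test_sample['hyp1']}\n"
             f"hyp2: {test_sample['hyp2']}\n{QUESTION}")
    # The test sample is appended as the follow-up instruction to be answered.
    parts.append(f"[INST] {query} [/INST]")
    return " ".join(parts)
```

For the Swedish test set, translate_to_english would be applied to src, hyp1 and hyp2 before calling build_prompt, in line with the translation step described above.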
4. Experimental Setup

We implemented our model on Google Colab. The model, as already mentioned, is Mistral 7B from Hugging Face; we employed the v0.2 iteration, Mistral-7B-Instruct-v0.2, an enhanced version of the Mistral-7B-Instruct-v0.1 model. To harness the capabilities of instruction fine-tuning, prompts must be enclosed within [INST] and [/INST] tokens, and the initial instruction must begin with a begin-of-sentence identifier, while subsequent instructions must not. The assistant generation is terminated by the end-of-sentence token. We also imported the Llama class [33] from llama_cpp to run the model. We did not perform any additional fine-tuning. A T4 GPU from Google was used to run the experiments. After generating the predictions, we exported the results in the format required by the organizers. As already mentioned, all of our code is available on GitHub. A minimal sketch of this setup is shown below.
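As a minimal, hedged sketch of the setup described in this section: the snippet below loads a quantized GGUF build of Mistral-7B-Instruct-v0.2 with llama-cpp-python, runs a prompt built as in the previous section, and writes the predictions to a CSV file. The model filename, the generation parameters, and the CSV column names are assumptions made for illustration; the exact values we used are in our GitHub repository.

```python
import csv
from llama_cpp import Llama

# Path to a quantized GGUF build of Mistral-7B-Instruct-v0.2 (filename is an assumption).
MODEL_PATH = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Offload all layers to the Colab T4 GPU; context window sized for the few-shot prompt.
llm = Llama(model_path=MODEL_PATH, n_ctx=4096, n_gpu_layers=-1, verbose=False)


def predict_label(prompt: str) -> str:
    """Generate a short completion and map it onto one of the two allowed labels."""
    out = llm(prompt, max_tokens=8, temperature=0.0)
    text = out["choices"][0]["text"].lower()
    return "hyp2" if "hyp2" in text else "hyp1"


def write_submission(rows: list[tuple[str, str, str]], path: str) -> None:
    """Write the CSV in the submission format (column names here are assumptions)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label", "explanation"])
        writer.writerows(rows)
```

write_submission would be called with one (id, label, explanation) row per test sample, producing one file per language as requested by the organizers.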
5. Results

To compile the final ranking, the evaluation used the F1-score with respect to gold labels indicating which hypothesis contains the hallucination; for the paraphrase task these labels are human-annotated. Accuracy, precision and recall were also reported in the final ranking.

Table 1: Performance of participant models for the English language. Results are sorted according to the F1-score. Our model ranked 6th.

Pos | Participant | acc | f1 | prec | rec | Model
1 | final_gpt4_en_v2_detection - Narjes Nikzad.csv | 0.91 | 0.91 | 0.91 | 0.91 | gpt-4-turbo
2 | test_pg_english - harika vuppala.csv | 0.90 | 0.90 | 0.91 | 0.90 | Majority voting of different finetuned LLMs
3 | majority_vote_result_en_narjes - Narjes Nikzad.csv | 0.85 | 0.85 | 0.86 | 0.85 | Majority vote on google/gemma-7B-it, meta-llama/Meta-Llama-3-8B-Instruct, gpt-3.5-turbo, gpt-4-turbo
4 | final_llama3_en_v1_detection - Narjes Nikzad.csv | 0.80 | 0.80 | 0.81 | 0.80 | meta-llama/Meta-Llama-3-8B-Instruct
5 | final_gpt_en_v1_detection - Narjes Nikzad.csv | 0.73 | 0.73 | 0.83 | 0.73 | gpt-3.5-turbo
6 | eloquent2024_mc_mistral_en_prediction - Marco Siino.csv | 0.73 | 0.72 | 0.73 | 0.73 | Mistral 7B
7 | final_gemma_en_v1_detection - Narjes Nikzad.csv | 0.71 | 0.71 | 0.77 | 0.71 | google/gemma-7B-it
8 | final_llama3_prompt_narjes_en_v1_detection - Narjes Nikzad.csv | 0.69 | 0.69 | 0.81 | 0.69 | meta-llama/Meta-Llama-3-8B
9 | final_gpt_en_narjes_detection - Narjes Nikzad.csv | 0.68 | 0.68 | 0.75 | 0.68 | gpt-3.5-turbo
10 | final_gemma_en_vnarjes - Narjes Nikzad.csv | 0.54 | 0.49 | 0.73 | 0.54 | gemma-7B-it

Table 2: Performance of participant models for the Swedish language. Results are sorted according to the F1-score. Our model ranked 3rd.

Pos | Participant | acc | f1 | prec | rec | Model
1 | final_gpt4_se_v1_detection - Narjes Nikzad.csv | 0.81 | 0.81 | 0.81 | 0.81 | gpt-4-turbo
2 | test_pg_swedish - harika vuppala.csv | 0.80 | 0.79 | 0.82 | 0.80 | Majority voting of different finetuned LLMs
3 | eloquent2024_mc_mistral_sv_prediction - Marco Siino.csv | 0.76 | 0.75 | 0.78 | 0.76 | Mistral 7B
4 | final_gpt_se_v1_detection - Narjes Nikzad.csv | 0.71 | 0.70 | 0.76 | 0.71 | gpt-3.5-turbo
5 | majority_vote_result_se_narjes - Narjes Nikzad.csv | 0.67 | 0.66 | 0.72 | 0.67 | Majority vote on google/gemma-7B-it, meta-llama/Meta-Llama-3-8B-Instruct, gpt-3.5-turbo, gpt-4-turbo
6 | final_gpt_se_narjes_detection - Narjes Nikzad.csv | 0.61 | 0.60 | 0.65 | 0.61 | gpt-3.5-turbo
7 | final_llama3_se_v1_detection - Narjes Nikzad.csv | 0.60 | 0.59 | 0.60 | 0.60 | meta-llama/Meta-Llama-3-8B-Instruct
8 | final_gemma_se_v1_detection - Narjes Nikzad.csv | 0.59 | 0.52 | 0.71 | 0.59 | gemma-7B-it
9 | final_llama3_prompt_narjes_se_v2_detection - Narjes Nikzad.csv | 0.57 | 0.48 | 0.77 | 0.57 | meta-llama/Meta-Llama-3-8B
10 | final_gemma_se_vnarjes - Narjes Nikzad.csv | 0.07 | 0.11 | 0.47 | 0.07 | google/gemma-7B-it

Table 1 reports the results obtained by the participants for English along with the models used. While we do not know the details of the other participants' implementations, we note that most of the submissions were made by the same team. Furthermore, the gap between the submissions at positions 5 and 9 is not fully explained: both appear to use the same model (GPT-3.5), so from our perspective the size of the gap is difficult to motivate. It is also worth noting that our approach ranked better in Swedish than in English. This result is shown in Table 2 and can be motivated by the findings reported in [34], where the authors observe that the translation process can make some relevant semantics more explicit. For both languages, GPT-4 Turbo appeared to be the best performing model according to the final ranking.

Compared to the best performing models, our simple approach leaves some room for improvement, although it outperforms some of the provided baselines. However, it is worth noting that it required no further pre-training and that the computational cost of addressing the task is manageable with the free online resources offered by Google Colab. Furthermore, our approach made use of a quantized version of Mistral 7B available on Hugging Face and referenced in our code on GitHub.

6. Conclusion

This paper presents the application of a Mistral 7B model to Task 2 at ELOQUENT 2024, hosted at CLEF 2024. For our submission, we followed a few-shot learning approach, employing an in-domain pre-trained Transformer as-is. After several experiments, we found it beneficial to build a prompt containing samples from the training set; we then provided, as a prompt, the few-shot samples together with a test sample, and the model was asked to select which hypothesis was an actual hallucination. The task is challenging, and there is still room for improvement, as can be noted by looking at the final ranking. Possible alternative approaches include exploiting the zero-shot capabilities of other models such as GPT and T5, increasing the size of the few-shot set by using further data from the training set, or integrating ontology-based domain knowledge differently from what has been proposed in our work. Further improvements could be obtained with fine-tuning or by modelling the problem as a different text classification task. Furthermore, given the interesting results recently reported on a plethora of tasks, other few-shot learning [35, 36, 37, 38] or data augmentation strategies [39, 34, 40, 41] could also be employed to improve the results. Looking at the final ranking, our simple approach leaves some room for improvement; however, it required no further pre-training and its computational cost is manageable with the free online resources offered by Google Colab. Also, thanks to the proposed approach, we were able to outperform the baseline provided by the task organizers.
Acknowledgments We extend our gratitude to the anonymous reviewers for their insightful comments and valuable suggestions, which have significantly enhanced the clarity and presentation of this paper. CRediT Authorship Contribution Statement Marco Siino: Conceptualization, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - Original draft, writing - review & editing. Ilenia Tinnirello: Writing - review & editing, Methodology. References [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). [2] G. Yenduri, M. Ramalingam, G. C. Selvi, Y. Supriya, G. Srivastava, P. K. R. Maddikunta, G. D. Raj, R. H. Jhaveri, B. Prabadevi, W. Wang, et al., GPT (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions, IEEE Access (2024). [3] M. Lee, A mathematical investigation of hallucination and creativity in GPT models, Mathematics 11 (2023) 2320. [4] S. A. Athaluri, S. V. Manthena, V. K. M. Kesapragada, V. Yarlagadda, T. Dave, R. T. S. Duddumpudi, Exploring the boundaries of reality: Investigating the phenomenon of artificial intelligence hallu- cination in scientific writing through chatgpt references, Cureus 15 (2023) e37432. [5] M. Siino, BrainLlama at SemEval-2024 task 6: Prompting llama to detect hallucinations and related observable overgeneration mistakes, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 82–87. [6] M. Siino, All-Mpnet at SemEval-2024 Task 1: Application of Mpnet for Evaluating Semantic Textual Relatedness, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval- 2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 379–384. [7] M. Siino, DeBERTa at SemEval-2024 Task 9: Using DeBERTa for Defying Common Sense, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 291–297. [8] F. Colas, P. Brazdil, Comparison of svm and some older classification algorithms in text classification tasks, in: IFIP International Conference on Artificial Intelligence in Theory and Practice, Springer, 2006, pp. 169–178. [9] D. Croce, D. Garlisi, M. Siino, An SVM ensemble approach to detect irony and stereotype spreaders on twitter, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2426–2432. [10] Y. Kim, Convolutional neural networks for sentence classification, in: A. Moschitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2014, pp. 1746–1751. 
URL: https://doi.org/10.3115/v1/ d14-1181. doi:10.3115/V1/D14-1181. [11] M. Siino, E. Di Nuovo, I. Tinnirello, M. La Cascia, Detection of hate speech spreaders using convolutional neural networks, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2126–2136. [12] F. Lomonaco, G. Donabauer, M. Siino, COURAGE at checkthat!-2022: Harmful tweet detection using graph neural networks and ELECTRA, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 573–583. [13] M. Miri, M. B. Dowlatshahi, A. Hashemi, M. K. Rafsanjani, B. B. Gupta, W. Alhalabi, Ensem- ble feature selection for multi-label text classification: An intelligent order statistics approach, International Journal of Intelligent Systems 37 (2022) 11319–11341. [14] M. Siino, I. Tinnirello, M. La Cascia, T100: A modern classic ensemble to profile irony and stereotype spreaders, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2666–2674. [15] M. Siino, M. La Cascia, I. Tinnirello, McRock at SemEval-2022 Task 4: Patronizing and Conde- scending Language Detection using Multi-Channel CNN, Hybrid LSTM, DistilBERT and XLNet, in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States, July 14-15, 2022, Association for Computational Linguistics, 2022, pp. 409–417. doi:10.18653/V1/2022.SEMEVAL-1.55. [16] M. Siino, McRock at SemEval-2024 task 4: Mistral 7B for multilingual detection of persuasion techniques in memes, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evalua- tion (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 53–59. [17] M. Siino, Mistral at SemEval-2024 task 5: Mistral 7B for argument reasoning in civil procedure, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal, A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 155–162. [18] M. Jullien, M. Valentino, A. Freitas, SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials, in: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, 2024, pp. 1947–1962. [19] M. Siino, T5-Medical at SemEval-2024 Task 2: Using T5 Medical Embeddings for Natural Language Inference on Clinical Trial Data, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 40–46. [20] M. 
Siino, TransMistral at SemEval-2024 Task 10: Using Mistral 7B for Emotion Discovery and Reasoning its Flip in Conversation, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 298–304. [21] Y. Wang, J. Mansurov, P. Ivanov, J. Su, A. Shelmanov, A. Tsvigun, C. Whitehouse, O. M. Afzal, T. Mahmoud, G. Puccetti, T. Arnold, A. F. Aji, N. Habash, I. Gurevych, P. Nakov, Semeval-2024 task 8: Multigenerator, multidomain, and multilingual black-box machine-generated text detection, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico, Mexico, 2024, pp. 2057–2079. [22] M. Siino, BadRock at SemEval-2024 Task 8: DistilBERT to Detect Multigenerator, Multidomain and Multilingual Black-Box Machine-Generated Text, in: Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico, 2024, pp. 239–245. [23] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7b, arXiv preprint arXiv:2310.06825 (2023). [24] M. Siino, E. Di Nuovo, I. Tinnirello, M. La Cascia, Fake News Spreaders Detection: Sometimes Attention Is Not All You Need, Information 13 (2022) 426. doi:10.3390/INFO13090426. [25] J. Pizarro, Profiling Bots and Fake News Spreaders at PAN’19 and PAN’20: Bots and Gender Profiling 2019, Profiling Fake News Spreaders on Twitter 2020, in: Proceedings - 2020 IEEE 7th International Conference on Data Science and Advanced Analytics, DSAA 2020, 2020, p. 626 – 630. doi:10.1109/DSAA49011.2020.00088. [26] M. Siino, I. Tinnirello, M. La Cascia, Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers, Information Systems 121 (2024) 102342. [27] R. O. Bueno, B. Chulvi, F. Rangel, P. Rosso, E. Fersini, Profiling irony and stereotype spreaders on twitter (IROSTEREO). overview for PAN at CLEF 2022, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2314–2343. [28] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019, Proceedings 18, Springer, 2019, pp. 194–206. [29] D. Van Thin, D. N. Hao, N. L.-T. Nguyen, Vietnamese sentiment analysis: An overview and comparative study of fine-tuning pretrained language models, ACM Transactions on Asian and Low-Resource Language Information Processing 22 (2023) 1–27. [30] F. Lomonaco, M. Siino, M. Tesconi, Text enrichment with japanese language to profile cryptocur- rency influencers, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2708–2716. [31] S. Mangione, M. Siino, G. Garbo, Improving irony and stereotype spreaders detection using data augmentation and convolutional neural network, in: G. Faggioli, N. Ferro, A. Hanbury, M. 
Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 2585–2593. [32] J. Littenberg-Tobias, G. R. Marvez, G. Hillaire, J. Reich, Comparing few-shot learning with GPT-3 to traditional machine learning approaches for classifying teacher simulation responses, in: AIED (2), volume 13356 of Lecture Notes in Computer Science, Springer, 2022, pp. 471–474. [33] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971. [34] M. Siino, F. Lomonaco, P. Rosso, Backtranslate what you are saying and i will tell who you are, Expert Systems n/a (2024) e13568. doi:https://doi.org/10.1111/exsy.13568. [35] X. Wang, X. Wang, B. Jiang, B. Luo, Few-shot learning meets transformer: Unified query-support transformers for few-shot classification, IEEE Trans. Circuits Syst. Video Technol. 33 (2023) 7789– 7802. URL: https://doi.org/10.1109/TCSVT.2023.3282777. doi:10.1109/TCSVT.2023.3282777. [36] B. M. S. Maia, M. C. F. Ribeiro de Assis, L. M. de Lima, M. B. Rocha, H. G. Calente, M. L. A. Correa, D. R. Camisasca, R. A. Krohling, Transformers, convolutional neural networks, and few-shot learning for classification of histopathological images of oral cancer, Expert Systems with Applications 241 (2024) 122418. URL: https://www.sciencedirect.com/science/article/pii/ S0957417423029202. doi:https://doi.org/10.1016/j.eswa.2023.122418. [37] M. Siino, M. Tesconi, I. Tinnirello, Profiling Cryptocurrency Influencers with Few-Shot Learning Using Data Augmentation and ELECTRA, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2772–2781. [38] Z. Meng, Z. Zhang, Y. Guan, J. Li, L. Cao, M. Zhu, J. Fan, F. Fan, A hierarchical transformer-based adaptive metric and joint-learning network for few-shot rolling bearing fault diagnosis, Measurement Science and Technology 35 (2024). URL: https://www.scopus.com/inward/ record.uri?eid=2-s2.0-85180156886&doi=10.1088%2f1361-6501%2fad11e9&partnerID=40&md5= 5cf48be6a9dc20b836051fb5c8a4c47b. doi:10.1088/1361-6501/ad11e9. [39] F. Muftie, M. Haris, IndoBERT Based Data Augmentation for Indonesian Text Classification, in: 2023 International Conference on Information Technology Research and Innovation, ICITRI 2023, 2023, p. 128 – 132. doi:10.1109/ICITRI59340.2023.10250061. [40] J. M. Tapia-Téllez, H. J. Escalante, Data augmentation with transformers for text classification, in: L. Martínez-Villaseñor, O. Herrera-Alcántara, H. Ponce, F. A. Castro-Espinoza (Eds.), Advances in Computational Intelligence, Springer International Publishing, Cham, 2020, pp. 247–259. [41] M. Siino, I. Tinnirello, XLNet with Data Augmentation to Profile Cryptocurrency Influencers, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2763–2771. A. Online Resources The source code of our submission is available via • https://github.com/marco-siino/eloquent2024