MarSan at PAN: BinocularsLLM, Fusing Binoculars' Insight with the Proficiency of Large Language Models for Machine-Generated Text Detection
Notebook for PAN at CLEF 2024

Ehsan Tavan1,†, Maryam Najafi1,2,†
1 NLP Department, Part AI Research Center, Tehran, Iran
2 Department of Computer Science and Information Systems, University of Limerick, Castletroy, V94 T9PX Limerick, Ireland

Abstract
Large Language Models have revolutionized natural language processing, exhibiting remarkable fluency and quality in generating human-like text. However, this advancement also brings challenges, particularly in distinguishing between human and machine-generated content. In this study, we propose an ensemble framework called BinocularsLLM for the PAN 2024 "Voight-Kampff" Generative AI Authorship Verification task. BinocularsLLM integrates supervised fine-tuning of LLMs with a classification head and the Binoculars framework, demonstrating promising results in detecting machine-generated text. Through extensive experimentation and evaluation, we showcase the effectiveness of our approach, achieving a ROC-AUC score of 96.1%, a Brier score of 92.8%, a C@1 score of 91.2%, an F1 score of 88.4%, and an F0.5u score of 93.2% across all test datasets. BinocularsLLM outperforms all participants and baseline approaches, indicating its superior ability to generalize effectively and distinguish between human and machine-generated content. Our framework achieves first rank among the 30 teams participating in this competition.

Keywords
PAN 2024, Large Language Models, Machine-Generated Text Detection, Instruction Fine-Tuning

1. Introduction
In recent years, Large Language Models (LLMs) have made remarkable advancements, generating text that closely mimics human language with high fluency and quality. Models such as ChatGPT [1], GPT-3 [2], LLaMA [3], and Mistral [4] demonstrate impressive performance in a variety of tasks, including question answering, writing stories, and analyzing program code. These technologies offer significant potential to enhance efficiency and scalability across various domains, driving innovation and productivity [5, 6]. Machine-generated text is now used in a wide range of applications, from powerful chatbots [7] and real-time language translation [8] to analyzing and generating program code [9]. However, the sophistication of these models also introduces new challenges in distinguishing between human-generated and machine-generated content.

The ability to reliably detect machine-generated text is crucial. With the rapid expansion of information on the internet, there is an increased risk of misinformation spreading unchecked. The misuse of LLMs for generating fake news, fake product reviews, and propaganda poses substantial threats to the integrity of online communication. Furthermore, malicious activities such as spamming and fraud are intensified by the advanced capabilities of these models. Effective detection mechanisms are essential to protect against these risks, ensuring that digital content remains trustworthy and authentic. Developing tools and strategies to automatically detect machine-generated texts is essential to mitigate the threats posed by the misuse of LLMs.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.
ehsan.tavan@partdp.ai (E. Tavan); maryam.najafi@ul.ie (M. Najafi)
https://github.com/Ehsan-Tavan (E. Tavan)
ORCID: 0000-0003-1262-8172 (E. Tavan); 0000-0001-5025-2044 (M. Najafi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In the PAN'24 "Voight-Kampff" Generative AI Authorship Verification task [10, 11], participants are faced with an innovative challenge. Their task involves examining two texts: one authored by a human and the other by a machine. The goal is to identify the text authored by a human. This task highlights the ongoing need for robust methods to differentiate between human and machine-generated content, underscoring the importance of continued research and development in this area [12, 13, 10].

In this study, we explore innovative approaches to machine-generated text detection by investigating several key hypotheses. First, we examine whether leveraging LLMs with instruction fine-tuning can enhance the effectiveness of detecting machine-generated content. Second, we test the feasibility of training LLMs with a classification head that uses softmax to produce output labels. Lastly, we investigate whether combining zero-shot techniques, which rely on metrics such as perplexity and entropy, with fine-tuned models can significantly improve the accuracy of machine-generated text detection. These hypotheses aim to push the boundaries of current detection capabilities, potentially leading to breakthroughs in ensuring the authenticity of digital content.

In Section 3, we introduce BinocularsLLM, our proposed ensemble framework, which integrates fine-tuned LLaMA2 [3] and Mistral models with a classification head, while also incorporating the Binoculars [14] model. The framework is evaluated on both the main test dataset and nine additional test datasets, demonstrating notably promising results.

In this paper, we conduct a comprehensive evaluation of the Voight-Kampff Generative AI Authorship Verification task, comparing our proposed framework against both baseline models and state-of-the-art approaches. We have made our code and data publicly available in our GitHub repository1, and our fine-tuned models are available on Hugging Face: Generative-AV-Mistral-v0.1-7b2 and Generative-AV-LLaMA-2-7b3. The rest of this paper is organized as follows: Section 2 reviews the relevant background literature. Section 3 introduces BinocularsLLM. Section 4 details the evaluation metrics and presents the experimental results.

2. Background
The detection of machine-generated text has become a critical area of research, driven by the rapid advancement and widespread use of large language models (LLMs) such as GPT-4 [15], PaLM [16], and ChatGPT. The task is typically formulated as a classification problem. This section reviews existing methodologies, categorized into supervised learning approaches, zero-shot detection models, and watermarking techniques.

Supervised Learning Approaches: Supervised learning methods train classifiers on labeled datasets [17, 18, 19]. Models such as the GPT-2 Detector [20] and the ChatGPT Detector [21] fine-tune pre-trained models such as RoBERTa [22] on the output of GPT-2 [23] and on the HC3 [21] dataset. While these models demonstrate high accuracy within their training domains, they often struggle to generalize to out-of-domain texts [24, 25]. Techniques such as adversarial training [26] and abstention [27] have been explored to enhance robustness, but challenges remain, particularly in maintaining low false positive rates across diverse text distributions [28].
Zero-Shot Detection Models: Another approach to identifying machine-generated text involves zero-shot detection models, which leverage statistical features of texts without requiring explicit training on labeled datasets. These models, such as DetectGPT [29] and others [30, 31], analyze universal features inherent in machine-generated texts. They exploit concepts like entropy, perplexity, and n-gram frequencies to distinguish between human and machine-generated text. These models offer robustness across different types of text and languages, circumventing the domain-specific limitations of supervised classifiers [30]. However, the computational demands remain a significant challenge, particularly in methods relying on probability curvature and extensive perturbations [29, 31].

Watermarking Techniques: Watermarking involves embedding detectable patterns into the generated text that are imperceptible to humans but identifiable by algorithms. Grinbaum and Adomaitis [32] and Abdelnabi and Fritz [33] utilized syntax tree manipulation to embed watermarks, while Kirchenbauer et al. [34] required access to the LLM's logits to modify token probabilities. Although effective, these methods necessitate control over the text generation process, limiting their applicability to scenarios where such control is feasible.

1 https://github.com/MarSanTeam/BinocularsLLM
2 https://huggingface.co/Ehsan-Tavan/Generative-AV-Mistral-v0.1-7b
3 https://huggingface.co/Ehsan-Tavan/Generative-AV-LLaMA-2-7b

Figure 1: Instruction fine-tuning process (samples from the training set are converted into instruction-tuning pairs; the fine-tuned LLM outputs a label, where 0 means text 1 is human-generated and 1 means text 2 is human-generated).

3. System Overview
In this section, we present BinocularsLLM, our ensemble framework for the PAN'24 "Voight-Kampff" Generative AI Authorship Verification task, with a focus on detecting machine-generated text. Our goals are twofold: to compare the effectiveness of classification-head fine-tuning with instruction fine-tuning, and to integrate the power of the Binoculars technique with fine-tuned LLMs. Both approaches utilize QLoRA, ensuring that only the QLoRA adapter weights and the classification head are trained, rather than all the parameters of the LLM. The Binoculars model4 employs observer and performer models to evaluate perplexity and entropy, critical metrics for identifying machine-generated text. By integrating these evaluations with the capabilities of supervised fine-tuning, our ensemble is designed to distinguish between human and machine-generated text.

Based on our experiments, we observed that LLMs equipped with a classification head detected machine-generated texts more effectively than instruction fine-tuning. Consequently, BinocularsLLM integrates two fine-tuned LLMs, LLaMA2 and Mistral (selected based on the results in Table 1), alongside the Binoculars approach. This comprehensive approach leverages both statistical metrics and LLM fine-tuning, ensuring robust and accurate detection of machine-generated text.

4 https://github.com/ahans30/Binoculars
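To make the Binoculars signal concrete, the following sketch computes a simplified, ratio-style score in the spirit of Hans et al. [14]: an ordinary log-perplexity divided by a cross-perplexity between two closely related causal language models. The stand-in model names, the helper function, and the exact pairing of which model supplies the numerator are illustrative assumptions rather than the released implementation (footnote 4).

```python
# Simplified, ratio-style score in the spirit of Binoculars (illustrative only;
# the released implementation in footnote 4 is authoritative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

OBSERVER_NAME = "gpt2"         # stand-in observer model (assumption)
PERFORMER_NAME = "distilgpt2"  # stand-in performer model with the same vocabulary (assumption)

tokenizer = AutoTokenizer.from_pretrained(OBSERVER_NAME)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER_NAME).eval()
performer = AutoModelForCausalLM.from_pretrained(PERFORMER_NAME).eval()

@torch.no_grad()
def binoculars_style_score(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    targets = enc["input_ids"][:, 1:]            # tokens 2..n are the prediction targets
    obs_logits = observer(**enc).logits[:, :-1]
    perf_logits = performer(**enc).logits[:, :-1]

    # Log-perplexity of the text under the observer model.
    obs_logprobs = torch.log_softmax(obs_logits, dim=-1)
    log_ppl = -obs_logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()

    # Cross-perplexity: the observer's next-token distribution scored by the performer.
    perf_logprobs = torch.log_softmax(perf_logits, dim=-1)
    cross_ppl = -(obs_logprobs.exp() * perf_logprobs).sum(dim=-1).mean()

    # Lower ratios tend to indicate machine-generated text.
    return (log_ppl / cross_ppl).item()
```

In BinocularsLLM this zero-shot signal is not used in isolation; Section 3.3 describes how it is combined with the two fine-tuned classifiers.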
3.1. Instruction Fine-Tuning for Machine-Generated Text Detection
Instruction Fine-Tuning (IT) involves further training LLMs with specific input-output pairs and accompanying instructions in a supervised manner. This approach has proven effective in enhancing an LLM's ability to generalize to new, unseen tasks [35] and is considered a viable strategy for improving LLM alignment [36, 37]. In our study on Voight-Kampff Generative AI Authorship Verification, we examine the efficacy of the IT method. Specifically, we evaluate the performance of various LLMs when fine-tuned with a specific set of instructions. This process involves creating an instruction dataset V comprising instruction pairs s = (INSTRUCTION, OUTPUT). Each instruction s is generated using a fixed template and samples x from the training dataset R. These samples are labeled x_l based on their corresponding labels in dataset R. Figure 1 illustrates our instruction fine-tuning process.

The resulting instruction text detection dataset V consists of instruction pairs along with their source labels. A label of 0 indicates that the first text is human-generated, while a label of 1 indicates that the second text is human-generated. Thus, the instruction text detection dataset V is formally represented as V = {(instruction, x, x_l) | x ∈ R}. Here is an illustration of the instruction format:

Instruction: I provide two texts and ask you to determine which one is authored by humans and which one is authored by machines. Your output is simply a 0 or 1; do not generate any additional text. 0 indicates Text1 is authored by the machine, and 1 indicates Text2 is authored by the machine.
Text1: [x_text1]
Text2: [x_text2]
Response: [x_r]

Given an LLM with parameters θ as the initial model for instruction tuning, training the model on the constructed instruction dataset V adapts the LLM's parameters from θ to θ_v, referred to as the LLM-Detector. Specifically, θ_v is obtained by maximizing the probability of predicting the next tokens in the OUTPUT component of each instruction sample s, conditioned on the INSTRUCTION. This process is formulated as follows:

\theta_v = \arg\max_{\theta} \sum_{s \in V} \log P(\mathrm{OUTPUT} \mid \mathrm{INSTRUCTION}; \theta, s) \quad (1)

3.2. Supervised Fine-Tuning LLMs
Fine-tuning LLMs involves adjusting model weights using a labeled dataset to enhance performance on specific tasks. This process can be computationally intensive, particularly for full fine-tuning of an LLM, due to its substantial memory demands. To address these challenges, Parameter-Efficient Fine-Tuning (PEFT) [38] techniques such as LoRA [39] and QLoRA [40] are employed. LoRA fine-tunes only two smaller matrices that approximate the larger weight matrix, reducing memory requirements and preserving the original LLM weights. Taking a step further, QLoRA enhances memory efficiency by quantizing the frozen base model weights to a lower precision, such as 4-bit, without compromising effectiveness. Employing these fine-tuning techniques for both classification-head fine-tuning and instruction fine-tuning augments the LLM's capacity to accurately distinguish between machine-generated and human-generated text.

The Mistral and LLaMA2 models are fine-tuned exclusively on the provided bootstrap dataset using the QLoRA technique. Each input example consists of a text string ⟨TEXT⟩ and a corresponding label ⟨LABEL⟩ that indicates the source of the text. The input format can be represented as:

\langle \mathrm{TEXT} \rangle : \langle \mathrm{LABEL} \rangle, \quad \text{where } \mathrm{LABEL} = \begin{cases} 1 & \text{for human-generated text} \\ 0 & \text{for machine-generated text} \end{cases}
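A minimal sketch of how such a QLoRA classification-head setup can be assembled with the Hugging Face transformers and peft libraries is shown below. The LoRA rank, alpha, and 4-bit quantization mirror the settings later reported in Section 4.1, while the base-model name, target modules, dropout, and padding handling are illustrative assumptions rather than the authors' exact training script.

```python
# Illustrative QLoRA setup for binary (human vs. machine) text classification.
# Base-model name, target modules, and dropout are assumptions, not the authors'
# exact configuration; rank 64, alpha 16, and 4-bit quantization follow Section 4.1.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "mistralai/Mistral-7B-v0.1"   # or a LLaMA-2-7B checkpoint

bnb_config = BitsAndBytesConfig(           # 4-bit quantization of the frozen base model
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,                          # 0 = machine-generated, 1 = human-generated
    quantization_config=bnb_config,
)
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,                                  # LoRA rank, as in Section 4.1
    lora_alpha=16,                         # LoRA alpha, as in Section 4.1
    lora_dropout=0.05,                     # illustrative value
    task_type="SEQ_CLS",
    target_modules=["q_proj", "v_proj"],   # illustrative choice of projections
)
model = get_peft_model(model, lora_config) # only the adapters and classifier head train
model.print_trainable_parameters()
```

Training then proceeds with a standard supervised loop (e.g., the transformers Trainer) over ⟨TEXT⟩ : ⟨LABEL⟩ pairs, using the optimizer and batch settings listed in Section 4.1.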
3.3. Inference Time
During the inference phase, the process begins by receiving two texts as input. Each text is processed separately by the fine-tuned LLaMA2 and Mistral models to predict the probability of its being human-written. If the probability assigned to the first text exceeds that of the second text, the score for the input sample is the difference between the first and the second score (score 1 - score 2); conversely, if the probability of the second text is greater, the sample is scored as 0. Additionally, the input is processed by the Binoculars model, which generates a score for each text using its specialized algorithm. If the Binoculars score of the first text exceeds that of the second text, the input score is assigned as 0; otherwise, it is assigned as 1. Figure 2 illustrates BinocularsLLM.

Figure 2: Overview of the BinocularsLLM framework (each of the three components, fine-tuned LLaMA2, fine-tuned Mistral, and Binoculars, contributes a score of the form "score 1 - score 2 if score 1 - score 2 > 0 else 0.0"; the mean of these scores is compared against 0.5 to decide which text is human-generated).
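The score-combination step can be summarized with a short sketch. The helper callables below (llama_prob, mistral_prob, binoculars_score) are hypothetical stand-ins for the fine-tuned classifiers and the Binoculars scorer; the per-model rule and the averaging follow Figure 2.

```python
# Illustrative score calculator for the BinocularsLLM ensemble (cf. Figure 2).
# llama_prob, mistral_prob, and binoculars_score are hypothetical stand-ins for
# the fine-tuned classifiers and the Binoculars scorer.
from typing import Callable

def pairwise_score(s1: float, s2: float) -> float:
    """Per-model rule from Figure 2: score 1 - score 2 if positive, else 0.0."""
    diff = s1 - s2
    return diff if diff > 0 else 0.0

def ensemble_score(text1: str, text2: str,
                   llama_prob: Callable[[str], float],
                   mistral_prob: Callable[[str], float],
                   binoculars_score: Callable[[str], float]) -> float:
    scores = [
        pairwise_score(llama_prob(text1), llama_prob(text2)),
        pairwise_score(mistral_prob(text1), mistral_prob(text2)),
        pairwise_score(binoculars_score(text1), binoculars_score(text2)),
    ]
    return sum(scores) / len(scores)   # mean of the three per-model scores
```

The resulting mean is then compared against the 0.5 threshold shown in Figure 2 to decide which of the two texts is human-written.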
4. Results
In this section, we present the implementation details and evaluation metrics and provide a comprehensive analysis of the results. We use the TIRA [41] platform to evaluate our framework on the test datasets.

4.1. Implementation Details
In this research, the framework was implemented in PyTorch and executed on Nvidia V100 GPUs. Training was conducted for 5 epochs using the AdamW optimizer with a learning rate of 2e-5. The training batch size was set to 2, with gradient accumulation set to 8. For QLoRA, we configured LoRA's rank to 64 and its alpha to 16, employing 4-bit quantization. To evaluate the fine-tuned models, we used 20% of the given dataset as a development dataset.

4.2. Evaluation Metrics
To evaluate the performance of our proposed model, we used the evaluation metrics provided by PAN (a rough sketch of how they can be computed follows this list):
• ROC-AUC: the conventional area under the ROC curve.
• c@1: rewards systems that leave complicated problems unanswered.
• F0.5u: focuses on deciding same-author cases correctly.
• F1-score: the harmonic mean of the model's precision and recall.
• Brier: evaluates the accuracy of probabilistic predictions.
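As an informal illustration, the sketch below computes several of these metrics from paired-text scores in [0, 1] and gold labels. The function names are ours; treating a score of exactly 0.5 as a non-answer, reporting the complement of the Brier loss, and omitting F0.5u (whose exact definition is implemented in the official PAN evaluator) are assumptions of this sketch.

```python
# Rough sketch of PAN-style metrics on paired-text scores in [0, 1].
# A score of exactly 0.5 is treated as a non-answer (assumption); F0.5u is
# omitted because the official PAN evaluator defines it precisely.
from sklearn.metrics import roc_auc_score, brier_score_loss, f1_score

def c_at_1(y_true, scores):
    n = len(y_true)
    answered = [(t, s) for t, s in zip(y_true, scores) if s != 0.5]
    n_u = n - len(answered)                              # unanswered problems
    n_c = sum(1 for t, s in answered if (s > 0.5) == bool(t))
    return (n_c + n_u * n_c / n) / n

def pan_metrics(y_true, scores):
    binarized = [1 if s > 0.5 else 0 for s in scores]
    return {
        "roc_auc": roc_auc_score(y_true, scores),
        "brier": 1 - brier_score_loss(y_true, scores),   # complement so higher is better (assumed)
        "c@1": c_at_1(y_true, scores),
        "f1": f1_score(y_true, binarized),
    }

print(pan_metrics([1, 0, 1, 1], [0.9, 0.2, 0.5, 0.7]))
```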
4.3. Result Analysis on Development Dataset
As mentioned earlier, we compare two fine-tuning approaches for detecting machine-generated text: instruction fine-tuning and classification-head fine-tuning. The performance of various LLMs under these two methodologies on the development dataset is reported in Table 1. Based on these results, we select the two top-performing LLMs to integrate into our ensemble framework.

Table 1: Performance of different LLMs under classification head fine-tuning and instruction tuning using the development dataset.

                       Classification Head                         Instruction Fine-Tuning
Model        roc    brier  c@1    f1     f05u   mean       roc    brier  c@1    f1     f05u   mean
llama3-7B    1      0.997  0.995  0.995  0.998  0.997      -      -      -      -      -      -
llama2-7B    1      1      1      1      1      1          0.528  0      0.532  0.568  0.586  0.443
Mistral-7B   1      0.995  0.995  0.995  0.998  0.997      0.835  0.853  0.853  0.882  0.838  0.852
SOLAR-7B     1      0.986  0.986  0.984  0.993  0.99       0.623  0.624  0.624  0.655  0.672  0.64
zephyr-7B    1      0.986  0.986  0.984  0.993  0.99       0.526  0.532  0.532  0.582  0.588  0.552

In analyzing the results presented in Table 1, it becomes evident that both the LLaMA2-7B and Mistral-7B models, fine-tuned with a classification head, demonstrate promising performance across various evaluation metrics on our development dataset. LLaMA2-7B achieves exceptional scores across all metrics with the classification head fine-tuning approach, showcasing its robustness in distinguishing between human and machine-generated text. Mistral-7B also shows notable performance, indicating its efficacy in authorship verification tasks. These findings demonstrate the effectiveness of employing classification head fine-tuning for both LLaMA2-7B and Mistral-7B within the BinocularsLLM framework. Comparing classification head fine-tuning with instruction tuning, we observe that classification head fine-tuning yields superior performance, indicating that it is more effective than instruction tuning for enhancing the performance of LLMs in distinguishing between human and machine-generated text.

Table 2: Overview of the accuracy in detecting whether a text is written by a human in Task 4 at PAN 2024 (Voight-Kampff Generative AI Authorship Verification). We report ROC-AUC, Brier, C@1, F1, F0.5u, and their mean.

Approach                  ROC-AUC  Brier  C@1    F1     F0.5u  Mean
BinocularsLLM             1.0      0.995  0.997  0.997  0.999  0.998
Binoculars                0.972    0.957  0.966  0.964  0.965  0.965
Fast-DetectGPT (Mistral)  0.876    0.8    0.886  0.883  0.883  0.866
PPMd                      0.795    0.798  0.754  0.753  0.749  0.77
Unmasking                 0.697    0.774  0.691  0.658  0.666  0.697
Fast-DetectGPT            0.668    0.776  0.695  0.69   0.691  0.704

4.4. Results on Blinded Test Dataset
As Table 2 shows, BinocularsLLM achieved outstanding performance across multiple evaluation metrics on the PAN 2024 Task 4 (Voight-Kampff Generative AI Authorship Verification) main test dataset, demonstrating its effectiveness in detecting machine-generated text. With a perfect ROC-AUC score of 1.0 and a Brier score close to 1.0, BinocularsLLM exhibits high discriminative ability and excellent calibration. Additionally, BinocularsLLM outperforms all baseline approaches in terms of C@1, F1, and F0.5u scores. The mean evaluation score further underscores the robustness and reliability of the BinocularsLLM framework in distinguishing between human and machine-generated text.

Table 3 presents the analysis of BinocularsLLM across nine variants of the test set. The mean accuracy over these variants provides insight into the generalization capability of different approaches across diverse datasets. Among the approaches evaluated, the BinocularsLLM framework achieved the highest mean accuracy, with a median score of 0.990, indicating strong performance across the test variants.

Figure 3: Comparison of model accuracy across different quantiles for various approaches.

Table 3: Overview of the mean accuracy over 9 variants of the test set. We report the minimum, the 25-th quantile, the median, the 75-th quantile, and the maximum of the mean per the 9 datasets.

Approach                  Minimum  25-th Quantile  Median  75-th Quantile  Max
BinocularsLLM             0.887    0.976           0.990   0.998           1.000
Binoculars                0.342    0.818           0.844   0.965           0.996
Fast-DetectGPT (Mistral)  0.095    0.793           0.842   0.931           0.958
PPMd                      0.270    0.546           0.750   0.770           0.863
Unmasking                 0.250    0.662           0.696   0.697           0.762
Fast-DetectGPT            0.159    0.579           0.704   0.719           0.982
95-th quantile            0.863    0.971           0.978   0.990           1.000
75-th quantile            0.758    0.865           0.933   0.959           0.991
Median                    0.605    0.645           0.875   0.889           0.936
25-th quantile            0.353    0.496           0.658   0.675           0.711
Min                       0.015    0.038           0.231   0.244           0.252

Compared to the baseline approaches, BinocularsLLM consistently outperforms them, showcasing its superior ability to generalize effectively. The performance of the baseline approaches varies significantly across different datasets, as evidenced by the wide range between the minimum and maximum scores. This suggests that while some approaches exhibit consistent performance across diverse datasets, others struggle to generalize effectively. Further analysis of the quantile values elucidates the distribution of performance scores, highlighting the variability and the challenge of achieving consistent accuracy across different test variants. PPMd and Unmasking display moderate performance, with median accuracies of 0.750 and 0.696, respectively. However, their lower quantiles, particularly the minimum and the 25-th quantile, indicate significant variability and potential instability in their performance. Fast-DetectGPT shows the most variability among the baselines, with a minimum accuracy of 0.159 and a maximum of 0.982; this wide range suggests inconsistency and unreliability across different test scenarios.

The comparative analysis presented in Figure 4 illustrates the discernible impact of training on the Mistral and LLaMA2 models. Before training, both models exhibited limited ability to discriminate between AI-generated and human-written text on our development dataset, as evidenced by the overlapping distributions of data points in their respective scatter plots. After training, a noticeable refinement emerges, with the models demonstrating enhanced proficiency in distinguishing between the two text categories. The scatter plots after training reveal a clearer separation between AI-generated and human-written text samples, indicating an improvement in the models' ability to capture distinguishing features inherent to each text type.

Figure 4: Comparison of the Mistral and LLaMA2 models before and after training, focusing on their ability to distinguish between AI-generated and human-written text.

4.5. Leaderboard on Test Datasets
Our team, MarSan, achieves the top position on the task leaderboard among 30 teams with our BinocularsLLM framework and demonstrates strong performance across various metrics. Table 4 outlines the performance of the top 10 teams in the competition.

Table 4: Leaderboard on the test datasets.

Ranking  Team              ROC-AUC  Brier  C@1    F1     F0.5u  Mean
1        MarSan (ours)     0.961    0.928  0.912  0.884  0.932  0.924
2        you-shun-you-de   0.931    0.926  0.928  0.905  0.913  0.921
3        baselineavengers  0.925    0.869  0.882  0.875  0.869  0.886
4        g-fosunlpteam     0.889    0.875  0.887  0.884  0.884  0.884
5        lam               0.851    0.850  0.850  0.852  0.849  0.851
6        drocks            0.866    0.863  0.834  0.825  0.820  0.843
7        aida              0.831    0.825  0.795  0.788  0.782  0.806
8        cnlp-nits-pp      0.844    0.793  0.805  0.789  0.792  0.806
9        fosu-stu          0.833    0.867  0.799  0.748  0.767  0.804
10       ap-team           0.853    0.862  0.795  0.718  0.742  0.796

5.
Conclusion In conclusion, the BinocularsLLM framework for the PAN 2024 "Voight-Kampff" Generative AI Au- thorship Verification task demonstrates significant advancements in detecting machine-generated text. Through the integration of supervised fine-tuning of LLMs with a classification head and the Binoculars model, we have achieved outstanding performance, as evidenced by a perfect ROC-AUC score of 1.0 and a Brier score close to 1.0 on the main test dataset. BinocularsLLM framework outperforms all baseline approaches in crucial evaluation metrics, highlighting its robustness and effectiveness in distinguishing between human and machine-generated content. Looking ahead, the success of our approach opens up exciting avenues for future research, including exploring more sophisticated ensemble techniques, investigating the impact of different fine-tuning strategies, and addressing challenges related to scal- ability and computational efficiency. By continuing to innovate in this critical area, we can further advance the field of machine-generated text detection and contribute to enhancing the trustworthiness and authenticity of digital content. References [1] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022) 27730–27744. [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165. [3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhar- gava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288. [4] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825. [5] G. Jawahar, M. Abdul-Mageed, L. V. Lakshmanan, Automatic detection of machine generated text: A critical survey, arXiv preprint arXiv:2011.01314 (2020). [6] N. Lu, S. Liu, R. He, Q. Wang, Y.-S. Ong, K. Tang, Large language models can be guided to evade ai-generated text detection, arXiv preprint arXiv:2305.10847 (2023). [7] D. Bill, T. Eriksson, Fine-tuning a llm using reinforcement learning from human feedback for a therapy chatbot application, 2023. [8] Y. Moslem, R. Haque, J. D. Kelleher, A. Way, Adaptive machine translation with large language models, 2023. 
arXiv:2301.13294. [9] M. Nejjar, L. Zacharias, F. Stiehle, I. Weber, Llms for science: Usage for code generation and data analysis, arXiv preprint arXiv:2311.16733 (2023). [10] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Ko- renčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024. [11] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024. [12] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. URL: https://link. springer.com/chapter/10.1007/978-3-031-28241-6_20. doi:10.1007/978-3-031-28241-6_20. [13] J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the Voight-Kampff Generative AI Authorship Verification Task at PAN 2024, in: G. F. N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [14] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting llms with binoculars: Zero-shot detection of machine-generated text, 2024. arXiv:2401.12070. [15] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al- tenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. 
Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Hei- decke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich, A. Konstantini- dis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Wein- mann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, B. Zoph, Gpt-4 technical report, 2024. arXiv:2303.08774. [16] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, Palm: Scaling language modeling with pathways, 2022. arXiv:2204.02311. [17] M. Najafi, S. Sadidpur, Paa: Persian author attribution using dense and recursive connection (2024). [18] E. Tavan, M. Najafi, R. 
Moradi, Identifying ironic content spreaders on twitter using psychometrics, contextual and ironic features with gradient boosting classifier., in: CLEF (Working Notes), 2022, pp. 2687–2697. [19] M. Najafi, E. Tavan, Text-to-text transformer in authorship verification via stylistic and semantical analysis., 2022. [20] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, et al., Release strategies and the social impacts of language models, arXiv preprint arXiv:1908.09203 (2019). [21] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is chatgpt to human experts? comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023). [22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019). [23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9. [24] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, A. Szlam, Real or fake? learning to discriminate machine from human generated text, 2019. arXiv:1906.03351. [25] A. Uchendu, T. Le, K. Shu, D. Lee, Authorship attribution for neural text generation, in: B. Web- ber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 8384–8395. URL: https://aclanthology.org/2020.emnlp-main.673. doi:10.18653/v1/ 2020.emnlp-main.673. [26] X. Hu, P.-Y. Chen, T.-Y. Ho, Radar: Robust ai-text detection via adversarial learning, 2023. arXiv:2307.03838. [27] Y. Tian, H. Chen, X. Wang, Z. Bai, Q. Zhang, R. Li, C. Xu, Y. Wang, Multiscale positive-unlabeled detection of ai-generated texts, arXiv preprint arXiv:2305.18149 (2023). [28] W. Liang, M. Yuksekgonul, Y. Mao, E. Wu, J. Zou, Gpt detectors are biased against non-native english writers, 2023. arXiv:2304.02819. [29] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, Detectgpt: Zero-shot machine-generated text detection using probability curvature, 2023. arXiv:2301.11305. [30] S. Gehrmann, H. Strobelt, A. Rush, GLTR: Statistical detection and visualization of generated text, in: M. R. Costa-jussà, E. Alfonseca (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 111–116. URL: https://aclanthology.org/P19-3019. doi:10. 18653/v1/P19-3019. [31] J. Su, T. Y. Zhuo, D. Wang, P. Nakov, Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text, arXiv preprint arXiv:2306.05540 (2023). [32] A. Grinbaum, L. Adomaitis, The ethical need for watermarks in machine-generated language, 2022. arXiv:2209.03118. [33] S. Abdelnabi, M. Fritz, Adversarial watermarking transformer: Towards tracing text provenance with data hiding, 2021. arXiv:2009.03015. [34] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein, A watermark for large language models, 2024. arXiv:2301.10226. [35] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al., The flan collection: Designing data and methods for effective instruction tuning, in: International Conference on Machine Learning, PMLR, 2023, pp. 22631–22648. [36] R. Taori, I. Gulrajani, T. Zhang, Y. 
Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanford_alpaca, 2023. [37] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, O. Levy, Lima: Less is more for alignment, 2023. arXiv:2305.11206. [38] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, Peft: State-of-the-art parameter- efficient fine-tuning methods, https://github.com/huggingface/peft, 2022. [39] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, 2021. arXiv:2106.09685. [40] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, 2023. arXiv:2305.14314. [41] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/ 978-3-031-28241-6_20.