<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>MarSan at PAN: BinocularsLLM, Fusing Binoculars' Insight with the Proficiency of Large Language Models for Machine-Generated Text Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ehsan Tavan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maryam Najafi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, University of Limerick</institution>
          ,
          <addr-line>Castletroy, V94 T9PX Limerick</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLP Department, Part AI Research Center</institution>
          ,
          <addr-line>Tehran</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Large Language Models have revolutionized natural language processing, exhibiting remarkable fluency and quality in generating human-like text. However, this advancement also brings challenges, particularly in distinguishing between human and machine-generated content. In this study, we propose an ensemble framework called BinocularsLLM for the PAN 2024 'Voight-Kampf' Generative AI Authorship Verification task. BinocularsLLM integrates supervised fine-tuning of LLMs with a classification head and the Binoculars framework, demonstrating promising results in detecting machine-generated text. Through extensive experimentation and evaluation, we showcase the effectiveness of our approach in addressing this critical task, achieving a ROC-AUC score of 96.1%, a Brier score of 92.8%, a C@1 score of 91.2%, an F1 score of 88.4%, and an F0.5u score of 93.2% across all test datasets. BinocularsLLM outperforms all participants and baseline approaches, indicating its superior ability to generalize effectively and distinguish between human and machine-generated content. Our framework achieves the first rank among 30 teams participating in this competition.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2024</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Machine-Generated Text Detection</kwd>
        <kwd>Instruction Fine-Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have made remarkable advancements, generating
text that closely mimics human language with high fluency and quality. Models such as ChatGPT
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], GPT-3 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], LLaMa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and Mistral [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] demonstrate impressive performance in a variety of tasks
including question-answering, writing stories, and analyzing program code. These technologies offer
significant potential to enhance efficiency and scalability across various domains, driving innovation
and productivity [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        Machine-generated text is now used in a wide range of applications, from powerful chatbots [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
and real-time language translation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to analyzing and generating program code [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the
sophistication of these models also introduces new challenges in distinguishing between
human-generated and machine-generated content.
      </p>
      <p>The ability to reliably detect machine-generated text is crucial. With the rapid expansion of
information on the internet, there is an increased risk of misinformation spreading unchecked. The misuse of
LLMs for generating fake news, fake product reviews, and propaganda poses substantial threats to the
integrity of online communication. Furthermore, malicious activities such as spamming and fraud are
intensified by the advanced capabilities of these models. Effective detection mechanisms are essential to
protect against these risks, ensuring that digital content remains trustworthy and authentic. Developing
tools and strategies to automatically detect machine-generated texts is essential to mitigate the threats
posed by the misuse of LLMs.</p>
      <p>
        In the PAN’24 "Voight-Kampf" Generative AI Authorship Verification task [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], participants are
faced with an innovative challenge. Their task involves examining two texts: one authored by a human
and the other by a machine. The goal is to identify the text authored by a human. This task highlights
the ongoing need for robust methods to differentiate between human and machine-generated content,
underscoring the importance of continued research and development in this area [
        <xref ref-type="bibr" rid="ref10">12, 13, 10</xref>
        ].
      </p>
      <p>In this study, we explore innovative approaches to machine-generated text detection by investigating
several key hypotheses. First, we examine whether leveraging LLMs with instruction fine-tuning can
enhance the effectiveness of detecting machine-generated content. Second, we test the feasibility of
training LLMs with a classification head that utilizes softmax to produce accurate output labels. Lastly,
we investigate whether combining zero-shot techniques, which utilize metrics like perplexity and
entropy, with fine-tuned models can significantly improve the accuracy of machine-generated text
detection. These hypotheses aim to push the boundaries of our current detection capabilities, potentially
leading to breakthroughs in ensuring the authenticity of digital content.</p>
      <p>
        In Section 3, we introduce BinocularsLLM, our proposed ensemble framework, which integrates
fine-tuned LLaMA2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Mistral models with a classification head, while also incorporating the
Binoculars [14] model. This framework undergoes evaluation on both the main and nine additional test
datasets, demonstrating notably promising results.
      </p>
      <p>In this paper, we conduct a comprehensive evaluation of Voight-Kampf Generative AI Authorship
Verification tasks, comparing our proposed framework against both baseline models and state-of-the-art
approaches. We have made our code and data publicly available on our GitHub repository
(https://github.com/MarSanTeam/BinocularsLLM), and our fine-tuned models are available on Hugging Face:
Generative-AV-Mistral-v0.1-7b (https://huggingface.co/Ehsan-Tavan/Generative-AV-Mistral-v0.1-7b) and
Generative-AV-LLaMA-2-7b (https://huggingface.co/Ehsan-Tavan/Generative-AV-LLaMA-2-7b). Our
contributions are organized as follows: Section 2 reviews the relevant background
literature. Section 3 introduces BinocularsLLM. Section 4 details the evaluation metrics and presents
the experimental results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The detection of machine-generated text has become a critical area of research, driven by the rapid
advancement and widespread use of large language models (LLMs) such as GPT-4 [15], PaLM [16],
and ChatGPT. This task is typically formulated as a classification problem. This section reviews
existing methodologies categorized into supervised learning approaches, zero-shot detection models,
and watermarking techniques.</p>
      <p>Supervised Learning Approaches: Supervised learning methods train classifiers on labeled datasets
[17, 18, 19]. Models like GPT2 Detector [20] and ChatGPT Detector [21] fine-tune pre-trained models
such as RoBERTa [22] on the output of GPT2 [23] and the HC3 [21] dataset. While these models
demonstrate high accuracy within their training domains, they often struggle with generalization to
out-of-domain texts [24, 25]. Techniques such as adversarial training [26] and abstention [27] have been
explored to enhance robustness, but challenges remain, particularly in maintaining low false positive
rates across diverse text distributions [28].</p>
      <p>Zero-Shot Detection Models: Another approach to identifying machine-generated text involves
zero-shot detection models, which leverage statistical features in texts without requiring explicit
training on labeled datasets. These models, such as DetectGPT [29] and others [30, 31], analyze
universal features inherent in machine-generated texts. They exploit concepts like entropy, perplexity,
and n-gram frequencies to distinguish between human and machine-generated text. These models offer
robustness across different types of text and languages, circumventing the domain-specific limitations
of supervised classifiers [30]. However, the computational demands remain a significant challenge,
particularly in methods relying on probability curvature and extensive perturbations [29, 31].</p>
      <sec id="sec-2-1">
        <title>1https://github.com/MarSanTeam/BinocularsLLM 2https://huggingface.co/Ehsan-Tavan/Generative-AV-Mistral-v0.1-7b 3https://huggingface.co/Ehsan-Tavan/Generative-AV-LLaMA-2-7b</title>
        <p>Watermarking Techniques: Watermarking involves embedding detectable patterns into the
generated text that are imperceptible to humans but identifiable by algorithms. Grinbaum and Adomaitis
[32] and Abdelnabi and Fritz [33] utilized syntax tree manipulation to embed watermarks, while
Kirchenbauer et al. [34] required access to the LLM’s logits to modify token probabilities. Although
effective, these methods necessitate control over the text generation process, limiting their applicability
to scenarios where such control is feasible.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>In this section, we present BinocularsLLM, our ensemble framework to address the PAN’24
"VoightKampf" Generative AI Authorship Verification task, with a focus on detecting machine-generated
text. Our goals are twofold: to compare the efectiveness of classification-head fine-tuning with
instruction fine-tuning and to integrate the power of the Binoculars technique with fine-tuned LLMs.
Both approaches utilize QLoRA, ensuring that only the QLoRA and the classification head weights are
trained, not all the parameters of the LLM.</p>
      <p>The Binoculars model employs observer and performer models to evaluate perplexity and entropy,
critical metrics for identifying machine-generated text. By integrating these evaluations with the
advanced capabilities of supervised fine-tuning, our ensemble is designed to be capable of distinguishing
between human and machine-generated text.</p>
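      <p>To make these signals concrete, the following minimal sketch illustrates a Binoculars-style score: the log-perplexity of the input under one model divided by the cross-perplexity between the two models' next-token distributions. The model names, the observer/performer pairing, and the decision threshold here are illustrative placeholders, not the exact configuration of the Binoculars release.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins for the observer/performer pair; the actual
# Binoculars release pairs two closely related causal LLMs.
observer = AutoModelForCausalLM.from_pretrained("gpt2")
performer = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    obs_logits = observer(ids).logits[:, :-1]
    perf_logits = performer(ids).logits[:, :-1]
    targets = ids[:, 1:]

    # Log-perplexity of the text under the performer model.
    log_ppl = torch.nn.functional.cross_entropy(
        perf_logits.transpose(1, 2), targets
    )

    # Cross-perplexity: the performer's negative log-probabilities
    # averaged under the observer's next-token distribution.
    obs_probs = obs_logits.softmax(dim=-1)
    perf_logprobs = perf_logits.log_softmax(dim=-1)
    x_ppl = -(obs_probs * perf_logprobs).sum(dim=-1).mean()

    # In the Binoculars paper, lower ratios tend to indicate
    # machine-generated text.
    return (log_ppl / x_ppl).item()
      </preformat>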
      <p>Based on our experiments, we observed that LLM models employing a classification head performed
more effectively in detecting machine-generated texts than instruction fine-tuning.
Consequently, BinocularsLLM integrates two fine-tuned LLMs, LLaMA2 and Mistral (selected based on
the results in Table 1), alongside the Binoculars approach. This comprehensive approach leverages
the capabilities of statistical metrics and LLM fine-tuning, ensuring robust and accurate detection of
machine-generated text.</p>
      <sec id="sec-3-1">
        <title>3.1. Instruction Fine-Tuning for Machine-Generated Text Detection</title>
        <p>Instruction Fine-Tuning (IT) involves further training LLMs with specific input-output pairs and
accompanying instructions in a supervised manner. This approach has proven effective in enhancing an
LLM’s ability to generalize to new, unseen tasks [35] and is considered a viable strategy for improving
LLM alignment [36, 37].</p>
        <p>In our study on Voight-Kampf Generative AI Authorship Verification, we examine the efficacy of the
IT method. Specifically, we evaluate various LLMs’ performance when fine-tuned with a specific set
of instructions. This process involves creating an instruction dataset D_I, comprising instruction pairs
t = (INSTRUCTION, OUTPUT). Each instruction t is generated using a fixed template and samples x
from the training dataset D. These samples are labeled y based on their corresponding labels in dataset
D. Figure 1 illustrates our instruction fine-tuning process.</p>
        <sec id="sec-3-1-1">
          <title>Here’s an illustration of the instruction format:</title>
          <p>Instruction: I provide two texts and ask you to determine which one is authored by humans
and which one is authored by machines. Your output is simply a 0 or 1; do not generate any
additional text. 0 indicates Text1 is authored by the machine, and 1 indicates Text2 is authored
by the machine.</p>
          <p>Text1: [TEXT_1]
Text2: [TEXT_2]
Response: [LABEL]</p>
          <p>The resulting instruction text detection dataset D_I consists of instruction pairs along with their
source labels. A label of 0 indicates the first text is machine-generated, while a label of 1 indicates the
second text is machine-generated. Thus, the instruction text detection dataset D_I includes pairs along
with their corresponding source labels, formally represented as D_I = {(instruction, x, y) | x ∈ D}.</p>
          <p>Given an LLM with parameters θ as the initial model for instruction tuning, training the model on
the constructed instruction dataset D_I results in adapting the LLM’s parameters from θ to θ_I, referred to
as the LLM-Detector. Specifically, θ_I is obtained by maximizing the probability of predicting the next
tokens in the OUTPUT component of each instruction sample t, conditioned on the INSTRUCTION.
This process is formulated as follows:</p>
          <p>θ_I = arg max_θ ∑_{t ∈ D_I} log P(OUTPUT_t | INSTRUCTION_t; θ)    (1)</p>
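          <p>As a concrete illustration of Equation (1), the sketch below builds a single instruction sample from a text pair and masks the INSTRUCTION tokens so that only the OUTPUT token contributes to the training loss. The tokenizer choice and the helper name are our own illustrative assumptions; the masking value of -100 is the standard ignore index in PyTorch cross-entropy.</p>
          <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

INSTRUCTION = (
    "I provide two texts and ask you to determine which one is authored "
    "by humans and which one is authored by machines. Your output is "
    "simply a 0 or 1; do not generate any additional text. 0 indicates "
    "Text1 is authored by the machine, and 1 indicates Text2 is authored "
    "by the machine."
)

def build_instruction_sample(text1: str, text2: str, label: int) -> dict:
    prompt = f"{INSTRUCTION}\nText1: {text1}\nText2: {text2}\nResponse: "
    prompt_ids = tokenizer(prompt).input_ids
    output_ids = tokenizer(str(label), add_special_tokens=False).input_ids

    input_ids = prompt_ids + output_ids
    # Masking the prompt with -100 means training maximizes
    # log P(OUTPUT | INSTRUCTION; theta), as in Equation (1).
    labels = [-100] * len(prompt_ids) + output_ids
    return {"input_ids": input_ids, "labels": labels}
          </preformat>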
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Supervised Fine-Tuning LLMs</title>
        <p>Fine-tuning LLMs involves adjusting model weights using a labeled dataset to enhance performance on
specific tasks. This process can be computationally intensive, requiring significant memory resources,
particularly when dealing with full LLM fine-tuning due to its substantial memory demands. To address
these challenges, Parameter-Efficient Fine-Tuning (PEFT) [38] techniques such as LoRA [39] and QLoRA
[40] are employed.</p>
        <p>LoRA fine-tunes only two smaller matrices that approximate the larger weight matrix, reducing
memory requirements and preserving the original LLM weights. Taking a step further, QLoRA enhances
memory efficiency by quantizing these smaller matrices to a lower precision, such as 4-bit, without
compromising effectiveness. Employing these fine-tuning techniques for both the classification head
fine-tuning and instruction fine-tuning augments the LLM’s capacity to accurately distinguish between
machine-generated and human-generated text.</p>
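        <p>A minimal sketch of this setup with the Hugging Face peft and bitsandbytes integrations is shown below, using the rank, alpha, and 4-bit settings reported in Section 4.1. The base-model identifier and the choice of target modules are illustrative assumptions, not a record of our exact configuration.</p>
        <preformat>
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# QLoRA: the frozen base weights are quantized to 4-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,  # human- vs. machine-generated
    quantization_config=bnb_config,
)

# Only the low-rank adapters (and the classification head) are trained.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # illustrative choice
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
        </preformat>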
        <p>The Mistral and LLaMA2 models are fine-tuned exclusively using the provided bootstrap dataset and
the QLoRA technique. Each input example consists of a text string ⟨TEXT⟩ and a corresponding label
⟨LABEL⟩ that indicates the source of the text. The input format can be represented as:
⟨TEXT⟩ : ⟨LABEL⟩,
where LABEL = 1 for human-generated text and 0 for machine-generated text.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Inference Time</title>
        <p>During the inference phase, the process initiates by receiving two texts as input. Each text is processed
separately via the fine-tuned LLaMA2 and Mistral models to predict the probability of being
human-written. If the probability assigned to the first text surpasses that of the second text, the score for the
input sample is calculated by subtracting the score of the first text from that of the second. Conversely,
if the probability of the second text is greater, the input text is labeled as 0.</p>
        <p>Additionally, the input is also processed with the Binoculars model, which generates a score for each
text using its specialized algorithm. If the Binoculars score of the first text exceeds that of the second
text, the input score is assigned as 0; otherwise, it is assigned as 1. Figure 2 illustrates BinocularsLLM.</p>
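        <p>Under our reading of the procedure above, the decision logic can be summarized by the sketch below. The aggregation of the three component scores into a single output is deliberately simplified to a plain average; the exact fusion in BinocularsLLM may differ.</p>
        <preformat>
def pairwise_score(p_human_1: float, p_human_2: float) -> float:
    """Map two per-text human probabilities from a fine-tuned LLM to a
    pairwise score that exceeds 0.5 when the first text looks more human."""
    return 0.5 + (p_human_1 - p_human_2) / 2.0

def binoculars_vote(b1: float, b2: float) -> float:
    # 0 if the first text's Binoculars score is higher, otherwise 1.
    return 0.0 if b1 > b2 else 1.0

def ensemble(llama_pair, mistral_pair, binoculars_pair) -> float:
    votes = [
        pairwise_score(*llama_pair),
        pairwise_score(*mistral_pair),
        binoculars_vote(*binoculars_pair),
    ]
    return sum(votes) / len(votes)  # simplified fusion (assumption)
        </preformat>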
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we present the implementation details, evaluation metrics, and provide a comprehensive
analysis of the results. We utilize the TIRA [41] platform to evaluate our framework using test datasets.</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <p>In this research, the framework was implemented in PyTorch and executed on Nvidia V100 GPUs. The
training process was conducted for 5 epochs, utilizing the AdamW optimizer with a learning rate of 2e-5.
The training batch size was set to 2, with gradient accumulation set to 8. For QLoRA, we configured
LoRA’s rank to 64 and its alpha to 16, employing 4-bit quantization. To evaluate fine-tuned models, we
used 20% of the given dataset as a development dataset.</p>
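        <p>With the hyperparameters listed above, a training run with the Hugging Face Trainer might be configured as in the following sketch; the model and dataset variables are assumed to come from Section 3.2 and the 80/20 split described here.</p>
        <preformat>
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="binoculars-llm",
    num_train_epochs=5,
    learning_rate=2e-5,             # the Trainer uses AdamW by default
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # PEFT-wrapped model from Section 3.2
    args=training_args,
    train_dataset=train_dataset,  # 80% of the provided dataset
    eval_dataset=dev_dataset,     # 20% held out for development
)
trainer.train()
        </preformat>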
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metrics</title>
        <p>To evaluate the performance of our proposed model, we used the evaluation metrics provided by PAN,
which include the following metrics:
• ROC-AUC: The conventional area under the curve score.
• C@1: Rewards systems that leave complicated problems unanswered.
• F0.5u: Focuses on deciding same-author cases correctly.
• F1-score: A harmonic way of combining the precision and recall of the model.
• Brier: Evaluates the accuracy of probabilistic predictions.</p>
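        <p>PAN provides an official evaluator for these metrics. As a rough, non-authoritative illustration, the ROC-AUC and Brier components can be approximated with scikit-learn as follows; reporting the Brier score as its complement, so that higher is better, is our assumption based on the score ranges above.</p>
        <preformat>
from sklearn.metrics import brier_score_loss, roc_auc_score

def pan_style_scores(y_true, y_scores):
    """y_true: binary pair labels; y_scores: model outputs in [0, 1]."""
    return {
        "roc_auc": roc_auc_score(y_true, y_scores),
        # Complement of the Brier loss (assumption: higher is better).
        "brier": 1.0 - brier_score_loss(y_true, y_scores),
    }
        </preformat>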
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Result Analysis on Development Dataset</title>
        <p>As mentioned earlier, we compare two fine-tuning approaches for detecting machine-generated text:
instruction fine-tuning and classification-head fine-tuning. The performance of various LLMs under
these methodologies is illustrated in Table 1 using the development dataset. Based on the results from
Table 1, we select the two top-performing LLMs to integrate into our ensemble framework.</p>
        <p>In analyzing the results presented in Table 1, it becomes evident that both the LLaMA2-7B and
Mistral-7B models, fine-tuned with a classification head, demonstrate promising performance across various
evaluation metrics on our development dataset. LLaMA2-7B demonstrates exceptional scores across all
metrics using the classification head fine-tuning approach, showcasing its robustness in distinguishing
between human and machine-generated text. Meanwhile, Mistral-7B also shows notable performance,
indicating its efficacy in authorship verification tasks. These findings show the effectiveness of
employing classification head fine-tuning for both LLaMA2-7B and Mistral-7B within the BinocularsLLM
framework.</p>
        <p>Comparing classification head fine-tuning with instruction tuning, we observe that classification head
fine-tuning yields superior performance. These findings indicate that classification head fine-tuning
is more effective than instruction tuning for enhancing the performance of LLMs in distinguishing
between human and machine-generated text.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results on Blinded Test Dataset</title>
        <p>As Table 2 shows, BinocularsLLM achieved outstanding performance across multiple evaluation metrics
on the PAN 2024 Task 4 (Voight-Kampf Generative AI Authorship Verification) main test dataset,
demonstrating its effectiveness in detecting machine-generated text. With a perfect ROC-AUC score of
1.0 and a Brier score close to 1.0, BinocularsLLM exhibits high discriminative ability and excellent
calibration. Additionally, BinocularsLLM outperforms all baseline approaches in terms of C@1, F1,
and F0.5u scores. The mean evaluation score further underscores the robustness and reliability of the
BinocularsLLM framework in distinguishing between human and machine-generated text.</p>
        <p>Table 3 presents the analysis of BinocularsLLM across nine variants of the test set. The mean accuracy
over these variants provides insights into the generalization capability of different approaches across
diverse datasets. Among the approaches evaluated, the BinocularsLLM framework achieved the highest
mean accuracy, with a median score of 0.990, indicating strong performance across various test variants.
When compared to baseline approaches, BinocularsLLM consistently outperforms them,
showcasing its superior ability to generalize effectively. The performance of baseline approaches varies
significantly across different datasets, as evidenced by the wide range between the minimum and
maximum scores. This suggests that while some approaches exhibit consistent performance across
diverse datasets, others may struggle to generalize effectively. Further analysis of the quantile values
elucidates the distribution of performance scores, highlighting the variability and potential challenges
in achieving consistent accuracy across different test variants.</p>
        <p>PPMd and Unmasking display moderate performance, with median accuracies of 0.750 and 0.696,
respectively. However, their lower quantiles, particularly the minimum and 25th quantile, indicate
significant variability and potential instability in their performance.</p>
        <p>Fast-DetectGPT shows the most variability among the baselines, with a minimum accuracy of 0.159
and a maximum of 0.982. This wide range suggests inconsistency and unreliability in different test
scenarios.</p>
        <p>The comparative analysis presented in Figure 4 illustrates the discernible impact of training on the
Mistral and LLaMA2 models. Before training, both models exhibited limited discriminatory capability
on our development dataset between AI-generated and human-written text, as evidenced by the
overlapping distribution of data points in their respective scatter plots. However, post-training, a
noticeable refinement emerges, with the models demonstrating enhanced proficiency in distinguishing
between the two text categories. The scatter plots after training reveal a clearer separation between
AI-generated and human-written text samples, indicating an improvement in the models’ ability to
capture distinguishing features inherent to each text type.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Leaderboard on Test Datasets</title>
        <p>Our team, MarSan, achieves the top position in the task leaderboard among 30 teams with our
BinocularsLLM framework and demonstrates strong performance across various metrics. Table 4 outlines the
performance metrics of the top 10 teams in the competition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, the BinocularsLLM framework for the PAN 2024 "Voight-Kampf" Generative AI
Authorship Verification task demonstrates significant advancements in detecting machine-generated text.
Through the integration of supervised fine-tuning of LLMs with a classification head and the Binoculars
model, we have achieved outstanding performance, as evidenced by a perfect ROC-AUC score of 1.0 and
a Brier score close to 1.0 on the main test dataset. The BinocularsLLM framework outperforms all baseline
approaches in crucial evaluation metrics, highlighting its robustness and effectiveness in distinguishing
between human and machine-generated content. Looking ahead, the success of our approach opens
up exciting avenues for future research, including exploring more sophisticated ensemble techniques,
investigating the impact of different fine-tuning strategies, and addressing challenges related to
scalability and computational efficiency. By continuing to innovate in this critical area, we can further
advance the field of machine-generated text detection and contribute to enhancing the trustworthiness
and authenticity of digital content.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[14] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping,
T. Goldstein, Spotting llms with binoculars: Zero-shot detection of machine-generated text, 2024.
arXiv:2401.12070.
[15] OpenAI, Gpt-4 technical report, 2023. arXiv:2303.08774.
[16] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay,
N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard,
G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,
K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi,
D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira,
R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei,
K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, Palm: Scaling language modeling with
pathways, 2022. arXiv:2204.02311.
[17] M. Najafi, S. Sadidpur, Paa: Persian author attribution using dense and recursive connection (2024).
[18] E. Tavan, M. Najafi, R. Moradi, Identifying ironic content spreaders on twitter using psychometrics,
contextual and ironic features with gradient boosting classifier., in: CLEF (Working Notes), 2022,
pp. 2687–2697.
[19] M. Najafi, E. Tavan, Text-to-text transformer in authorship verification via stylistic and semantical
analysis., 2022.
[20] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger,
J. W. Kim, S. Kreps, et al., Release strategies and the social impacts of language models, arXiv preprint
arXiv:1908.09203 (2019).
[21] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is chatgpt to human
experts? comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023).
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[24] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, A. Szlam, Real or fake? learning to discriminate
machine from human generated text, 2019. arXiv:1906.03351.
[25] A. Uchendu, T. Le, K. Shu, D. Lee, Authorship attribution for neural text generation, in: B.
Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online,
2020, pp. 8384–8395. URL: https://aclanthology.org/2020.emnlp-main.673. doi:10.18653/v1/
2020.emnlp-main.673.
[26] X. Hu, P.-Y. Chen, T.-Y. Ho, Radar: Robust ai-text detection via adversarial learning, 2023.
arXiv:2307.03838.
[27] Y. Tian, H. Chen, X. Wang, Z. Bai, Q. Zhang, R. Li, C. Xu, Y. Wang, Multiscale positive-unlabeled
detection of ai-generated texts, arXiv preprint arXiv:2305.18149 (2023).
[28] W. Liang, M. Yuksekgonul, Y. Mao, E. Wu, J. Zou, Gpt detectors are biased against non-native
english writers, 2023. arXiv:2304.02819.
[29] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, Detectgpt: Zero-shot machine-generated
text detection using probability curvature, 2023. arXiv:2301.11305.
[30] S. Gehrmann, H. Strobelt, A. Rush, GLTR: Statistical detection and visualization of generated
text, in: M. R. Costa-jussà, E. Alfonseca (Eds.), Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, Association for Computational
Linguistics, Florence, Italy, 2019, pp. 111–116. URL: https://aclanthology.org/P19-3019. doi:10.
18653/v1/P19-3019.
[31] J. Su, T. Y. Zhuo, D. Wang, P. Nakov, Detectllm: Leveraging log rank information for zero-shot
detection of machine-generated text, arXiv preprint arXiv:2306.05540 (2023).
[32] A. Grinbaum, L. Adomaitis, The ethical need for watermarks in machine-generated language, 2022.
arXiv:2209.03118.
[33] S. Abdelnabi, M. Fritz, Adversarial watermarking transformer: Towards tracing text provenance
with data hiding, 2021. arXiv:2009.03015.
[34] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein, A watermark for large language
models, 2024. arXiv:2301.10226.
[35] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al.,
The flan collection: Designing data and methods for effective instruction tuning, in: International
Conference on Machine Learning, PMLR, 2023, pp. 22631–22648.
[36] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
[37] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh,
M. Lewis, L. Zettlemoyer, O. Levy, Lima: Less is more for alignment, 2023. arXiv:2305.11206.
[38] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, Peft: State-of-the-art
parameter-efficient fine-tuning methods, https://github.com/huggingface/peft, 2022.
[39] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, 2021. arXiv:2106.09685.
[40] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized
llms, 2023. arXiv:2305.14314.
[41] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/
978-3-031-28241-6_20.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <year>2020</year>
          . arXiv:2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esiobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kerkez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kloumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korenev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Koura</surname>
          </string-name>
          , M.-
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lavril</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Liskovich</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Martinet</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Mihaylov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mishra</surname>
            , I. Molybog,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Poulton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Reizenstein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Rungta</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Saladi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Schelten</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>X. E.</given-names>
          </string-name>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J. X.</given-names>
          </string-name>
          <string-name>
            <surname>Kuan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
            , I. Zarov,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Kambadur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Narang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stojnic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          , M.-
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Stock</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          <string-name>
            <surname>Scao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lavril</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <string-name>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7b,
          <year>2023</year>
          . arXiv:2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdul-Mageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          ,
          <article-title>Automatic detection of machine generated text: A critical survey</article-title>
          , arXiv preprint arXiv:2011.01314 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lu</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Large language models can be guided to evade ai-generated text detection</article-title>
          ,
          <source>arXiv preprint arXiv:2305.10847</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bill</surname>
          </string-name>
          , T. Eriksson,
          <article-title>Fine-tuning a llm using reinforcement learning from human feedback for a therapy chatbot application</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moslem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Way,</surname>
          </string-name>
          <article-title>Adaptive machine translation with large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2301.13294.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nejjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zacharias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stiehle</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <article-title>Llms for science: Usage for code generation and data analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2311.16733</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stakovskii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          , E. Zangerle,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
          </string-name>
          , G. Quénot,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>