A Machine-Generated Text Detection Model Based on Text Multi-Feature Fusion Notebook for the PAN Lab at CLEF 2024 Mingcan Guo, Zhongyuan Han* , Haoyang Chen and Jiangao Peng Foshan University, Foshan, China Abstract In the current wave of rapid technological advancement, many cutting-edge large language models (LLMs) have emerged, such as GPT-4 and Llama. However, the ability of these LLMs to generate smooth and coherent text has led to concerns about potential misuse. Therefore, in practical applications, the ability to distinguish them from texts created by human hands becomes especially crucial. A model for detecting machine-generated text is proposed through the PAN Task 4 - Voight-Kampff Generative AI Authorship Verification task. The model emphasizes the extraction of additional semantic information from the text. Various pre-trained language models (PLMs) were employed in the experiments, and the incorporation of multiple text features, such as word frequency and perplexity, was explored to enhance the outcomes. Ultimately, four runs were submitted, with the highest-performing approach attaining a mean score of 0.884 across all test datasets. Keywords LLMs, machine-generated text, text features, word frequency, perplexity 1. Introduction With the advent of the AI era, the content on the internet is increasingly being infiltrated by machine- generated text. The emergence of powerful LLMs like ChatGPT and its derivative versions has made the creation of high-quality text more accessible than ever before. However, this has also posed a challenging problem: How can we differentiate between machine-generated text and human creativity? This concerning issue is troubling people in various domains, such as conspiracy theories[1], plagiarism concerns[2], political biases[3], and misinformation[4]. Detecting machine-generated text is an urgent need to address the social issues arising from the misuse of these large models. To tackle the challenges LLMs pose, PAN 2024[5] has intensely focused on authorship verification tasks based on LLMs. Among them, the Voight-Kampff Generative AI Authorship Verification 2024 task[6] is part of the tasks under the Conference and Labs of the Evaluation Forum (CLEF 2024). The builder-breaker collaboration combines the Generative AI Authorship Verification task with the Voight- Kampff task from the ELOQUENT Lab. The primary focus of the task is to comprehend and detect human and machine-generated text. Participants must build automated systems that accurately attribute the text to human authors by distinguishing between human and machine-generated text. The primary emphasis of this study is on extracting valuable information from the original text. Drawing inspiration from the work of Davies[7] and Przybyla et al.[8], language modeling is investigated from two perspectives: text perplexity and word frequency. Furthermore, multiple PLMs were fine-tuned on an augmented dataset, and text features extracted from the modeling process were incorporated into a hybrid model. The hybrid models were fine-tuned independently and were subsequently assigned weights and ensembled to generate the final results. CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France * Corresponding author. $ gmc9812@163.com (M. Guo); hanzhongyuan@gmail.com (Z. Han); hoyo.chen.i@gmail.com (H. Chen); wyd1n910@gmail.com (J. Peng)  0000-0002-4977-2138 (M. Guo); 0000-0001-8960-9872 (Z. Han); 0000-0003-3223-9086 (H. Chen); 0009-0006-3780-5023 (J. Peng) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2. Related Work With the advent of Language Models, detecting machine-generated text has become a popular research direction[9, 10]. Existing detection methods can generally be classified into three categories: probability- based detection methods, machine learning-based detection methods, and deep learning-based detection methods. Probability-based methods typically do not require training samples. Gehrmann et al.[11] designed a statistical tool called GLTR, highlighting the distribution differences between generated text and human-written text using different colors. GLTR detects the number of high, medium, and low probability words and presents the results in a visualized format. Mitchell et al.[12] proposed DetectGPT, which calculates log probabilities by perturbing text and comparing the changes between the original and perturbed text. It assumes that machine-generated text tends to lie within the negative log probability curve, while human-written text exhibits higher or lower probabilities compared to perturbed text. Machine learning-based methods, on the other hand, typically employ classical machine learning models. These methods often have fewer parameters and can be easily deployed. For instance, Solaiman et al.[13] applied the TF-IDF method to design a regression classifier that utilizes the Top-K sampling strategy to identify machine-generated text. Fröhling et al.[14] proposed a method that employs advanced features to simulate text’s experience, syntax, and semantics. They utilized simple classification models for detection, achieving performance comparable to mainstream deep learning methods. Deep learning-based detection methods have gained widespread application in recent research and often outperform the previous two categories[15]. The emergence of the Transformer architecture has provided many advanced methods for deep learning. For example, Chen et al.[16] employed RoBERTa and T5 architectures to design a feature extraction and discrimination process between human-written text and text generated by ChatGPT. They trained two text classification models for text detection. Gambini et al.[17] demonstrated that fine-tuning XLNET on tweets written by humans, GPT-2, and earlier generation techniques, such as Markov chains and RNNs, can effectively identify GPT-3-generated tweets with an accuracy of 82.1%. It shows that fine-tuning XLNET on a mixed dataset can be a successful method for detecting machine-generated text. It has been shown through practical applications that incorporating text features into PLMs is an effective method[18, 19, 20]. Some researchers have observed that incorporating features such as word embeddings[21], predictability[19], linguistic inquiry and word count[22], and other linguistic features can yield text encodings that contain high-quality semantic information. This feature integration enhances the models’ representation power and improves their performance in text detection tasks. The present study derives inspiration from the methods above and employs a similar approach by incorporating multiple features with text encoding to improve the model’s performance. 3. Methods This section describes the model employed for automatic differentiation between human and machine- generated text. It includes an overview of the explored fusion features and the final ensemble architecture utilized. The architecture of the model is depicted in Figure 1. A PLM encodes the text and derives a pooled output for text representation. To capture features, an LSTM is utilized for language modeling [23, 24], generating text-based feature vectors. The pooled vectors and feature vectors are concatenated and inputted into a linear layer to obtain fusion logits. Subsequently, the softmax function is applied to derive the output probabilities. Alternatively, the pooled vector can be directly passed through a linear layer to yield PLM logits, which are then softmaxed to obtain the output probabilities. Hybrid model Pooling output PLM logits Output probabilities (w/o feature) direct output Pooling softmax Input text PLM layer Fusion features incorporate Feature extraction module feature Linear Fusion logits Output probabilities (feature) Word frequency extraction softmax LSTM Perplexity extraction Feature vector Figure 1: The model architecture includes two versions of outputs: with features and without features. 3.1. Word Frequency Extraction The inclusion of word frequency features is considered in the text encoding process. The word frequency corpus used in this work is sourced from the English dataset of the Google Book Corpus Ngrams1 . This dataset encompasses textual data from millions of books published between 1500 and 2008, providing a comprehensive resource for word frequency information. The data is presented as "n-grams," consecutive sequences of n words. It can be utilized to study the distribution of token frequencies in machine- generated text compared to human-authored text. Specifically, a word frequency dictionary is created using the Google Book Corpus, where each word group is mapped to its corresponding frequency of occurrence. Then, the tokenizer will be used to segment the input text. Look up the dictionary for each word group 𝑤 to get its counts 𝑡𝑓 (𝑤). If it does not exist in the dictionary, take 1. Finally, take the log of 𝑡𝑓 (𝑤). The process is expressed as Equation 1. Freq = log (max (1, 𝑡𝑓 (𝑤))) (1) When encountering a lengthy word group comprising multiple words, the complete count is assigned to each word within the group. Ultimately, the word frequency feature representation of the text is obtained through the LSTM. 3.2. Perplexity Extraction Perplexity is used to evaluate the predictive ability of a set of sample data. It measures the uncertainty or confusion level of the given data. Research by Tang et al. [10] has shown that language models tend to focus on common patterns in the text they are trained on, resulting in lower perplexity scores for text generated by LLMs. In contrast, human authors can express themselves in multiple styles, making it more challenging for language models to make accurate predictions and resulting in higher perplexity values for text created by humans. GPT2 is considered the underlying generative model for perplexity features since most LLMs are based on the same Transformer architecture. GPT2 tends to assign low perplexity to common text and higher perplexity to text with varied styles. Perplexity can be measured by the entropy of the probability distribution. For a given sequence of n tokens 𝑡𝑖 , the probability distribution entropy 𝐻(𝑡) can be represented as Equation 2, where p represents the probability of generating the token corresponding to the vocabulary, and eps is the deviation term. 𝑛 ∑︁ 𝐻(𝑡) = − 𝑝 (𝑡𝑖 ) log2 (𝑝(𝑡𝑖 ) + eps) (2) 𝑖=1 1 https://storage.googleapis.com/books/ngrams/books/datasetsv3.html In addition, for a given 𝑡𝑖 , 𝑡𝑖+1 represents the next token immediately following 𝑡𝑖 , 𝑡′𝑖 represents the GPT2-predicted token for 𝑡𝑖 (the token with the highest probability in the vocabulary). Therefore, when considering the token probabilities across the entire vocabulary in GPT2, the logarithmic probabilities of the occurrence of the succeeding token 𝑡𝑖+1 and the predicted token 𝑡𝑖 in the vocabulary are also taken into account. As shown in Equation 3 and Equation 4, these measures assess the probability of the context token occurring and the model’s confidence in predicting the token, respectively. 𝐿(𝑡) = 𝑙𝑜𝑔𝑝 (𝑡𝑖+1 ) (3) 𝐶(𝑡) = 𝑙𝑜𝑔𝑝 𝑡′𝑖 (4) (︀ )︀ The number of context probability and prediction confidence values is n-1. Missing values are filled with 0 to align with the number of entropy values for the probability distribution. Finally, these features are concatenated together, and the LSTM network is used to obtain the perplexity feature representation of the text. 3.3. Hybrid Model Due to the excellent performance of Transformer-based PLMs in downstream classification tasks [15], three different variants of PLMs are explored: BERT, BERT-Large, and Roberta-Large. The [CLS] token representation with a length of either 768 or 1024 is utilized, and the output is extracted from the pooling layer. The primary focus is on incorporating text features. As shown in Figure 1, the hybrid model designed includes LSTM and PLM: • Text features are extracted using the word frequency and perplexity extraction introduced in the previous two sections and concatenated. • The concatenated features are fed into the LSTM for encoding, and the hidden layer’s state values are used as the feature vector. • The feature vector is concatenated with the PLM pooled output to obtain fusion features, which are then passed through a linear classification layer. During the training phase, the hybrid model’s logits are directly outputted as the binary classification scores for each text after applying the softmax function. Since the provided samples are text pairs during the prediction phase, potential prediction biases caused by individual texts are minimized by predicting the probabilities for text1 and text2 separately. An ensemble of two models in Figure 2 is considered. For instance, two hybrid models with PLM using BERT-Large and Roberta-Large, respectively, are fine-tuned. The corresponding logits for text1 and text2 are obtained from the models and weighted. The final logits for each text are calculated as the weighted sum, as depicted in Equation 5, where 𝜆 represents the weight coefficient. logitsfinal = 𝜆 · logitsBERT_Large + (1 − 𝜆) · logitsRoberta_Large (5) Finally, the final logits of text1 and text2 are averaged after applying the softmax function. Specifically, the logits of text1 and (1 - logits) of text2 are added together and averaged. This results in the output probabilities during the prediction phase. 4. Experiments and Results 4.1. Data Preprocessing The official train dataset is derived from a collection of real and fake news articles from multiple news headlines in the United States in 2021. On the other hand, the test dataset includes various types of texts, such as news articles, Wikipedia summaries, or fan fiction. The training set consists of one portion Logits of text1 weighted sum Hybrid model softmax (BERT-Large) Text1 Output probabilities Text1 Text2 weighted sum Text2 Hybrid model (Roberta-Large) softmax Logits of text2 Figure 2: During the prediction phase, the logits of two hybrid models, each based on a different PLM, are ensemble combined to generate the final output. of human-authored data and 13 portions of data generated by different LLMs, each corresponding to a human sample topic. Each text data portion contains 1,087 samples and is provided in a dictionary format separated by a set of newline characters, such as {"id": "...", "text": "..."}. The test set will be provided in a different format, where each line contains a pair of texts, such as {"id": "...", "text1": "...", "text2": "...}. One of the texts is human-authored, while the other is machine-generated. Figure 3 displays the text lengths and their corresponding probability density distributions plotted as KDE graphs for each data portion. Most text lengths are distributed between 300 and 500, implying that when utilizing PLMs, the information loss resulting from truncating excessively long texts does not need to be excessively concerned with. Figure 3: The length of texts generated by humans and different LLMs and their corresponding probability density distribution. Considering the imbalance between human-authored and machine-generated texts (1:13), data augmentation techniques are employed to expand the original training set, ensuring the quality of the model’s learning. Expressly, a benchmark dataset2 proposed by Sarvazyan et al.[25], similar in the domain to the provided human dataset and consisting of Wikipedia and news articles, is referred to. Since the LLMs used for machine-generated data differ from the training set, it was chosen not to utilize 2 https://huggingface.co/datasets/symanto/autextification2023 the machine-generated text portion of this benchmark dataset. Instead, only human-authored data was extracted to augment and align the dataset with the training set. A pre-trained sentence-transformers model3 , capable of tasks such as similarity judgment or semantic search, was employed. Similarity comparisons were conducted between each text in the training set and the reference dataset. By encoding the texts using the sentence-transformers model, cosine similarity was calculated, and the top 12 texts with the highest similarity scores were selected. In the end, 13,044 similar texts were extracted for expansion. 4.2. Experimental Setting The training set was re-divided into train and test sets using a 7:3 ratio. BERT, BERT-Large, and Roberta-Large were utilized as the base PLMs. The model was developed based on the PaddlePaddle framework for the framework and parameter selection. The batch size was 64, the max length was 512, and the learning rate was 2e-5. The model was trained for 10 epochs using the AdamW optimizer on an environment with an A800 GPU. Additionally, the method introduces two parameters, 𝜆, and 𝑖𝑠_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒, which specify the ensemble models’ weight proportion and decide whether the hybrid model outputs fusion logits or PLM logits. It provides us with the flexibility to generate different approaches. The approach’s effectiveness was validated by running four different approaches on the final test set of the TIRA platform [26]. Their introductions are as follows: • rapid-pole: As a baseline for comparison, using only the prediction results from BERT, without including any word frequency or perplexity feature, 𝑖𝑠_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒 set to False. • savory-plate: Similar to rapid-pole, but employs the Bert-Large model without any word frequency or perplexity feature, 𝑖𝑠_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒 set to False. • lazy-iteration: Ensembles Bert-Large and Roberta-Large hybrid models, with the weighting coefficient 𝜆 in Equation 5 set to 0.9, 𝑖𝑠_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒 set to True. • gritty-producer: Similar to lazy-iteration, but with the weighting coefficient 𝜆 in Equation 5 set to 0.1, 𝑖𝑠_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒 set to True. 4.3. Metrics and Baselines This section discusses the official metrics and baselines used in the task. The metrics employed encompass six different dimensions: • ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve. • Brier: The complement of the Brier score (mean squared loss). • C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of the remaining cases. • F1: The harmonic mean of precision and recall. • F0.5u: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score = 0.5) as false negatives. • The arithmetic mean of all the metrics above. The LLM detection baseline includes seven implementations: PPMd Compression-based Cosine (PPMd CBC), Authorship Unmasking, Binoculars, DetectLLM LRR and NPR, DetectGPT, Fast-DetectGPT, and Text length. PPMd CBC and Authorship Unmasking utilize a bag-of-words model, while Binoculars, DetectLLM, and DetectGPT employ LLMs to measure text perplexity. Text length serves as a randomness indicator for data integrity checks. The results of the partial baseline implementations can be found in the section 4.4. 3 https://huggingface.co/annakotarba/sentence-similarity 4.4. Results The external results of the model can be seen in Table 1 and 2. Table 1 shows the results that the official baselines provided by the PAN organizers and summary statistics of all submissions to the task (i.e., the maximum, median, minimum, and 95-th, 75-th, and 25-th percentiles over all submissions to the task). Table 2 shows the summarized results averaged (arithmetic mean) over 10 variables of the test dataset. The best-performing approach, "gritty-producer," achieved an average score of 0.966, surpassing all baseline methods. In the quantile results, the scores of participating teams are arranged in ascending order. This approach falls within the range of 75% to 95%. As expected, the performance of approaches that incorporate features (lazy-iteration and gritty- producer) outperforms the individual PLM models (rapid-pole and savory-plate) on both datasets. The results show that the scheme focusing on using Roberta_Large probability with language features (gritty-producer) is better than the scheme focusing on using Bert_Large probability with language features (lazy-iteration), which shows that the choice of PLM is also important. The model with more advanced pre-training skills is usually better for this use case. This task uses PLMs to maintain a lead over almost all advanced baselines, including DetectLLM and DetectGPT using LLM, so introducing LLM to achieve the task was not explored. Table 1 Overview of the accuracy in detecting if a text is written by an human in task 4 on PAN 2024 (Voight-Kampff Generative AI Authorship Verification). We report ROC-AUC, Brier, C@1, F1 , F0.5𝑢 and their mean. Approach ROC-AUC Brier C@1 F1 F0.5𝑢 Mean rapid-pole 0.978 0.935 0.954 0.908 0.918 0.939 savory-plate 0.947 0.899 0.913 0.888 0.891 0.908 lazy-iteration 0.979 0.925 0.955 0.948 0.948 0.951 gritty-producer 0.989 0.949 0.965 0.963 0.962 0.966 Baseline Binoculars 0.972 0.957 0.966 0.964 0.965 0.965 Baseline Fast-DetectGPT (Mistral) 0.876 0.8 0.886 0.883 0.883 0.866 Baseline PPMd 0.795 0.798 0.754 0.753 0.749 0.77 Baseline Unmasking 0.697 0.774 0.691 0.658 0.666 0.697 Baseline Fast-DetectGPT 0.668 0.776 0.695 0.69 0.691 0.704 95-th quantile 0.994 0.987 0.989 0.989 0.989 0.990 75-th quantile 0.969 0.925 0.950 0.933 0.939 0.941 Median 0.909 0.890 0.887 0.871 0.867 0.889 25-th quantile 0.701 0.768 0.683 0.657 0.670 0.689 Min 0.131 0.265 0.005 0.006 0.007 0.224 Table 3 showcases the final results achieved by the model in the task. The individual effectiveness scores are aggregated across all test datasets and corrected by half a standard deviation to penalize unstable classification performance. The "gritty-producer" approach achieved a mean score of 0.884 and ranked fourth among 30 teams. 5. Conclusion This paper proposes a text-based multi-feature fusion hybrid model for addressing the Voight- Kampff Generative AI Authorship Verification 2024 task. Experiments were conducted using various Transformer-based PLMs, and detailed insights into different feature extraction methods are provided. These methods effectively enhance the model’s performance, resulting in the task’s mean score of 0.884. For future work, improving the selection of hyperparameters, such as the output weights for each model, can be focused on, where grid search techniques can help identify better values. Additionally, further exploration can be done by incorporating additional text features to enhance the final output results. Table 2 Overview of the mean accuracy over 9 variants of the test set. We report the minumum, median, the maximum, the 25-th, and the 75-th quantile, of the mean per the 9 datasets. Approach Minimum 25-th Quantile Median 75-th Quantile Max rapid-pole 0.540 0.711 0.936 0.950 0.984 savory-plate 0.386 0.732 0.908 0.950 0.977 lazy-iteration 0.659 0.849 0.951 0.977 0.990 gritty-producer 0.743 0.959 0.966 0.990 0.996 Baseline Binoculars 0.342 0.818 0.844 0.965 0.996 Baseline Fast-DetectGPT (Mistral) 0.095 0.793 0.842 0.931 0.958 Baseline PPMd 0.270 0.546 0.750 0.770 0.863 Baseline Unmasking 0.250 0.662 0.696 0.697 0.762 Baseline Fast-DetectGPT 0.159 0.579 0.704 0.719 0.982 95-th quantile 0.863 0.971 0.978 0.990 1.000 75-th quantile 0.758 0.865 0.933 0.959 0.991 Median 0.605 0.645 0.875 0.889 0.936 25-th quantile 0.353 0.496 0.658 0.675 0.711 Min 0.015 0.038 0.231 0.244 0.252 Table 3 Final ranking results Rank Team Approach ROC-AUC Brier C@1 F1 F0.5𝑢 Mean 1 marsan staff-trunk 0.961 0.928 0.912 0.884 0.932 0.924 2 you-shun-you-de charitable-mole_v3 0.931 0.926 0.928 0.905 0.913 0.921 3 baselineavengers svm 0.925 0.869 0.882 0.875 0.869 0.886 4 g-fosunlpteam gritty-producer 0.889 0.875 0.887 0.884 0.884 0.884 (26 more) Acknowledgments This work is supported by the Social Science Foundation of Guangdong Province, China (No.GD24CZY02) References [1] S. Levy, M. Saxon, W. Y. Wang, Investigating memorization of conspiracy theories in text generation, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4718– 4729. [2] N. Dehouche, Plagiarism in the age of massive generative pre-trained transformers (gpt-3), Ethics in Science and Environmental Politics 21 (2021) 17–23. [3] D. Rozado, The political biases of chatgpt, Social Sciences 12 (2023) 148. [4] G. Spitale, N. Biller-Andorno, F. Germani, Ai model gpt-3 (dis) informs us better than humans, Science Advances 9 (2023) eadh1850. [5] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024. [6] J. Bevendorff, M. Wiegmann, J. Karlgren, L. Dürlich, E. Gogoulou, A. Talman, E. Stamatatos, M. Potthast, B. Stein, Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024. [7] M. Davies, Making google books n-grams useful for a wide range of research on language change, International Journal of Corpus Linguistics 19 (2014) 401–416. [8] P. Przybyła, Detecting bot accounts on twitter by measuring message predictability, 2019. [9] E. Crothers, N. Japkowicz, H. L. Viktor, Machine-generated text: A comprehensive survey of threat models and detection methods, IEEE Access (2023). [10] R. Tang, Y.-N. Chuang, X. Hu, The science of detecting llm-generated text, Communications of the ACM 67 (2024) 50–59. [11] S. Gehrmann, H. Strobelt, A. M. Rush, Gltr: Statistical detection and visualization of generated text, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2019, pp. 111–116. [12] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, Detectgpt: Zero-shot machine-generated text detection using probability curvature, in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962. [13] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, J. Wang, Release strategies and the social impacts of language models (2019). [14] L. Fröhling, A. Zubiaga, Feature-based detection of automated language models: tackling gpt-2, gpt-3 and grover, PeerJ Computer Science 7 (2021) e443. [15] D. Macko, R. Moro, A. Uchendu, J. Lucas, M. Yamashita, M. Pikuliak, I. Srba, T. Le, D. Lee, J. Simko, et al., Multitude: Large-scale multilingual machine-generated text detection benchmark, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9960–9987. [16] Y. Chen, H. Kang, V. Zhai, L. Li, R. Singh, B. Raj, Gpt-sentinel: Distinguishing human and chatgpt generated content, arXiv preprint arXiv:2305.07969 (2023). [17] M. Gambini, T. Fagni, F. Falchi, M. Tesconi, On pushing deepfake tweet detection capabilities to the limits, in: Proceedings of the 14th ACM Web Science Conference 2022, 2022, pp. 154–163. [18] A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, X. Zhu, Proceedings of the 15th international workshop on semantic evaluation (semeval-2021), in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021. [19] P. Przybyła, N. Duran-Silva, S. Egea-Gómez, I’ve seen things you machines wouldn’t believe: Measuring content predictability to identify automatically-generated text, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023). CEUR Workshop Proceedings, CEUR-WS, Jaén, Spain, 2023. [20] P. Fivez, W. Daelemans, T. Van de Cruys, Y. Kashnitsky, S. Chamezopoulos, H. Mohammadi, A. Giachanou, A. Bagheri, W. Poelman, J. Vladika, et al., The clin33 shared task on the detection of text generated by large language models, Computational Linguistics in the Netherlands Journal 13 (2024) 233–259. [21] E. Ferracane, S. Wang, R. Mooney, Leveraging discourse information effectively for authorship attribution, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 584–593. [22] A. Uchendu, T. Le, K. Shu, D. Lee, Authorship attribution for neural text generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8384–8395. [23] M. Lippi, M. A. Montemurro, M. Degli Esposti, G. Cristadoro, Natural language statistical features of lstm-generated texts, IEEE Transactions on Neural Networks and Learning Systems 30 (2019) 3326–3337. [24] M. Sundermeyer, R. Schlüter, H. Ney, Lstm neural networks for language modeling., in: Interspeech, volume 2012, 2012, pp. 194–197. [25] A. Mikael Sarvazyan, J. Ángel González, M. Franco-Salvador, F. Rangel, B. Chulvi, P. Rosso, Overview of autextiflcation at iberlef 2023: Detection and attribution of machine-generated text in multiple domains., Procesamiento del Lenguaje Natural 71 (2023). [26] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/ 978-3-031-28241-6_20.