<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generative AI Text Classification using Ensemble LLM Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harika Abburi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Suesserman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nirmala Pudota</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balaji Veeramani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Bowen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanmitra Bhattacharya</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deloitte</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Touche LLP</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deloitte &amp; Touche Assurance &amp; Enterprise Risk Services India Private Limited</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs) have shown impressive performance across a variety of Artificial Intelligence (AI) and natural language processing tasks, such as content creation, report generation, etc. However, unregulated malign application of these models can create undesirable consequences such as generation of fake news, plagiarism, etc. As a result, accurate detection of AI-generated language can be crucial in responsible usage of LLMs. In this work, we explore 1) whether a certain body of text is AI generated or written by human, and 2) attribution of a specific language model in generating a body of text. Texts in both English and Spanish are considered. The datasets used in this study are provided as part of the Automated Text Identification (AuTexTification) shared task. For each of the research objectives stated above, we propose an ensemble neural model that generates probabilities from diferent pre-trained LLMs which are used as features to a Traditional Machine Learning (TML) classifier following it. For the first task of distinguishing between AI and human generated text, our model ranked in fith and thirteenth place (with macro  1 scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, our model ranked in first place with macro  1 scores of 0.625 and 0.653 for English and Spanish texts, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>generative AI</kwd>
        <kwd>text classification</kwd>
        <kwd>large language models</kwd>
        <kwd>ensemble</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Rapid advances in the capabilities of LLMs, and their ease of use in generating sophisticated
and coherent content is leading to the production of AI-generated content at scale. Some of the
applications of LLMs are in AI-assisted writing [1], medical question answering [
        <xref ref-type="bibr" rid="ref1">2, 3</xref>
        ] and a
wide range of tasks in the financial services industry [ 4] and legal domain [5]. Foundational
models such as OpenAI’s GPT-3 [6], Meta’s OPT [7], and Big Science’s BLOOM [8] can generate
such sophisticated content with basic text prompts that it is often challenging to manually
discern between human and AI-generated text. While these models demonstrate the ability to
understand the context and generate coherent human-like responses, they do not have a true
understanding of what they are producing [9, 10]. This could potentially lead to adverse
consequences when used in downstream applications. For example, consider an application of a LLM
used for summarizing a medicinal drug data-sheet that inadvertently produces wrong dosage
information. Generating plausible but false content (referred to as hallucination [11, 12]), may
inadvertently help propagate misinformation, false narratives, fake news, and spam. Given the
pace of LLM adoption by the general public, and the rate of dissemination of information across
the globe, widespread misinformation propagation is an imminent risk that both individuals
and organizations will have to deal with in the near future [13, 14]. An understanding of the
source of the authored content – whether by an AI system or a human – would allow one to
appropriately use the content in downstream applications with suitable oversight. In the case
of AI-generated content, the knowledge of the source LLM would allow one to watch out for
potential known biases and limitations associated with that model. Given the risks associated
with unchecked and unregulated adoption of generative AI models, it is imperative to develop
approaches to detect AI-generated content, and identify the source of the AI-generated content.
      </p>
      <p>Motivated by these challenges, automatic detection of AI-generated text has become an active
area of research. Recent work such as DetectGPT [15] generates minor perturbations of a
passage using a generic pre-trained Text-to-Text Transfer Transformer (T5) model, and then
compares the log probability of the original sample with each perturbed sample to determine
if it is AI generated. Another approach [16] developed a model for a generative multi-class
AI detection challenge involving Russian language text using Decoding- enhanced BERT with
disentangled attention (DeBERTa) as a pre-trained language model for classification and an
ensemble approach is proposed by [17] for binary classification. Mitrovic et al. [18] developed a
ifne-tuned transformer-based approach to distinguish between human and ChatGTP generated
text, with addition of SHapely Additive exPlanations (SHAP) values for model explainability.
Statistical detection methods have also been applied for detection of generative AI text, such as
the Giant Language model Test Room (GLTR) approach developed by researchers [19]. Research
involving repeated higher-order n-grams found that they occur more often in AI generated
text as compared to human generated text [20]. Using an ensemble of classifiers, higher-order
n-grams can be used to help distinguish between human and AI generated text.</p>
      <p>To boost this area of research further, the Automated Text Identification (AuTexTification)
[21] shared task, part of IberLEF 2023 [22], put forth two tasks, each covering both English and
Spanish language texts: 1. Human or generated: determine whether a given input text has been
automatically generated or not – a binary classification task with two classes ‘human’ or ‘AI’
2. Model attribution: for automatically generated text, determine which one of six diferent
text generation AI models was the source – a multi-class classification task with six classes,
where each class represents a text generation model. For each of these tasks, we propose an
ensemble classifier, where the probabilities generated from various state-of-the-art LLMs are
used as input feature vectors to TML models to produce final predictions. Our experiments
show multiple instances of the proposed framework outperforming several baselines using
established evaluation metrics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The data for all the subtasks and languages, is provided by the AuTexTification shared task
organizers [21]. For Task 1, Human or generated, the data consists of texts from five domains.
During the training phase of the shared task the domains are not disclosed to the participants.
For Task 1, for both English and Spanish, data from three domains is provided as training data,
and data from two new domains is provided in the test set. For Task 2, Model attribution, also
the data comes from five diferent domains, but the same domains are in train and test splits.
This data is evenly distributed into six class (A, B, C, D, E, and F), where each class represents a
text generation model. An interesting point to note here is that the six text generation models
are of increasing number of neural parameters, ranging from 2B to 175B. The motivation here
is to emulate realistic AI text detection approaches which should be versatile enough to detect a
diverse set of text generation models and writing styles. More details about the data can be
found in the AuTextification overview paper [ 21].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Ensemble Approach</title>
      <p>In this section, we describe our approaches for detection and classification of generative AI text.</p>
      <sec id="sec-3-1">
        <title>3.1. Models</title>
        <p>We explored various state-of-the-art large language models [23] such as Bidirectional Encoder
Representations from Transformers (BERT), DeBERTa, Robustly optimized BERT approach
(RoBERTa), and cross-lingual language model RoBERTa (XLM-RoBERTa) along with their
variants. Since the datasets are diferent for each task and language, and the same set of
the models will not fit across them, we fine-tuned diferent models for diferent subtasks
and languages and pick the best models based on validation data. Table 1 lists the diferent
models that we explored for the diferent subtasks: Task 1 in English (Binary-English), Task
1 in Spanish (Binary-Spanish), Task 2 in English (Multiclass-English) and Task 2 in Spanish
(Multiclass-Spanish). We also explored diferent TML models such as Linear SVC [ 24],
ErrorCorrecting Output Codes (ECOC) [25], OneVsRest [26] and Voting classifier which includes
Logistic Regression (LR), Random Forest (RF), Gaussian Naive Bayes (NB), Support Vector
machines (SVC) [27] in our approach and the best model for each subtask is shown in results
section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed Ensemble Approach</title>
        <p>Each input text is passed through variants of the pre-trained large language models such as
DeBERTa (D), XLM-RoBERTa (X), RoBERTa (R), BERT (B), etc. During the model training phase,
these models are fine-tuned on the training data. For inference and testing, each of these models
can independently generate classification probabilities (P), namely   ,   ,   ,   , etc. These
probabilities are concatenated (  ) or averaged (  ), and the output is passed as a feature vector
to train TML models to produce final predictions.</p>
        <p>Task Large language models
Binary- deberta-large [28], xlm-r-100langs-bert-base-nli-stsb-mean-tokens [29],
English roberta-base-openai-detector [30], xlm-roberta-large-xnli-anli, roberta-large
Binary- bertin-roberta-base-spanish [31], MarIA [32], sentence_similarity_spanish_es,
Spanish xlm-roberta-large-xnli-anli, xlm-roberta-large-finetuned-conll02-spanish [33]
Multiclass- xlm-roberta-large-finetuned-conll03-english, scibert_scivocab_cased [34],
English deberta-base, roberta-large [35], longformer-base-4096 [36],
bert-largeuncased-whole-word-masking-finetuned-squad [37]
Multiclass- xlm-roberta-large-finetuned-conll03-english, MarIA,
sentence_similarSpanish ity_spanish_es, bert-base-multilingual-cased-finetuned-conll03-spanish,
roberta-large</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section provides the experimental evaluation of the proposed methods. For both the tasks
we report results on accuracy ( ), macro F1 score (  ), precision (   ) and recall ( ).</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <p>We set aside 20% from the training data for validation. For the testing phase, the validation set is
merged with the training set. All the results are reported on testing data. The hyper-parameters
are used for model fine-tuning are shown in Table 2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>For each task and language, we submitted three runs to the leaderboard (team name Drocks).
These runs correspond to the most promising approaches on the validation data. In this paper,
we show only the top run results for each of the tasks. The complete leaderboard is available
at https://sites.google.com/view/autextification/results [21]. The results on test data for
BinaryEnglish and Binary-Spanish are shown in Tables 3 and 4, respectively. With the concatenated
feature vector (  ) as an input, a voting classifier results in   score of 73.3 on Binary-English</p>
        <sec id="sec-4-2-1">
          <title>Ensemble with OneVsRest classifier (  as a input feature ) 0.704</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Ensemble with ECOC classifier (  as a input feature )</title>
          <p>roberta-large
Ensemble with Voting classifier (  as a input feature ) 0.751


data, whereas a onevsrest classifier outperforms other methods with  
score of 64.9 on
Binary-Spanish data.
feature vector (  ) outperforms the other approaches with  
hand, a linear SVC classifier with averaged feature vector (   ) as an input outperforms the
score of 62.5. On the other
other approaches with  
score of 65.4 for the Multiclass-Spanish data.


0.594
0.575
0.558
0.574
0.582
0.579
roberta-large
Ensemble with Linear SVC classifier (  as a input feature ) 0.653</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we described our submission to the AuTexTification shared task which consists
of two tasks on the classification of generative AI content. In our experiments, we found that
our proposed ensemble LLM approach is a promising strategy, as our model ranked fith with a
macro  1 score of 73.3% for English and thirteenth with 64.9% macro  1 score for Spanish in the
Human or generated binary classification task. For the
Model attribution multiclass classification
task, our model ranked in the first place for both English and Spanish, with macro  1 scores
of 62.5% and 65.3%, respectively. While our approach shows promising results for the Model
attribution task, further exploration is needed to enhance and tune our models for Human or
generated binary classification task.
22–25.</p>
      <p>e0000198.
[1] M. Hutson, Robo-writers: the rise and risks of language-generating ai, Nature 591 (2021)
R. Aggabao, G. Diaz-Candido, J. Maningo, et al., Performance of chatgpt on usmle: Potential
for ai-assisted medical education using large language models, PLoS digital health 2 (2023)
[3] M. Wang, M. Wang, F. Yu, Y. Yang, J. Walker, J. Mostafa, A systematic review of automatic
text summarization for biomedical literature and ehrs, Journal of the American Medical
Informatics Association 28 (2021) 2287–2297.
[4] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D.
Rosenberg, G. Mann, Bloomberggpt: A large language model for finance, arXiv preprint
arXiv:2303.17564 (2023).</p>
      <p>arXiv:2303.09136 (2023).
[5] Z. Sun, A short survey of viewing large language models in legal aspect, arXiv preprint
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
neural information processing systems 33 (2020) 1877–1901.
[7] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li,
X. V. Lin, et al., Opt: Open pre-trained transformer language models, arXiv preprint
arXiv:2205.01068 (2022).
[8] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni,
F. Yvon, M. Gallé, et al., Bloom: A 176b-parameter open-access multilingual language
model, arXiv preprint arXiv:2211.05100 (2022).
[9] H. Li, J. T. Moon, S. Purkayastha, L. A. Celi, H. Trivedi, J. W. Gichoya, Ethics of large
language models in medicine and medical research, The Lancet Digital Health (2023).
[10] M. Turpin, J. Michael, E. Perez, S. R. Bowman, Language models don’t always say
what they think: Unfaithful explanations in chain-of-thought prompting, arXiv preprint
arXiv:2305.04388 (2023).
[11] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung,
et al., A multitask, multilingual, multimodal evaluation of chatgpt on reasoning,
hallucination, and interactivity, arXiv preprint arXiv:2302.04023 (2023).
[12] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey
of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.
[13] H. Ali, J. Qadir, Z. Shah, Chatgpt and large language models (llms) in healthcare:
Opportunities and risks (2023).
[14] L. Weidinger, J. Mellor, M. Rauh, C. Grifin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese,
B. Balle, A. Kasirzadeh, et al., Ethical and social risks of harm from language models, arXiv
preprint arXiv:2112.04359 (2021).
[15] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, Detectgpt: Zero-shot
machinegenerated text detection using probability curvature, arXiv preprint arXiv:2301.11305
(2023).
[16] B. Li, Y. Weng, Q. Song, H. Deng, Artificial text detection with multiple training strategies,
arXiv preprint arXiv:2212.05194 (2022).
[17] N. Maloyan, B. Nutfullin, E. Ilyushin, Dialog-22 ruatd generated text detection, arXiv
preprint arXiv:2206.08029 (2022).
[18] S. Mitrović, D. Andreoletti, O. Ayoub, Chatgpt or human? detect and explain. explaining
decisions of machine learning model for detecting short chatgpt-generated text, arXiv
preprint arXiv:2301.13852 (2023).
[19] S. Gehrmann, H. Strobelt, A. M. Rush, Gltr: Statistical detection and visualization of
generated text, arXiv preprint arXiv:1906.04043 (2019).
[20] M. Gallé, J. Rozen, G. Kruszewski, H. Elsahar, Unsupervised and distributional detection
of machine-generated text, arXiv preprint arXiv:2111.02878 (2021).
[21] A. M. Sarvazyan, J. Á. González, M. Franco Salvador, F. Rangel, B. Chulvi, P. Rosso,
Overview of autextification at iberlef 2023: Detection and attribution of machine-generated
text in multiple domains, in: Procesamiento del Lenguaje Natural, Jaén, Spain, 2023.
[22] S. M. Jiménez-Zafra, F. Rangel, M. Montes-y Gómez, Overview of IberLEF 2023: Natural
Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento
del Lenguaje Natural 71 (2023).
[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in:
Proceedings of the 2020 conference on empirical methods in natural language processing:
system demonstrations, 2020, pp. 38–45.
[24] S. Sulaiman, R. A. Wahid, A. H. Arifin, C. Z. Zulkifli, Question classification based on
cognitive levels using linear svc, Test Eng Manag 83 (2020) 6463–6470.
[25] K.-H. Liu, J. Gao, Y. Xu, K.-J. Feng, X.-N. Ye, S.-T. Liong, L.-Y. Chen, A novel soft-coded
error-correcting output codes algorithm, Pattern Recognition 134 (2023) 109122.
[26] J.-H. Hong, S.-B. Cho, A probabilistic multi-class strategy of one-vs.-rest support vector
machines for cancer classification, Neurocomputing 71 (2008) 3275–3281.
[27] A. Mahabub, A robust technique of fake news detection using ensemble voting classifier
and comparison with other classifiers, SN Applied Sciences 2 (2020) 525.
[28] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled
attention, in: International Conference on Learning Representations, 2021. URL: https:
//openreview.net/forum?id=XPZIaotutsD.
[29] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.
[30] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger,
J. W. Kim, S. Kreps, et al., Release strategies and the social impacts of language models,
arXiv preprint arXiv:1908.09203 (2019).
[31] J. D. la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González
de Prado Salas y María Grandury, Bertin: Eficient pre-training of a spanish language
model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23.</p>
      <p>URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.
[32] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller,
C. R. Penagos, A. G. Agirre, M. Villegas, Maria: Spanish language models, Procesamiento
del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#
.YyMTB4X9A-0.mendeley. doi:1 0 . 2 6 3 4 2 / 2 0 2 2 - 6 8 - 3 .
[33] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
scale, arXiv preprint arXiv:1911.02116 (2019).
[34] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in:
EMNLP, Association for Computational Linguistics, 2019. URL: https://www.aclweb.org/
anthology/D19-1371.
[35] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692. a r X i v : 1 9 0 7 . 1 1 6 9 2 .
[36] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer,
arXiv:2004.05150 (2020).
[37] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
org/abs/1810.04805. a r X i v : 1 8 1 0 . 0 4 8 0 5 .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Medenilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sillos</surname>
          </string-name>
          , L. De Leon,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elepaño</surname>
          </string-name>
          , M. Madriaga,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>