<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-031-28241-6_20</article-id>
      <title-group>
        <article-title>Team CNLP-NITS-PP at PAN: Advancing Generative AI Detection: Mixture of Experts with Transformer Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lekkala Sai Teja</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annepaka Yadagiri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Partha Pakray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Silchar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>236</fpage>
      <lpage>241</lpage>
      <abstract>
        <p>Generative Artificial Intelligence (Gen AI) text is spreading globally, from mundane to significant matters. Readers often cannot tell whether a text was written by a human or by an AI, and authors frequently mix their own content into AI-generated text. This work proposes a method for classifying potentially obfuscated machine-generated text and for classifying documents collaboratively authored by humans and AI. It was carried out as part of the PAN shared task at CLEF 2025, Voight-Kampff Generative AI Detection. Our method integrates a Mixture-of-Experts (MoE) architecture with transformer-based language models for text classification and covers two tasks: Voight-Kampff AI Detection Sensitivity and Human-AI Collaborative Text Classification. The SoftMoE variant employs a gating mechanism to dynamically combine expert outputs, while the HardMoE variant selects a single expert per input. We placed 5th in Subtask 1 and 11th in Subtask 2, with our results consistently outperforming the official baselines, showing that MoE-enhanced models achieve competitive performance.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>Gen AI Detection</kwd>
        <kwd>Mixture-of-Experts</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Significant advances in transformer-based language models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have had a great impact on
natural language processing (NLP) [2] capabilities, particularly in text classification tasks. These
models, such as BERT [3], RoBERTa [4], and DeBERTa [5], have demonstrated remarkable performance
across a wide range of benchmarks by capturing deep contextual representations and long-range
dependencies in text. However, the computational complexity and resource demands of these models
pose challenges for both scalability and efficiency as the number of trainable parameters grows during
training. Moreover, the trend of continuously increasing the size of model architectures to gain further
improvements in accuracy raises concerns about energy consumption and inference speed. As a result,
many researchers are interested in developing lightweight and scalable transformer variants or hybrid
architectures that maintain high accuracy while significantly reducing computational overhead and
energy consumption, thereby improving environmental sustainability.
      </p>
      <p>Mixture of Experts (MoE) [6, 7] architectures offer a promising solution by distributing the
computational load across multiple specialized sub-networks, or “experts”, through a gating network, each of
which is responsible for handling different aspects or subsets of the input data. This dynamic allocation
of processing allows the model to activate only a small portion of its total parameters during
inference, which reduces computational overhead while preserving or even enhancing performance.</p>
      <p>In this study, we investigate the application of both Soft and Hard MoE frameworks integrated with
transformer models, including DistilBERT [8], DeBERTa, ModernBERT [9], XLNet [10], RoBERTa [4],
and ALBERT [11], for binary and multi-class text classification on the respective datasets. The Soft MoE
dynamically combines expert outputs through a gating mechanism, while the Hard MoE selects a single
expert per input, optimizing for computational sparsity. By leveraging the CLS token for classification
and visualizing expert routing, we aim to evaluate the effectiveness of these MoE variants in improving
classification performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task</title>
      <p>Generative AI Detection is a shared task organized by the PAN 2025 lab [12, 13, 14]
on digital text forensics and stylometry. It is divided into two subtasks: Subtask 1 (Webis),
AI Detection Sensitivity Analysis, and Subtask 2 (MBZUAI), fine-grained recognition of
human-AI collaborative documents.</p>
      <p>Subtask 1: AI Detection Sensitivity Analysis for Identifying Unobfuscated and Obfuscated
LLM-Generated Text.</p>
      <p>Subtask 2: Detailed classification of documents created through human-AI collaboration. For a given
document produced by humans and AI systems, assign it to one of these categories: (1) Fully
human-written, (2) Human-initiated, then machine-continued, (3) Human-written, then machine-polished, (4)
Machine-written, then machine-humanized (obfuscated), (5) Machine-written, then human-edited, (6)
Deeply-mixed text sections.</p>
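      <p>For illustration, one possible encoding of these six categories as integer labels is sketched below; the mapping and its ordering are our own assumption, mirroring the list above, and are not an official specification.</p>
      <preformat>
# Hypothetical label mapping for Subtask 2, following the category order above.
SUBTASK2_LABELS = {
    0: "fully human-written",
    1: "human-initiated, then machine-continued",
    2: "human-written, then machine-polished",
    3: "machine-written, then machine-humanized (obfuscated)",
    4: "machine-written, then human-edited",
    5: "deeply-mixed text sections",
}
      </preformat>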
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Statistics</title>
      <p>The task provides two datasets, one for each subtask.</p>
      <p>The Subtask 1 dataset contains human-written texts and texts generated by the AI models gpt-3.5-turbo,
gpt-4o-mini, gpt-4o, ministral-8b-instruct-2410, gemini-2.0-flash, o3-mini, gemini-1.5-pro, llama-3.1-8b-instruct,
deepseek-r1-distill-qwen-32b, falcon3-10b-instruct, llama-3.3-70b-instruct, gpt-4.5-preview,
gpt-4-turbo-paraphrase, gemini-pro, gpt-4-turbo, qwen1.5-72b-chat-8bit, llama-2-70b-chat, mistral-7b-instruct-v0.2,
gemini-pro-paraphrase, text-bison-002, mixtral-8x7b-instruct-v0.1, llama-2-7b-chat. The model distribution
is shown in Figure 1.</p>
      <p>The Subtask 2 dataset is composed of AI texts from the models mixtral-8x7b, gpt-4o,
llama3-70b, gemma-7b-it, llama3-8b, gemma2-9b-it, chatgpt, gemini1.5, llm1-llm2, gpt-3.5-turbo-to-mistral-7b,
mistral-7b, gpt-3.5-turbo-to-gemini1.5, gpt-3, claude3.5-sonnet, llama-3-70b, gpt-4, llama2, mgt, chatglm,
stablelm, dolly, llama3.1-405b, chatgpt-turbo. The model distribution is shown in Figure 2.</p>
      <p>Further textual analysis of both subtask datasets is given in Appendix A.</p>
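      <p>As a minimal sketch of how these model distributions can be tallied, the snippet below counts examples per generator in a JSONL split; the file name and the model field are assumptions about the data format, used here for illustration only.</p>
      <preformat>
import json
from collections import Counter

# Count how many texts each generator contributed to a JSONL split.
# "train.jsonl" and the "model" field name are illustrative assumptions.
counts = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("model", "human")] += 1

for model, n in counts.most_common():
    print(f"{model}: {n}")
      </preformat>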
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <p>Our proposed system integrates a Mixture-of-Experts (MoE) architecture with transformer-based
language models to enhance binary and multi-class text classification performance. We implemented
two variants of MoE, namely SoftMoE and HardMoE, with 2 (for Subtask 1) and 6 (for Subtask 2) experts,
applied to a diverse set of pre-trained transformer models.</p>
      <sec id="sec-4-1">
        <title>4.1. System Architecture</title>
        <p>The system is built on a transformer-based backbone with an MoE layer as the classification head. The
transformer backbones include DistilBERT, RoBERTa, ALBERT, XLNet, DeBERTa, and ModernBERT, all
base models.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MoE Layer</title>
        <p>In our approach, we utilize two distinct types of Mixture of Experts (MoE) classifiers:
HardMoE and SoftMoE. The HardMoE classifier operates using a discrete gating mechanism: a
lightweight linear gating network takes the CLS token output of the transformer, h_CLS =
Transformer(x)[:, 0, :], and transforms it into a vector of expert scores, g = W h_CLS + b.
The expert with the highest score is chosen via an argmax operation,
ensuring that only a single expert is activated per input. The selected expert processes the input
and produces a prediction via a softmax layer. Additionally, the raw gating scores can be used to
compute auxiliary losses during training.</p>
        <p>In contrast, the SoftMoE classifier relies on a continuous, probabilistic gating mechanism. Instead of
selecting just one expert, the gating network generates a score for each expert, which is then normalized
using the softmax function to produce a set of attention-like weights. These weights are used to compute
a weighted combination of all expert outputs, allowing the model to leverage information from all
experts simultaneously. The core distinction between HardMoE and SoftMoE lies in this gating strategy:
while HardMoE enforces a strict “winner-takes-all” approach, SoftMoE softly blends contributions from
all available experts. The forward pass logic for both architectures, including the flow of data and the
classification process, is detailed in Algorithm 1, and a deeper visualization of the expert routing can be
seen in Figure 3.</p>
        <p>Soft MoE: Consists of 2 or 6 expert linear layers (two for Subtask 1, six for Subtask 2), each mapping
the 768-dimensional CLS token to the 2 or 6 output classes. A gating network (a linear layer followed by a
softmax) computes weights for each expert, producing a weighted sum of expert outputs. Gate weights are
stored for visualization and are shown in Appendix B (Figure 6).</p>
        <p>Hard MoE: Similar to Soft MoE but selects a single expert per input based on the highest gate weight,
enforcing computational sparsity. The MoE layer replaces the standard classification head, leveraging
specialized expert knowledge for diverse input patterns.</p>
        <p>Dropout: A dropout layer with a probability of 0.1 is applied to the CLS token before the MoE layer
to mitigate overfitting.</p>
        <p>Algorithm 1 Forward Pass for MoE Classifier (Hard and Soft)</p>
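        <p>Since the listing of Algorithm 1 is not reproduced here, the following PyTorch sketch illustrates the forward pass as described above: the 768-dimensional CLS representation with dropout of 0.1, a linear gating network, and either soft mixing or hard argmax selection over the experts. Class and variable names are our own, and the snippet is an illustrative approximation rather than the exact competition code.</p>
        <preformat>
import torch
import torch.nn as nn

class MoEClassifier(nn.Module):
    """Illustrative MoE head on top of a Hugging Face transformer backbone."""

    def __init__(self, backbone, hidden_size=768, num_experts=2, num_classes=2, soft=True):
        super().__init__()
        self.backbone = backbone                     # e.g. AutoModel.from_pretrained(...)
        self.dropout = nn.Dropout(0.1)               # applied to the CLS token
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, num_classes) for _ in range(num_experts)]
        )
        self.soft = soft

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        cls = self.dropout(hidden[:, 0, :])          # Transformer(x)[:, 0, :]
        gate_scores = self.gate(cls)                 # g = W h_CLS + b
        expert_logits = torch.stack([expert(cls) for expert in self.experts], dim=1)

        if self.soft:
            # SoftMoE: softmax over gate scores, weighted sum of all expert outputs.
            weights = torch.softmax(gate_scores, dim=-1)
            logits = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
        else:
            # HardMoE: argmax selects a single expert per input.
            chosen = gate_scores.argmax(dim=-1)
            logits = expert_logits[torch.arange(cls.size(0)), chosen]
        return logits, gate_scores
        </preformat>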
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training Method</title>
        <p>Models are trained on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance
configured for accelerated computing. We used a g6e.xlarge instance, which provides a
3rd-generation AMD EPYC processor (AMD EPYC 7R13), an NVIDIA L40S Tensor Core GPU with 48 GB
of GPU memory, 4 vCPUs with 32 GiB of memory, and a network bandwidth of 20 Gbps. The OS is
Ubuntu Server 24.04 LTS (HVM) with an EBS General Purpose (SSD) volume.</p>
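        <p>For reference, an instance of this type can be provisioned programmatically; the boto3 sketch below is illustrative only, and the region, AMI id, and key pair are placeholders rather than the values we used.</p>
        <preformat>
import boto3

# Launch a g6e.xlarge instance (NVIDIA L40S, 48 GB GPU memory) via boto3.
# The AMI id and key pair below are placeholders, not the ones used in this work.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # an Ubuntu Server 24.04 LTS AMI for the region
    InstanceType="g6e.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
)
print(response["Instances"][0]["InstanceId"])
        </preformat>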
        <p>Models are trained on a CUDA-enabled GPU with the same hyperparameter settings for all models:
a batch size of 32, a maximum sequence length of 512, the AdamW optimizer with a learning rate of 1e-5
and a weight decay of 0.01, cross-entropy loss, and a ReduceLROnPlateau scheduler that reduces the learning
rate by a factor of 0.1 if the validation loss plateaus for 1 epoch. Training runs for up to 10 epochs with
early stopping, selecting the checkpoint with the maximum mean of ROC-AUC, Brier, c@1, F1, and F0.5u for
Subtask 1, and the maximum recall for Subtask 2.</p>
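        <p>A minimal sketch of this optimization setup in PyTorch is shown below; the model, the data loaders, and the train_one_epoch and evaluate helpers are assumed to be defined elsewhere, and the early-stopping bookkeeping is simplified for illustration.</p>
        <preformat>
import torch

# Hyperparameters as described above; model and data loaders are assumed to exist.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=1
)

best_score, patience_left = float("-inf"), 3     # early-stopping budget is illustrative
for _ in range(10):                              # up to 10 epochs
    train_one_epoch(model, train_loader, optimizer, criterion)    # assumed helper
    val_loss, val_score = evaluate(model, val_loader, criterion)  # assumed helper:
    # val_score is the mean of the Subtask 1 metrics or recall for Subtask 2
    scheduler.step(val_loss)
    if val_score > best_score:
        best_score, patience_left = val_score, 3
        torch.save(model.state_dict(), "best_model.pt")
    else:
        patience_left -= 1
        if patience_left == 0:
            break
        </preformat>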
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>For Subtask 1, we submitted our best-performing model to TIRA [15] for execution, and for
Subtask 2, we submitted the corresponding “.zip” file, containing a “predictions.jsonl” file with ‘id’
and ‘label’ fields, to CodaLab. Table 6 shows the performance of the models on Subtask 1 on the validation
and smoke-test sets. Table 7 shows the performance of the models on Subtask 2 on the dev set. The AUC-ROC
curves of a few models for Subtask 2 are shown in Appendix D. All training results for Subtask 1 and
Subtask 2 can be seen in Appendix C. We ranked 5th and 11th in Subtask 1 and Subtask 2,
respectively. The final results on the official leaderboard are shown in Tables 4 and 5.</p>
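      <p>A minimal sketch of writing the Subtask 2 submission file in the expected format (one JSON object per line with an ‘id’ and a ‘label’) is given below; how the ids and the predicted labels are produced is assumed.</p>
      <preformat>
import json

# Write predictions in the submission format: one {"id": ..., "label": ...} object per line.
def write_predictions(ids, labels, path="predictions.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, label in zip(ids, labels):
            f.write(json.dumps({"id": doc_id, "label": int(label)}) + "\n")

# Example usage: write_predictions(test_ids, model_predictions)
      </preformat>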
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our contribution to the PAN 2025 Voight-Kampff Generative AI Detection task. We
combined a Mixture-of-Experts architecture with several transformer backbones and evaluated which model
gives the best performance, surpassing the baselines. An ablation study on expert routing highlights
the critical role of the gating mechanism in enhancing performance. We placed 5th in
Subtask 1 and 11th in Subtask 2, with our results consistently outperforming the official
baselines. These rankings validate the effectiveness and generalizability of our proposed approach
across multiple evaluation criteria. However, further analysis of misclassified cases could uncover
specific weaknesses for future improvement. Our findings highlight the scalability, interpretability, and
strong performance of MoE-enhanced transformers, establishing a robust framework for advancing
generative AI detection.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly, ChatGPT, and Gemini to: check
grammar and spelling, paraphrase, reword, and refine code, improve writing style, and generate
Overleaf LaTeX tables. After using these tools/services, the author(s) reviewed and edited the content as needed
and take(s) full responsibility for the content of the publication.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Data Analysis</title>
      <sec id="sec-8-1">
        <title>A.1. Subtask 1 dataset</title>
        <p>We visualize the train and dev data with the following linguistic features, by label count: 1) Stop Word
Count, 2) Hapax Legomenon Rate, 3) Type Token Ratio. [Figure: Subtask 1 train and dev dataset histograms of
Type Token Ratio, Hapax Legomenon Rate, and Stop Word Count for the Human and Machine labels.]</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Subtask 2 dataset</title>
        <p>We visualize the train and dev data with the following linguistic features, by label count: 1) Bigram
Uniqueness, 2) Hapax Legomenon Rate, 3) Type Token Ratio. [Figure: Subtask 2 train and dev dataset histograms
of Type Token Ratio, Hapax Legomenon Rate, and Bigram Uniqueness for all classes.]</p>
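        <p>A minimal sketch of how these features can be computed per document is given below; whitespace tokenization, a tiny built-in stop-word list, and normalizing the hapax count by the number of tokens are our simplifying assumptions.</p>
        <preformat>
from collections import Counter

# A small English stop-word list keeps the sketch self-contained; a full list (e.g. NLTK's) could be used.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "for"}

def linguistic_features(text):
    tokens = text.lower().split()            # simplifying assumption: whitespace tokenization
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    bigrams = list(zip(tokens, tokens[1:]))
    return {
        "stop_word_count": sum(1 for t in tokens if t in STOP_WORDS),
        "type_token_ratio": len(counts) / n,
        "hapax_legomenon_rate": sum(1 for c in counts.values() if c == 1) / n,
        "bigram_uniqueness": len(set(bigrams)) / max(len(bigrams), 1),
    }

print(linguistic_features("the cat sat on the mat and the dog sat too"))
        </preformat>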
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. SoftMoE Expert Routing SubTask 2</title>
      <p>The expert routing visualizations that are present in Figure 6 reveal that the gating mechanism in
SoftMoE exhibits a preference toward a single expert across all transformer backbone models, such as
DistilBERT, ALBERT, and RoBERTa. This skewed distribution indicates that while the gating network is
functional, it often fails to fully utilize the diversity of available experts. From the algorithm (Algorithm 1),
it is evident that the gating logits are computed from the [CLS] token via a learned linear transformation,
and a softmax operation determines the expert weights in the SoftMoE setting. The consistent expert
bias suggests that the learned gating transformation overfits to favor a specific semantic representation
or decision path within the expert pool. While this kind of routing may be beneficial in tasks where a single
dominant representation suffices, it limits the potential of MoE architectures to exploit expert
diversity.</p>
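      <p>The routing bias discussed above can be quantified by averaging the stored gate weights over a dataset; the sketch below assumes the per-batch SoftMoE gate weights were collected during evaluation.</p>
      <preformat>
import torch

# gate_weights: list of [batch_size, num_experts] softmax weights saved during evaluation (assumed).
def mean_expert_usage(gate_weights):
    all_weights = torch.cat(gate_weights, dim=0)   # [num_examples, num_experts]
    usage = all_weights.mean(dim=0)                # average routing weight per expert
    dominant = usage.argmax().item()               # index of the most-used expert
    return usage, dominant

# Example usage: usage, dominant = mean_expert_usage(stored_gate_weights)
# A near one-hot `usage` vector indicates the collapse onto a single expert seen in Figure 6.
      </preformat>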
    </sec>
    <sec id="sec-10">
      <title>C. Training Results</title>
      <p>[Table: full training results for all models on Subtask 1 and Subtask 2.]</p>
      <sec id="sec-10-1">
        <title>D.2. HardMoE AUC-ROC</title>
        <p>[Figure: AUC-ROC curves for the Hard MoE variants of DistilBERT, ALBERT, DeBERTa, ModernBERT,
RoBERTa-base, XLNet, and DeBERTa-V3-Large.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>