<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team jr at Generative Plagiarism Detection 2025: A Two-Stage Approach from TF-IDF/Jaccard Filtering to Transformer Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jieren Luo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mancheng Huang</string-name>
          <email>huangmc3@chinaunicom.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Biao Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <email>hanzhongyuan@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>China United Network Communications Group Co., Ltd.</institution>
          <addr-line>Foshan Branch</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Detecting plagiarism in text generated by large language models is challenging due to the high fluency and variability of AI-authored passages. We address this task with a two-stage pipeline. In the first stage, we apply two lightweight filters in sequence, TF-IDF cosine similarity followed by character 3-gram Jaccard similarity, which quickly removes sentence pairs that are unlikely to contain plagiarism. In the second stage, the remaining candidates are batched and fed through a fine-tuned BERT-base classifier. Evaluated on the PAN 2025 validation set, our system achieves a micro_F1 of 0.2158 (micro_recall 0.1767, micro_precision 0.5819) and a macro_F1 of 0.1890 (macro_recall 0.1503, macro_precision 0.5642).</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>Generative Plagiarism Detection</kwd>
        <kwd>TF-IDF Filtering</kwd>
        <kwd>Jaccard Similarity</kwd>
        <kwd>BERT Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>To achieve both high throughput and strong detection quality, we split our system into two stages. Stage One applies lightweight IR filters (TF-IDF cosine similarity followed by character 3-gram Jaccard) to prune most sentence pairs, and Stage Two runs a fine-tuned BERT classifier only on the remaining high-confidence candidates.</p>
      <p>[Figure 1: Pipeline overview. Input: suspicious and source texts; Stage 1: TF-IDF calculation and Jaccard similarity filtering; Stage 2: BERT classification; output: suspicious-source XML.]</p>
      <sec id="sec-2-1">
        <title>2.1. Stage One: IR-Based Filtering</title>
        <p>In the first stage, we leverage two lightweight information-retrieval techniques to quickly eliminate
the vast majority of sentence pairs that are unlikely to contain reused text. We begin by fitting a
single TF-IDF vectorizer over all sentences from both suspicious and source documents; this global
TF-IDF model captures the overall distribution of terms in the entire corpus. At inference time, each
candidate sentence pair is transformed into sparse TF-IDF vectors and compared via cosine similarity
[5], a technique also demonstrated to be effective in detecting both direct copying and online plagiarism
when paired with semantic models. Pairs whose similarity falls below our threshold are discarded
immediately, since they share too little contextual content to plausibly be paraphrased or reused. The
surviving sentence pairs then undergo a second round of filtering based on character 3-gram Jaccard
similarity: we convert each sentence into a set of character 3-gram shingles and compute the ratio
of their intersection to their union. If this Jaccard score is below the appropriate obfuscation-specific
threshold, we again discard the pair. By cascading TF-IDF (to remove grossly dissimilar pairs) and
character 3-gram Jaccard (to weed out noisy overlaps), we reduce the candidate pool by several orders
of magnitude. This ensures that only a small, high-likelihood subset progresses to the more expensive
Transformer-based classification in the next stage.</p>
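        <p>As an illustration, the Stage One cascade can be sketched in pure Python. This is a minimal re-implementation for exposition only: the actual system presumably fits its global TF-IDF model with a standard vectorizer, all function names below are illustrative, and the default thresholds (0.7 for TF-IDF cosine, 0.4 for Jaccard on “simple” obfuscation) follow the setup described in Section 3.</p>

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Fit one global TF-IDF model over all sentences and return one
    sparse vector (dict of term -> weight) per sentence."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def char_3gram_jaccard(a, b):
    """Jaccard similarity over character 3-gram shingles."""
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    union = ga.union(gb)
    return len(ga.intersection(gb)) / len(union) if union else 0.0

def stage_one_filter(pairs, tfidf_threshold=0.7, jaccard_threshold=0.4):
    """Cascade: discard pairs below the TF-IDF cosine threshold first,
    then apply the character 3-gram Jaccard check to the survivors."""
    sentences = sorted({s for pair in pairs for s in pair})
    vecs = dict(zip(sentences, tfidf_vectors(sentences)))
    survivors = []
    for susp, src in pairs:
        if cosine(vecs[susp], vecs[src]) >= tfidf_threshold:
            if char_3gram_jaccard(susp, src) >= jaccard_threshold:
                survivors.append((susp, src))
    return survivors
```

        <p>Only pairs that clear both thresholds are passed on to the Transformer classifier, so the expensive model never sees the bulk of the sentence cross-product.</p>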
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Stage Two: Transformer-Based Classification</title>
        <p>In the second stage, all sentence pairs that survived IR-based filtering are prepared for
Transformer-based classification. First, we organize the remaining pairs into chunks of a fixed size (e.g., 100 pairs
per chunk) to balance GPU memory usage and throughput. Each chunk is then split into batches
(e.g., 64 pairs per batch), and every sentence pair within a batch is tokenized and padded to a uniform
length before being moved onto the GPU. We utilize a Hugging Face text-classification pipeline backed
by a BERT-base model fine-tuned on our prepared training data [6]. During inference, the pipeline
receives each batch of tokenized sentence pairs and outputs a softmax confidence score for “plagiarism”
versus “non-plagiarism”. We compare each pair’s plagiarism confidence to its obfuscation-specific
threshold (higher for medium/hard, slightly lower for simple). Only those pairs whose BERT score
exceeds the threshold are marked as positive detections. This approach ensures that BERT is applied
only to the small subset of pairs most likely to contain reused text—maximizing classification accuracy
while keeping GPU utilization efficient.</p>
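        <p>The chunking, batching, and thresholding logic described above can be sketched as follows. The score_batch argument is a hypothetical stand-in for the Hugging Face text-classification pipeline call, returning one plagiarism confidence per pair; the chunk/batch sizes and confidence cutoffs mirror the values given in Section 3.</p>

```python
THRESHOLDS = {"simple": 0.75, "medium": 0.8, "hard": 0.8}

def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]

def classify_pairs(pairs, score_batch, obfuscation="simple",
                   chunk_size=100, batch_size=64):
    """Stage Two sketch: group surviving pairs into chunks, score each
    batch with the classifier, and keep pairs whose plagiarism
    confidence clears the obfuscation-specific threshold."""
    threshold = THRESHOLDS[obfuscation]
    detections = []
    for chunk in chunked(pairs, chunk_size):
        for batch in chunked(chunk, batch_size):
            scores = score_batch(batch)  # one confidence score per pair
            for pair, score in zip(batch, scores):
                if score >= threshold:
                    detections.append(pair)
    return detections
```

        <p>In the real system the scoring callable wraps the tokenizer and the fine-tuned BERT model on the GPU; the chunking and thresholding around it are plain bookkeeping.</p>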
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Output Generation</title>
        <p>In the final stage, we generate structured output files in the PAN XML format to report all detected cases
of plagiarism. For each file pair processed in a chunk, the system collects all positively classified sentence
pairs, merges overlapping or adjacent segments using a configurable merge_gap, and writes the resulting
intervals to an XML document. Each plagiarism instance is encoded as a &lt;feature
name="detected-plagiarism" ... /&gt; element, specifying both the suspicious and source text offsets and lengths. This step
ensures that the output aligns with PAN’s evaluation format and can be directly used for scoring. Our
implementation automates this process within each chunk, leveraging batch BERT results and filtering
logic to generate accurate and concise detection records, efficiently saving them to disk.</p>
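        <p>A sketch of the merge-and-write step using Python's standard ElementTree. The merge_gap behaviour follows the description above; the attribute names on the &lt;feature name="detected-plagiarism" ... /&gt; elements are our best-effort rendering of the PAN detections format and should be checked against the official schema.</p>

```python
import xml.etree.ElementTree as ET

def merge_spans(spans, merge_gap=50):
    """Merge (offset, length) spans that overlap or lie within
    merge_gap characters of each other."""
    merged = []
    for off, length in sorted(spans):
        if merged:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len + merge_gap >= off:
                end = max(prev_off + prev_len, off + length)
                merged[-1] = (prev_off, end - prev_off)
                continue
        merged.append((off, length))
    return merged

def write_pan_xml(susp_name, src_name, cases):
    """Serialize detections for one document pair. Each case is
    ((susp_offset, susp_len), (src_offset, src_len))."""
    root = ET.Element("document", reference=susp_name)
    for (s_off, s_len), (r_off, r_len) in cases:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(s_off),
            "this_length": str(s_len),
            "source_reference": src_name,
            "source_offset": str(r_off),
            "source_length": str(r_len),
        })
    return ET.tostring(root, encoding="unicode")
```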
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment</title>
      <sec id="sec-3-1">
        <title>3.1. Experimental Setup</title>
        <p>1. Preprocessing: We first read the pairs file to identify every suspicious-source document pair. For each listed pair, both the suspicious and the source documents are loaded from their respective folders, and their raw text is normalized (e.g., ensuring consistent UTF-8 encoding). We then apply NLTK’s sentence tokenizer to split each document into a sequence of sentences [7]. Once all sentences are extracted, we form the initial candidate pool by pairing each sentence in the suspicious document with each sentence in the source document. This complete cross-product of sentences becomes the input to our first filtering stage, ensuring that every potential plagiarism fragment (regardless of its position) is considered.</p>
        <p>2. Data: The PAN 2025 organizers provide three separate datasets, spot_check, train, and validation (available at https://zenodo.org/records/14969012), each following the same subfolder layout: a src/ folder, a susp/ folder, a pairs file listing suspicious-source pairs, and a corresponding _truth folder containing XML annotations. The train and validation datasets differ only in scale: train contains roughly 2.4 million pairs (used for model fine-tuning), validation contains approximately 238k pairs (held out for final scoring), and spot_check is a small sample meant for rapid local testing. During evaluation, our system must produce an XML file named &lt;suspicious&gt;-&lt;source&gt;.xml for each pair in the pairs list; each output XML includes &lt;feature name="detected-plagiarism" ... &gt; elements that are compared against the ground-truth &lt;feature name="plagiarism" ... &gt; entries in the corresponding _truth folder.</p>
        <p>3. Parameters: We extract positive and negative sentence pairs using a helper script that reads each _truth XML, splits the specified spans into sentences (positive examples), and then randomly samples non-overlapping sentences as negatives at a fixed ratio (approximately 0.43 negatives per positive). To limit computational overhead and simulate a low-resource setting, we randomly selected 10,000 sentence pairs from the full training set for BERT fine-tuning, drawing positives and negatives in a 7:3 ratio (i.e., 7,000 positive pairs and 3,000 negative pairs). We fine-tuned BERT with a learning rate of 2 × 10⁻⁵ and weight decay of 0.01 for 3 epochs, using a per-device train batch size of 4 (effective batch size 16) and mixed precision. For filtering, we compute TF-IDF cosine similarity (keeping pairs above 0.7) and then apply character 3-gram Jaccard (removing pairs below 0.4 for “simple” versus 0.7 for other obfuscations). Surviving pairs are grouped into chunks of 100 and processed in batches of 64 by the fine-tuned BERT model, using confidence cutoffs of 0.75 for “simple” and 0.8 for “medium/hard”.</p>
        <p>4. Baseline: The PAN reference implementation (https://github.com/pan-webis-de/pan-code/tree/master/clef25/generated-plagiarism-detection) uses a simple character-level n-gram approach to detect near-copy plagiarism. First, it preprocesses each suspicious document by removing all punctuation and whitespace and then sliding a fixed-length (50-character) window over the text to build a hash table of every 50-character n-gram and its positions. Next, for each source document, it similarly slides the same 50-character window and looks up whether that n-gram exists in the suspicious document’s index. Whenever a match is found, the baseline attempts to extend the match forward (skipping over any punctuation/whitespace) to produce the longest possible contiguous aligned span. All detected spans are collected and written out as &lt;feature name="detected-plagiarism" ... &gt; entries in an XML file named &lt;suspicious&gt;-&lt;source&gt;.xml [8]. This “IR-only” method is fast and easy to implement but tends to miss paraphrased or heavily obfuscated passages.</p>
        <p>5. Evaluation Metrics: We use the standard PAN toolkit to measure micro- and macro-averaged recall and precision on character-level overlaps, along with granularity [9] (the average number of detected segments per true case). During evaluation, each ground-truth &lt;feature name="plagiarism" ... &gt; is compared against our &lt;feature name="detected-plagiarism" ... &gt; entries. Micro-averaged metrics count total overlapping characters across all documents, while macro-averaged metrics compute recall and precision separately for each case or detection before averaging. Granularity quantifies how many detected spans overlap each true plagiarism case, ideally one-to-one. These measures are computed by the official plagiarism-detection-evaluation.py script, which outputs micro_recall, micro_precision, macro_recall, macro_precision, and granularity for final scoring.</p>
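        <p>For concreteness, the core of the baseline's indexing step can be sketched as follows. This is a simplified re-implementation: the real baseline also extends each match forward across punctuation to form maximal spans, and the window length n defaults to the 50 characters described above.</p>

```python
def normalize(text):
    """Strip punctuation and whitespace, keeping a map from positions in
    the normalized string back to offsets in the original text."""
    chars, offsets = [], []
    for i, c in enumerate(text):
        if c.isalnum():
            chars.append(c.lower())
            offsets.append(i)
    return "".join(chars), offsets

def baseline_detect(susp, src, n=50):
    """Index every n-character window of the normalized suspicious text,
    slide the same window over the source, and report matching start
    offsets in original-text coordinates (suspicious, source)."""
    s_norm, s_off = normalize(susp)
    index = {}
    for i in range(len(s_norm) - n + 1):
        index.setdefault(s_norm[i:i + n], []).append(i)
    r_norm, r_off = normalize(src)
    matches = []
    for j in range(len(r_norm) - n + 1):
        for i in index.get(r_norm[j:j + n], []):
            matches.append((s_off[i], r_off[j]))
    return matches
```

        <p>Because matching is exact at the character level after normalization, a single substituted word breaks every window that crosses it, which is why this baseline struggles on paraphrased passages.</p>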
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results and Analysis</title>
        <p>Table 1 reports results on the validation subset. Our approach raises the micro_F1 score from 0.1081
to 0.2158 and the macro_F1 score from 0.0775 to 0.1890. Granularity (the average number of detected
segments per true case) decreases from 2.15 to 1.39, indicating less fragmented detections. Table 2 compares three filter configurations:
• TF-IDF only: apply only the TF-IDF cosine-similarity threshold (threshold_tfidf=0.7), no Jaccard.
• Jaccard only: skip TF-IDF (threshold_tfidf=0), apply only 3-gram Jaccard.
• TF-IDF and Jaccard: the default pipeline combining both filters.</p>
        <p>As Table 2 shows, adding the 3-gram Jaccard filter to TF-IDF improves overall F1
performance compared to using TF-IDF alone. Although both filters rely on token overlap, Jaccard’s emphasis
on exact contiguous 3-gram matches helps eliminate noisy or boilerplate pairs that TF-IDF’s cosine
similarity still admits. This complementary effect validates the benefit of cascading the two filters
before invoking the Transformer classifier.</p>
        <p>Table 3 reports per-obfuscation performance on the validation subset (per-obfuscation breakdown computed with the script at https://github.com/wolike666/evaluate_by_obfuscation-for-PAN25). As obfuscation strength increases from “simple” to “hard”, we observe a clear and steady decline in both recall and F1, with precision also dropping across the board. For “simple” cases, our pipeline achieves its best performance (micro_F1 = 0.4674), whereas medium and hard obfuscations yield lower micro_F1 scores of 0.2517 and 0.0874, respectively. The decreasing precision, from 0.5728 in the “simple” category to just 0.0549 for “hard”, shows that the model’s positive predictions become increasingly unreliable as obfuscation strengthens, and the accompanying drop in recall shows that it also misses more of the heavily obfuscated passages. Granularity remains close to 1.43 in all categories, showing that detections tend to form single, contiguous spans rather than fragmented segments. These findings confirm that our two-stage IR filtering plus BERT classification excels at catching lightly obfuscated reuse, but further enhancements, such as more targeted filtering thresholds or ensemble modeling, are needed to boost recall on the most challenging cases.</p>
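        <p>As a consistency check, the overall validation scores reported in Table 1 appear to follow PAN's granularity-adjusted combination of precision and recall (a plagdet-style score: harmonic-mean F1 divided by log2(1 + granularity)). Under that assumption, which is an inference from the numbers rather than something the toolkit output states, the headline figures can be reproduced:</p>

```python
import math

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def combined_score(precision, recall, granularity):
    """PAN plagdet-style score: F1 discounted by detection granularity.
    Assumption: the reported micro_F1/macro_F1 include this discount."""
    return f1(precision, recall) / math.log2(1 + granularity)

# Reported micro values: precision 0.5819, recall 0.1767, granularity 1.39
micro = combined_score(0.5819, 0.1767, 1.39)  # close to the reported 0.2158
# Reported macro values: precision 0.5642, recall 0.1503
macro = combined_score(0.5642, 0.1503, 1.39)  # close to the reported 0.1890
```

        <p>The plain harmonic mean of the reported micro precision and recall would be about 0.271; dividing by log2(1 + 1.39) ≈ 1.257 recovers the reported 0.2158, which is why reducing granularity directly improves the final score.</p>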
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>We presented a two-stage pipeline for generative plagiarism detection that combines fast IR-based
filtering (TF-IDF and Jaccard) with a fine-tuned BERT classifier. By first pruning the vast majority of
sentence pairs with simple filters and then applying BERT only to the remaining candidates, our method
achieves a strong balance between throughput and accuracy. On the PAN 2025 validation set, this
hybrid approach more than doubles the micro_F1 compared to the IR-only baseline while reducing false
positives and keeping fragmentation low. Although detection of heavily obfuscated passages remains
challenging, our framework provides a solid foundation for further improvements, such as enhanced
filtering or ensemble models, to boost recall on the hardest cases. To advance plagiarism detection
systems, future studies should prioritize the refinement of cross-lingual detection via knowledge graphs
and multilingual embeddings [10]. Meanwhile, hybrid architectures fusing conventional rule-based
methods with AI-driven algorithms could offer a scalable path to greater detection efficiency
and adaptability.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Social Science Foundation of China (24BYY080).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepSeek for grammar
and spelling checks, paraphrasing and rewording, and translation assistance. After using these
tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.</p>
      <p>[3] A. Amirzhanov, C. Turan, A. Makhmutova, Plagiarism types and detection methods: a systematic survey of algorithms in text analysis, Frontiers in Computer Science 7 (2025). URL: https://www.frontiersin.org/articles/10.3389/fcomp.2025.1504725/full. doi:10.3389/fcomp.2025.1504725.</p>
      <p>[4] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</p>
      <p>[5] D. M. Setu, T. Islam, M. Erfan, S. K. Dey, M. R. Al Asif, M. Samsuddoha, A comprehensive strategy for identifying plagiarism in academic submissions, Journal of Umm Al-Qura University for Engineering and Architecture 2 (2025) 310–325. doi:10.1007/s43995-025-00108-1.</p>
      <p>[6] M. Khadhraoui, H. Bellaaj, M. B. Ammar, H. Hamam, M. Jmaiel, Survey of BERT-base models for scientific text classification: COVID-19 case study, Applied Sciences 12 (2022) 2891. URL: https://www.mdpi.com/2076-3417/12/6/2891. doi:10.3390/app12062891.</p>
      <p>[7] M. Wang, F. Hu, The application of NLTK library for Python natural language processing in corpus research, Theory and Practice in Language Studies 11 (2021) 1041–1049. doi:10.17507/tpls.1109.09.</p>
      <p>[8] A. Greiner-Petter, M. Fröbe, J. P. Wahle, T. Ruas, B. Gipp, A. Aizawa, M. Potthast, Overview of the Generative Plagiarism Detection Task at PAN 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</p>
      <p>[9] M. Potthast, M. Hagen, A. Beyer, M. Busse, M. Tippmann, P. Rosso, B. Stein, Overview of the 6th international competition on plagiarism detection, in: Working Notes for CLEF 2014, 2014, pp. 845–872.</p>
      <p>[10] A. Amirzhanov, C. Turan, A. Makhmutova, Plagiarism types and detection methods: a systematic survey of algorithms in text analysis, Frontiers in Computer Science 7 (2025). URL: https://www.frontiersin.org/articles/10.3389/fcomp.2025.1504725. doi:10.3389/fcomp.2025.1504725.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sajid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanaullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fuzail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shuhidan</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of text-based plagiarism detection techniques</article-title>
          ,
          <source>PLOS ONE 20</source>
          (
          <year>2025</year>
          )
          <article-title>e0319551</article-title>
          . doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0319551</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>