<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Semantic Similarity and Overlap Ratio Optimized for Generated Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Derui Mo</string-name>
          <email>moderui44@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huaiyu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaojun Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <email>kongleilei@fosu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes the method submitted to the PAN 2025 Generated Plagiarism Detection task. The task aims to identify unauthorized content reuse involving complex rewriting in texts. Addressing the limitations of existing baseline models in continuous fragment merging efficiency, interference from stop words, and long-fragment semantic verification capability, this paper proposes an improved multi-feature fusion plagiarism detection algorithm. The core of the algorithm consists of three modules: (1) sentence overlap calculation based on word frequency statistics; (2) a continuous fragment merging strategy based on an adjacency matrix; (3) a result verification mechanism with multi-threshold constraints. By extracting deep semantic features of texts using the pre-trained GloVe.6B.300d model and combining them with traditional statistical features such as word frequency overlap and adjacency relationships, a multi-dimensional detection framework is constructed, effectively enhancing the identification of generated plagiarism, especially texts deeply rewritten by Large Language Models (LLMs).</p>
      </abstract>
      <kwd-group>
        <kwd>Generated Plagiarism Detection</kwd>
        <kwd>GloVe</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1. Text Preprocessing</title>
      </sec>
      <sec id="sec-1-2">
        <title>2. Semantic Similarity Feature Extraction based on GloVe</title>
      </sec>
      <sec id="sec-1-3">
        <title>3. Integrating the Semantic Similarity FeatureFusion and the Traditional Features</title>
      </sec>
      <sec id="sec-1-4">
        <title>4. Fragment Detection and Dynamic Merging</title>
      </sec>
      <sec id="sec-1-5">
        <title>5. Result Verification and Calibration</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Our method</title>
      <p>
        This section clarifies the objective of the PAN 2025 Generated Plagiarism Detection task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: for a given
suspicious document set Dsusp, the system must retrieve and compare against the source document set
Dsrc to identify all continuous, maximal-length text fragments within Dsusp that originate from Dsrc
(including content deeply rewritten by LLMs). The final performance metric is Plagdet (the
harmonic mean of Recall and Precision), calculated from each detected fragment's starting
offset and length within the documents.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Text Preprocessing</title>
        <p>Preprocessing aims to convert raw text into a standardized format for feature extraction. Given a
document, we preprocess the text as follows.</p>
        <p>Text Basic Cleaning: Uniform encoding (UTF-8), removal of HTML tags, punctuation (retaining
inter-sentence separators) and consecutive whitespace; execution of lowercase conversion to eliminate
formatting differences.</p>
        <p>Tokenization and Lemmatization: Use of spaCy English tokenizer for fine-grained word segmentation;
application of WordNet lemmatizer to convert vocabulary to base forms (e.g., "running" → "run"),
improving feature consistency.</p>
        <p>Stop Word Filtering: Removal of function words (e.g., "the", "and") based on the NLTK stop word list,
retaining content words (nouns, verbs, adjectives) with substantial semantics.</p>
        <p>Sentence-level Structuring: Utilization of the spaCy sentence boundary detection model to segment text
into sentence units. Each unit contains: (a) the original token sequence; (b) position metadata (sentence
number, starting character offset); (c) statistical features (word count, character count, part-of-speech
distribution, i.e., the frequency distribution of POS tags such as nouns, verbs, and adjectives
in a bag-of-tags manner rather than syntactic tree structures).</p>
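        <p>As an illustration, the steps above can be sketched in Python. This is a minimal sketch only: it uses a toy stop-word list and naive regex tokenization in place of the spaCy/NLTK pipeline described above, and omits lemmatization.</p>

```python
import re

# Toy stop-word list standing in for NLTK's full list (an assumption).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "are"}

def preprocess(text):
    """Clean, tokenize, stop-word-filter, and structure a document into
    sentence units with position metadata and statistical features."""
    sentences = []
    offset = 0
    for raw in re.split(r"(?<=[.!?])\s+", text.strip()):
        start = text.find(raw, offset)                # starting character offset
        tokens = re.findall(r"[a-z]+", raw.lower())   # cleaning + lowercasing
        content = [t for t in tokens if t not in STOP_WORDS]  # stop-word filtering
        sentences.append({
            "tokens": content,
            "offset": start,
            "word_count": len(tokens),
            "char_count": len(raw),
        })
        offset = start + len(raw)
    return sentences

units = preprocess("The cat runs fast. A dog is in the park.")
```
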
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Semantic Similarity Feature Extraction based on GloVe</title>
        <p>This module utilizes a pre-trained word vector model, GloVe, to capture deep semantic associations in
text.</p>
        <p>Given a preprocessed sentence, we obtain its sentence-level semantic vector S_vec by average
pooling: we retrieve the GloVe embedding for each word in the sentence and average these
embeddings to yield the sentence vector. This method effectively preserves the overall semantic tendency
of the sentence. The details are shown in Eq. (1).</p>
        <p>S_vec = (1/n) · Σ_{i=1}^{n} w_{i,vec}    (1)</p>
        <p>where w_{i,vec} is the GloVe vector of the i-th word, and n is the number of words in the sentence.</p>
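        <p>Eq. (1) amounts to element-wise averaging of the word vectors. A minimal pure-Python sketch (using a zero vector for OOV words here, whereas the paper uses random initialization):</p>

```python
def sentence_vector(words, embeddings, dim=300):
    """Eq. (1): average-pool word vectors into a sentence vector.
    `embeddings` maps word -> list of floats; OOV words fall back to a
    zero vector in this sketch."""
    vecs = [embeddings.get(w, [0.0] * dim) for w in words]
    n = len(vecs)
    # Average each dimension across all word vectors.
    return [sum(v[i] for v in vecs) / n for i in range(dim)]

emb = {"cat": [1.0, 0.0], "runs": [0.0, 1.0]}
svec = sentence_vector(["cat", "runs"], emb, dim=2)
```
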
        <p>
          For GloVe, we use the publicly available GloVe.6B.300d model (trained on a 6-billion-token
corpus of Wikipedia and Gigaword text), whose 300-dimensional word vectors contain rich syntactic and semantic information
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to encode the words in a sentence. Out-of-vocabulary (OOV) words are handled using
randomly initialized vectors.
        </p>
        <p>Furthermore, we use Cosine Similarity to measure the semantic association between sentences:</p>
        <p>Sim_glove = (S_{1,vec} · S_{2,vec}) / (‖S_{1,vec}‖ · ‖S_{2,vec}‖)    (2)</p>
        <p>Its value range is [−1, 1], with higher values indicating greater semantic similarity.</p>
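        <p>Eq. (2) can be computed directly on the pooled sentence vectors; a minimal sketch:</p>

```python
import math

def cosine_similarity(v1, v2):
    """Eq. (2): cosine similarity between two sentence vectors, in [-1, 1].
    Returns 0.0 for a zero-norm vector (a guard added in this sketch)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```
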
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Integrating the Semantic Similarity Feature and the Traditional Features</title>
        <p>To balance semantic understanding and exact matching, we fuse GloVe semantic features with traditional
statistical features.</p>
        <p>We employ the following two sets of traditional text similarity features:
• Word Frequency Overlap Features:
– Base Overlap (Sim_base): the number of common words in the two sentences divided by the number of
words in the shorter sentence (used in traditional methods; not employed by our algorithm);
– Stop Word Filtered Overlap (Sim_filtered): the number of common content words after removing
stop words divided by the number of content words in the shorter sentence.
• Adjacency Matrix Feature: construction of an inter-sentence co-occurrence relationship matrix
to capture potential paragraph structural similarity (e.g., sequential consistency of consecutive
sentences).</p>
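        <p>The stop-word-filtered overlap can be sketched as follows; the stop-word list here is a toy stand-in for NLTK's, and the function name is illustrative:</p>

```python
# Toy stop-word list; the paper uses NLTK's full list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is"}

def filtered_overlap(sent1, sent2):
    """Stop-word-filtered overlap: common content words divided by the
    content-word count of the shorter sentence."""
    c1 = {w for w in sent1 if w not in STOP_WORDS}
    c2 = {w for w in sent2 if w not in STOP_WORDS}
    shorter = min(len(c1), len(c2))
    return len(c1 & c2) / shorter if shorter else 0.0
```
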
        <p>Then we integrate the semantic similarity feature and the traditional features by linear weighted
fusion:</p>
        <p>Sim_combine = α · Sim_glove + (1 − α) · Sim_filtered    (3)</p>
        <p>where α is the semantic feature weight. This fusion strategy embodies the detection logic of "semantics
first, exact matching as a fallback".</p>
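        <p>Eq. (3) is a one-line computation; the sketch below defaults to the weight α = 0.5 found by the grid search in Section 3.2:</p>

```python
def combine_similarity(sim_glove, sim_filtered, alpha=0.5):
    """Eq. (3): linear weighted fusion of semantic and lexical similarity.
    alpha=0.5 matches the value determined experimentally in Sec. 3.2."""
    return alpha * sim_glove + (1 - alpha) * sim_filtered
```
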
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Fragment Detection and Dynamic Merging</title>
        <p>Based on semantic similarity graph adjacency analysis, we identify and merge consecutive similar
sentences.</p>
        <p>First, we construct an adjacency relationship graph for mining semantically similar fragments.
We represent sentences as nodes and Sim_combine values as edge weights to form a graph G = (V, E). An edge (i, j)
exists if Sim_combine(i, j) ≥ θ1, where θ1 is a preset threshold.</p>
        <p>We use the Breadth-First Search (BFS) algorithm to traverse graph G, identifying connected
subgraphs as candidate plagiarism fragments. Compared to edit distance-based greedy merging, BFS
can effectively discover long-distance similar fragments spanning paragraphs.</p>
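        <p>The graph construction and BFS traversal can be sketched as follows; `sim(i, j)` stands in for Sim_combine, and the default θ1 = 0.46 is taken from Section 3.2:</p>

```python
from collections import deque

def candidate_fragments(n, sim, theta1=0.46):
    """Build the similarity graph over n sentences and return its connected
    components (candidate fragments) via BFS. An edge (i, j) exists when
    sim(i, j) >= theta1."""
    adj = {i: [j for j in range(n) if j != i and sim(i, j) >= theta1]
           for i in range(n)}
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(sorted(comp))
    return components
```
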
        <p>Then we apply dynamic merging rules to obtain the candidate fragments:
• Position Proximity: the sentence order difference within the fragment is at most θ2 sentences in the
original text (θ2 is a preset threshold; this prevents erroneous connections across sections);
• Semantic Coherence: the average Sim_combine of sentences within the fragment is at least θ3 (θ3 is a preset
threshold, filtering fragments with semantic breaks);
• Minimum Length Constraint: the merged fragment word count is at least L_min (reducing noise interference).</p>
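        <p>The three merging rules translate into a simple predicate; the default threshold values follow Section 3.2, and the function name is illustrative:</p>

```python
def passes_merge_rules(indices, avg_sim, word_count,
                       theta2=9, theta3=0.5, min_words=16):
    """Dynamic merging rules: position proximity (consecutive sentence
    gaps <= theta2), semantic coherence (average Sim_combine >= theta3),
    and minimum length (word count >= min_words)."""
    idx = sorted(indices)
    gaps_ok = all(b - a <= theta2 for a, b in zip(idx, idx[1:]))
    return gaps_ok and avg_sim >= theta3 and word_count >= min_words
```
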
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Result Verification and Calibration</title>
        <p>We employ a Result Verification and Calibration method through multi-layer threshold filtering and
boundary optimization to enhance detection reliability.</p>
        <p>We first apply strict conditions to initially detected fragments:
• Character Length: fragment length ≥ Char_min characters (avoiding false positives on short
fragments);
• Word Overlap Ratio: OverlapRatio = (Number of Co-occurring Words) / (Number of Words in the
Suspicious Fragment) ≥ θ4 (θ4 is a preset threshold);
• Comprehensive Similarity: Sim_combine ≥ θ5 (θ5 is a preset threshold).</p>
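        <p>The verification conditions can likewise be expressed as a predicate, with default threshold values taken from Section 3.2 and an illustrative function name:</p>

```python
def passes_verification(char_len, overlap_words, susp_words, sim_combine,
                        min_chars=190, theta4=0.36, theta5=0.47):
    """Multi-threshold verification: character length, word overlap ratio
    relative to the suspicious fragment, and combined similarity."""
    overlap_ratio = overlap_words / susp_words if susp_words else 0.0
    return (char_len >= min_chars
            and overlap_ratio >= theta4
            and sim_combine >= theta5)
```
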
        <p>Then we perform a local expansion search at the fragment start and end positions, calculate the Sim_combine of
the expanded fragment, and select the local maximum point as the final boundary, mitigating semantic
break issues caused by improper sentence segmentation.</p>
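        <p>A one-sided sketch of this boundary calibration (the paper searches both the start and end positions; `score(pos)` stands in for the Sim_combine of the expanded fragment, and the search width is an assumption):</p>

```python
def calibrate_boundary(end, max_extend, score):
    """Expand a fragment's end position by up to `max_extend` units and
    keep the position with the locally maximal similarity score."""
    best_pos, best_score = end, score(end)
    for pos in range(end + 1, end + max_extend + 1):
        s = score(pos)
        if s > best_score:
            best_pos, best_score = pos, s
    return best_pos
```
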
        <p>
          We employ the Non-Maximum Suppression (NMS) algorithm [8] to retain the candidate with the
highest Sim_combine among overlapping fragments. Unlike the traditional hard-threshold NMS used in object
detection [8], we adopt a soft-decay strategy inspired by computer vision's Soft-NMS [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to preserve
semantically similar LLM-generated fragments. Specifically, fragments with 0.65 &lt; Sim_combine &lt; 0.75
and overlap &lt; 50% are down-weighted (via Gaussian decay) rather than discarded, which is critical for
retaining long-distance semantic associations in rewritten text [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This approach aligns with the
"feature fusion for irregular fragment matching" principle in image restoration tasks [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], demonstrating
cross-domain applicability.
        </p>
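        <p>The soft-decay suppression described above can be sketched as follows. Fragments are (start, end, score) tuples; the overlap measure (intersection over the shorter fragment) and the Gaussian sigma are assumptions of this sketch, while the 0.65/0.75 score band and 50% overlap bound come from the text:</p>

```python
import math

def soft_nms(fragments, decay_low=0.65, decay_high=0.75,
             max_overlap=0.5, sigma=0.5):
    """Soft-NMS over character-span fragments: keep the top-scoring fragment;
    overlapping lower-scoring fragments inside the (decay_low, decay_high)
    band with overlap below max_overlap are Gaussian-decayed, others with
    any overlap are suppressed outright."""
    def overlap(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / max(1, min(a[1] - a[0], b[1] - b[0]))

    kept = []
    for start, end, score in sorted(fragments, key=lambda f: -f[2]):
        for k in kept:
            ov = overlap((start, end), k)
            if ov > 0:
                if decay_low < score < decay_high and ov < max_overlap:
                    score *= math.exp(-(ov ** 2) / sigma)  # soft decay
                else:
                    score = 0.0                            # hard suppression
        if score > 0:
            kept.append((start, end, score))
    return kept
```
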
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results and Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>The dataset is available via Zenodo. The train and validation corpora each contain two kinds of data:
(1) the text data and (2) the annotation data (postfix _truths).</p>
        <p>Text Data: contains a pairs file which lists all pairs of suspicious documents (in the folder susp) and source
documents (in the folder src) to be compared.</p>
        <p>Annotation Data: contains an XML file for each pair in the pairs file, providing information about the
locations and sources of reused texts.</p>
        <p>The annotation data contains the following information that should be used for training:
&lt;document reference="suspicious-documentXYZ.txt"&gt;
&lt;feature
name="plagiarism"
this_offset="5"
this_length="1000"
source_reference="source-documentABC.txt"
source_offset="100"
source_length="1000"
...
/&gt;
&lt;feature
name="altered"
this_offset="5"
this_length="1000"
source_reference="source-documentABC.txt"
...
/&gt;
...
&lt;/document&gt;</p>
        <p>The feature specifies an aligned passage of text between suspicious-documentXYZ.txt and
source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in
the suspicious document and at character offset 100 in the source document. The other attributes are
used to allow for a more detailed analysis of the results and can be ignored for training.</p>
        <p>The altered feature specifies the location of paraphrased text that was not reused (no plagiarism).
This makes it possible to distinguish between genuine LLM-generated texts and reused text. For the evaluation,
only the plagiarism features need to be predicted.</p>
        <p>For each pair suspicious-documentXYZ.txt and source-documentABC.txt in the pairs file,
your plagiarism detector shall output an XML file which specifies the location of the plagiarism cases
detected within. The name of the feature should be detected-plagiarism and specify the offsets
and lengths in the suspicious and the source document. No other attributes are evaluated. For example:
&lt;document reference="suspicious-documentXYZ.txt"&gt;
&lt;feature
name="detected-plagiarism"
this_offset="5"
this_length="1000"
source_reference="source-documentABC.txt"
source_offset="100"
source_length="1000"
/&gt;
&lt;feature ... /&gt;
...
&lt;/document&gt;</p>
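        <p>Producing this output format is straightforward with a standard XML library; a minimal sketch (the helper name is illustrative):</p>

```python
import xml.etree.ElementTree as ET

def write_detections(susp_name, detections):
    """Serialize detected cases into the document/feature XML format
    expected by the PAN evaluation. `detections` holds dicts carrying
    the offset and length attributes described above."""
    doc = ET.Element("document", reference=susp_name)
    for d in detections:
        ET.SubElement(doc, "feature",
                      name="detected-plagiarism",
                      this_offset=str(d["this_offset"]),
                      this_length=str(d["this_length"]),
                      source_reference=d["source_reference"],
                      source_offset=str(d["source_offset"]),
                      source_length=str(d["source_length"]))
    return ET.tostring(doc, encoding="unicode")

xml_out = write_detections("suspicious-documentXYZ.txt",
                           [{"this_offset": 5, "this_length": 1000,
                             "source_reference": "source-documentABC.txt",
                             "source_offset": 100, "source_length": 1000}])
```
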
        <p>
          For evaluation, the offset and length attributes of detected-plagiarism features will be compared
against the plagiarism features in the annotation data. No other information will be evaluated [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental setting</title>
        <p>For the linear weighted fusion parameter α in Eq. (3), we optimized via grid search combined with 5-fold
cross-validation; it was experimentally determined to be 0.5.</p>
        <p>θ1 was experimentally determined to be 0.46.
θ2 was experimentally determined to be 9.
θ3 was experimentally determined to be 0.5.
L_min was experimentally determined to be 16.
Char_min was experimentally determined to be 190.
θ4 was experimentally determined to be 0.36.
θ5 was experimentally determined to be 0.47.</p>
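        <p>The grid search over α can be sketched as follows; `score_fn` stands in for the 5-fold cross-validated Plagdet evaluation, and the grid step of 0.1 is an assumption:</p>

```python
def grid_search_alpha(score_fn, grid=None):
    """Pick the fusion weight alpha that maximizes a validation score.
    `score_fn(alpha)` is expected to return the cross-validated metric."""
    grid = grid or [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return max(grid, key=score_fn)

# With a toy score peaking at 0.5, the search recovers alpha = 0.5.
best = grid_search_alpha(lambda a: -(a - 0.5) ** 2)
```
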
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental results and analysis</title>
        <p>
          We evaluated the proposed algorithm (using GloVe.6B.300d) and the traditional baseline algorithm
(pan12-baseline) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] on the official PAN 2025 dataset using the TIRA platform. Evaluation metrics follow the PAN standard [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
including micro/macro average Plagdet, Recall, Precision, Granularity and Runtime. The experimental
results are as follows:
        </p>
        <p>
          Experimental results indicate that the proposed algorithm demonstrates significant superiority over
the baseline method [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] across multiple evaluation metrics. On the Spot-Check dataset (50 document
pairs), the algorithm achieved a Micro-PlagDet of 0.598 (+336% over the baseline's 0.137)
and a Macro-PlagDet of 0.541 (+452%), as shown in Tables 1 and 2. The Micro-Recall of 0.777 is roughly
five times the baseline's 0.154, while the Granularity of 1.000 confirms
precise identification of contiguous fragments compared to the baseline's 2.337.
        </p>
        <p>On the larger Validation dataset (7,976 pairs), the algorithm sustained its performance with a Micro-PlagDet of
0.584 (+441%) and a Macro-PlagDet of 0.541 (+603%), as reported in Tables 3 and 4. The Macro-Recall of
0.668 reflects consistent performance across diverse document pairs, while the 86% reduction in runtime
(220.93 s vs. 1609.12 s) demonstrates practical scalability for real-world applications.</p>
        <p>Although the Micro-Precision slightly decreased (0.486 vs. the baseline's 0.554), the substantial Recall
improvement drove a net gain in PlagDet. This trade-off is intentional and beneficial for detecting
LLM-generated plagiarism, where deep semantic rewriting often reduces lexical overlap. The
algorithm's superiority in Granularity (1.000 across datasets) confirms its ability to identify maximal-length
fragments, aligning with the PAN task's emphasis on contiguous text reuse detection. These results
validate the effectiveness of integrating GloVe semantic features with adjacency matrix-based merging
for sophisticated plagiarism identification.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper proposes an improved multi-feature fusion plagiarism detection algorithm to address the
limitations of existing baselines in continuous fragment merging efficiency, stop-word interference,
and long-fragment semantic verification for generated plagiarism detection. The algorithm integrates
GloVe-based semantic features with traditional statistical metrics (e.g., word frequency overlap and
adjacency matrix analysis) through a linear weighting strategy, constructing a multi-dimensional
framework to identify fragments deeply rewritten by LLMs. Experimental results on the PAN 2025 datasets
demonstrate significant superiority: the method achieves Micro-PlagDet scores of 0.598 and 0.584 on
the Spot-Check and Validation datasets, respectively, outperforming the baseline by 336% and 441%. These findings
validate the algorithm's capability to enhance detection accuracy for LLM-generated plagiarism through
semantic-syntactic feature fusion.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Social Science Foundation of China (Grant No. 22BTQ101).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Deepseek-R1 to draft content and
translate text. Further, the authors used ChatGPT-4 and Deepseek-R1 for grammar and
spelling checking. After using these tools/services, the authors reviewed and edited the content as needed
and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Potthast, M., Gollub, T., Hagen, M., et al. (2012). Overview of the 4th International Competition on Plagiarism Detection. In CLEF 2012 Evaluation Labs and Workshop (Online Working Notes/Labs/Workshop).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          (pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ). Doha, Qatar: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Bevendorff, J., Dementieva, D., Fröbe, M., Gipp, B., Greiner-Petter, A., Karlgren, J., Mayerl, M., Nakov, P., Panchenko, A., Potthast, M., Shelmanov, A., Stamatatos, E., Stein, B., Wang, Y., Wiegmann, M., &amp; Zangerle, E. (2025). Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), edited by J. Carrillo-de-Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, and N. Ferro. Springer, Berlin Heidelberg New York, September 2025, Madrid, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <article-title>Maik and Wiegmann, Matti and Kolyada, Nikolay and Grahm, Bastian and Elstner, Theresa and Loebe, Frank and Hagen, Matthias and Stein, Benno and Potthast, Martin, Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          ,
          <source>in Advances in Information Retrieval. 45th European Conference on IR Research (ECIR</source>
          <year>2023</year>
          ), Springer, Berlin Heidelberg New York,
          <year>April 2023</year>
          , Dublin, Irland, pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <article-title>André and Fröbe, Maik and Wahle, Jan Philip and Ruas, Terry and Gipp, Bela and Aizawa, Akiko and Potthast, Martin, Overview of the Generative Plagiarism Detection Task at PAN 2025</article-title>
          , in Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum, edited by Guglielmo Faggioli and Nicola Ferro and Paolo Rosso and Damiano Spina, CEUR-WS</article-title>
          .org,
          <year>September 2025</year>
          , Vienna, Austria, in CEUR Workshop Proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Bodla, N., Singh, B., Chellappa, R., &amp; Davis, L. S. (2017). Soft-NMS - Improving Object Detection With One Line of Code. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (https://doi.org/10.1109/CVPR.2017.364)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Zhou, R., Xia, D., Zhang, Y., Pang, H., Yang, X., &amp; Li, C. (2023). PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments. arXiv preprint arXiv:2312.08704. (https://arxiv.org/abs/2312.08704)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Vedoveli, H. (2023). NMS Unveiled: Elevating Object Detection Accuracy. Medium. (https://medium.com/@henriquevedoveli/nms-unveiled-elevating-object-detection-accuracye40b8c690f8f)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>