<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overview of the Plagiarism Detection Task at PAN 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>André Greiner-Petter</string-name>
          <email>greinerpetter@gipplab.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maik Fröbe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Philip Wahle</string-name>
          <email>wahle@uni-goettingen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Terry Ruas</string-name>
          <email>ruas@gipplab.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bela Gipp</string-name>
          <email>gipp@gipplab.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akiko Aizawa</string-name>
          <email>aizawa@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich-Schiller-Universität Jena</institution>
          ,
          <addr-line>Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Georg-August-Universität</institution>
          ,
          <addr-line>Göttingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ScaDS.AI</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Kassel</institution>
          ,
          <addr-line>Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>hessian.ai</institution>
          ,
          <addr-line>Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1178</volume>
      <fpage>12848</fpage>
      <lpage>12856</lpage>
      <abstract>
        <p>The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning it with its respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to assess the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches, as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN</kwd>
        <kwd>Plagiarism Detection</kwd>
        <kwd>Generative AI Detection</kwd>
        <kwd>Semantic Similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Generative Plagiarism Detection</title>
      <p>
        Plagiarism detection has a long-standing tradition at PAN, with the main tasks running from 2009 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
to 2015 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Over time, the focus gradually shifted toward specialized intrinsic tasks, such as the
still active authorship analysis challenges. However, the recent breakthrough of generative artificial
intelligence (AI) has dramatically transformed the landscape of plagiarism detection. For the first time
in history, large language models (LLMs) can serve as so-called automatic plagiarists [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. At the same
time, major scientific venues adjust their submission policies to allow (at least partially) AI-generated
content [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. The annual AAAI conference recently announced that it will deploy an AI-assisted
peer review assessment system for 2026. This shift inspired us to revive a classic plagiarism detection
task for 2025, this time centered on automatically generated plagiarism using LLMs.
      </p>
      <sec id="sec-1-1">
        <p>
          For the 2025 edition, we adhered to the well-established foundations of the 2015 plagiarism detection task, particularly in evaluation methodology and dataset formatting [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Following the same formats will later allow us to evaluate new submissions on the older datasets to investigate the robustness of new approaches. Likewise, this format allows us to re-run the old baselines on the new dataset to judge the overall difficulty of the new data compared to the previous dataset. The participants receive an annotated synthetic dataset of document pairs (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>), where <italic>d</italic><sub>src</sub> is a source document and <italic>d</italic><sub>plg</sub> is the plagiarized document in which some paragraphs are replaced with paraphrased versions <italic>p</italic>′ of paragraphs <italic>p</italic> in <italic>d</italic><sub>src</sub> using an LLM, without citation. This setup closely mirrors the 2015 PAN text alignment task (http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/plagiarism-detection.html).
        </p>
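        <p>Adhering to the 2015 formats means reusing the PAN text alignment annotation scheme, in which each plagiarism case aligns a character span in the suspicious document with a span in its source. A sketch of such an annotation (attribute names follow the PAN corpus convention; all concrete values here are invented):</p>

```xml
<!-- Illustrative only: the offsets, lengths, and file names are made up. -->
<document reference="suspicious-document00001.txt">
  <feature name="plagiarism" type="artificial" obfuscation="llm-paraphrase"
           this_offset="1280" this_length="2503"
           source_reference="source-document00042.txt"
           source_offset="1338" source_length="2439" />
</document>
```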
      </sec>
      <sec id="sec-1-2">
        <title>The 2025 PAN task has received four submissions in total, outperforming all our baselines. Since all</title>
        <p>of these submissions (and our baselines) follow a similar approach of aligning text fragments based
on their semantic similarity in terms of vector representations, we set up a fourth baseline using the</p>
        <sec id="sec-1-2-1">
          <title>Linq-Embed-Mistral model [7]3. Linq outperforms all submissions, indicating that specialized models</title>
          <p>
            for the text retrieval task might suit the task for plagiarism detection particularly well. Note that this
summary is an extended and in-depth version of the Overview of PAN 2025 paper [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <sec id="sec-2-1">
        <p>
          To the best of our knowledge, no large-scale dataset with automatically generated cases of textual reuse exists. Some studies suggest that LLMs can disguise plagiarism by paraphrasing the original source [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. Additionally, LLMs have already been successfully used to replace human paraphrasing
at scale [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For this task revival, we aim to create a novel dataset with realistic cases of textual reuse
disguised via automated paraphrasing. To make this dataset large enough to enable possible fine-tuning
approaches, we automated the full dataset creation pipeline.
        </p>
      </sec>
      <sec id="sec-2-2">
        <p>For this year’s iteration, we focus on the text alignment task setup, i.e., we provide participants with pairs of source and plagiarized documents (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>), and the participants are asked to identify and align the LLM-generated, plagiarized paragraphs <italic>p</italic>′ in <italic>d</italic><sub>plg</sub> with their respective source paragraphs <italic>p</italic> in <italic>d</italic><sub>src</sub>.
2.1. Data Creation</p>
        <sec id="sec-2-2-1">
          <p>We use arXiv as the source corpus for our novel dataset. Specifically, we use the ar5iv release (https://ar5iv.labs.arxiv.org/) of arXiv from 2025. This dataset contains all arXiv documents in a structured HTML5 format, which allows us to avoid most parsing problems when identifying paragraph splits, author identifications, citations, and more.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>
          We sample a subset of 100,000 documents with an even distribution across all arXiv categories (also known as archives) to ensure a wide variety of topics. These 100,000 documents serve as our document candidates. Afterwards, we use the SPECTER model [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to create document embeddings and identify the semantically most similar document (in terms of cosine similarity) to each candidate. This gives us 100,000 pairs (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>).
        </p>
        <p>For each document pair (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>), we first select a random number of paragraphs in <italic>d</italic><sub>plg</sub> that should be replaced with paragraphs from <italic>d</italic><sub>src</sub>. Additionally, we add paragraphs that cite <italic>d</italic><sub>src</sub> to the pool, as otherwise the document could contain genuine, referenced material from <italic>d</italic><sub>src</sub>. For each selected paragraph, we then find the most semantically similar paragraphs in <italic>d</italic><sub>src</sub> based on three criteria. The alignment score is computed as a weighted aggregate: 50% semantic similarity via SPECTER sentence embeddings, 40% lexical similarity via TF-IDF vector similarity, and 10% section title similarity, again using SPECTER embeddings. Including the similarity of section titles helps discourage the alignment of paragraphs from unrelated sections of the documents and preserves a more coherent document structure within <italic>d</italic><sub>plg</sub>. For each pair (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>), we select one of three LLMs: LLaMA-3 [13] (3.3 70B Instruct), DeepSeek-R1 [14] (Distill-Qwen-32B), or Mistral [15] (7B Instruct v0.3), and replace all selected paragraphs <italic>p</italic> in <italic>d</italic><sub>plg</sub> with LLM-paraphrased versions <italic>p</italic>′ derived from their aligned paragraphs in <italic>d</italic><sub>src</sub>.
2.2. Categorization
To support a more detailed analysis of system performance, we establish several categories of document pairs, which later allow us to slice the dataset and investigate performance (e.g., recall) on specific subsets of the data. First, 5% of the 100,000 pairs remain unchanged, i.e., both <italic>d</italic><sub>src</sub> and <italic>d</italic><sub>plg</sub> are original arXiv documents without textual reuse. An additional 20% of pairs do not contain any plagiarism, but some</p>
      </sec>
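        <p>The weighted aggregate described above can be sketched as follows; the weights come from the text, while the function names and the pre-computed similarity inputs are illustrative:</p>

```python
def alignment_score(sem_sim, lex_sim, title_sim):
    """Weighted aggregate used to align paragraph pairs: 50% SPECTER semantic
    similarity, 40% TF-IDF lexical similarity, and 10% SPECTER similarity of
    the enclosing section titles (weights as stated in the text)."""
    return 0.5 * sem_sim + 0.4 * lex_sim + 0.1 * title_sim


def best_source_paragraph(candidates):
    """candidates: list of (paragraph_id, sem_sim, lex_sim, title_sim) tuples.
    Returns the id of the source paragraph with the highest aggregate score."""
    return max(candidates, key=lambda c: alignment_score(c[1], c[2], c[3]))[0]
```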
      <sec id="sec-2-4">
        <title>2http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/plagiarism-detection.html</title>
      </sec>
      <sec id="sec-2-5">
        <title>3in the following referred to as Linq</title>
      </sec>
      <sec id="sec-2-6">
        <p>paragraphs in <italic>d</italic><sub>plg</sub> have been paraphrased by an LLM independently of <italic>d</italic><sub>src</sub>. These examples are useful
for evaluating systems that aim to detect LLM-generated content rather than plagiarism specifically.</p>
      </sec>
      <sec id="sec-2-7">
        <p>We want to discourage such approaches, as the use of LLMs in modern research does not necessarily indicate academic misconduct or even plagiarism [16]. Those document pairs are called altered. The remaining 75% of document pairs are constructed as plagiarism pairs as described above. In about half of these plagiarized documents, we also add 10% altered paragraphs, so that plagiarized documents may also contain LLM-generated but otherwise genuine paragraphs.
2.2.1. Severity.</p>
      </sec>
      <sec id="sec-2-8">
        <p>We classify the severity of plagiarism in <italic>d</italic><sub>plg</sub> into three levels: low, medium, and high. These refer to the proportion of paragraphs in <italic>d</italic><sub>plg</sub> that are replaced with paraphrased versions from <italic>d</italic><sub>src</sub>. In 30% of the document pairs, the severity is low, with 20% to 40% of paragraphs replaced. In 40% of the pairs, severity is medium, with 40% to 60% replaced. The remaining 30% have high severity, where 70% to 100% of paragraphs in <italic>d</italic><sub>plg</sub> are substituted.
2.2.2. Paraphrasing Prompts.</p>
      </sec>
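        <p>The severity distribution above can be sketched as a simple weighted draw; the percentages and ranges are from the text, while the function itself is an assumption about how such sampling might be implemented:</p>

```python
import random

# Severity distribution from the text: 30% low (20-40% of paragraphs replaced),
# 40% medium (40-60%), 30% high (70-100%).
SEVERITY_LEVELS = {
    "low": (0.30, (0.20, 0.40)),
    "medium": (0.40, (0.40, 0.60)),
    "high": (0.30, (0.70, 1.00)),
}


def sample_severity(rng=random):
    """Draw a severity level and a replacement fraction for one document pair."""
    names = list(SEVERITY_LEVELS)
    weights = [SEVERITY_LEVELS[n][0] for n in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    lo, hi = SEVERITY_LEVELS[name][1]
    return name, rng.uniform(lo, hi)
```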
      <sec id="sec-2-9">
        <p>For paraphrasing, we use three prompt types: simple, default, and complex. While severity is defined at the document pair level, each pair of paragraphs within one document pair can use different types of prompts. For each pair, we follow a distribution of 60% simple prompts, 30% default prompts, and 10% complex prompts. The simple prompt instructs the LLM to paraphrase a given paragraph without additional constraints.</p>
        <p>Simple Paraphrasing Prompt
Paraphrase the given paragraph for a professional audience.</p>
      </sec>
      <sec id="sec-2-10">
        <p>We found that especially technical texts, like the ones we often find in scientific articles from arXiv, do not yield sufficient paraphrasing with the simple prompt. This is especially apparent when the texts contain mathematical formulae. To encourage the LLMs to generate more sophisticated paraphrasing, we use a different default prompt that calls for a complete reformulation rather than slight adjustments.</p>
        <p>Default Paraphrasing Prompt
Reformulate the given paragraph in a sophisticated manner while preserving its
meaning. Modify sentence structure, reword phrases, and incorporate elements of
general knowledge to ensure coherence. The less token overlap, the better.</p>
      </sec>
      <sec id="sec-2-11">
        <p>As the synthetic data is created by replacing paragraphs in an existing, genuine document, one could potentially identify incoherent logical steps from one paragraph to the next in order to spot replaced paragraphs. To make this a more realistic setup, we define a third type of prompt that takes the previous paragraph into account as context for the LLM, to generate slightly more appropriate paraphrasing.</p>
        <p>Complex Paraphrasing Prompt Structure with Context
Completely rephrase the given paragraph in your own words. Feel free to
incorporate elements from general knowledge to ensure coherence, flow, and
better understanding.
{context_before}</p>
      </sec>
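        <p>A sketch of how such a context-aware prompt could be assembled; the template wording is quoted from the listing above and the placeholder name context_before mirrors it, while the function name and the way the target paragraph is appended are assumptions:</p>

```python
# Template text quoted from the complex prompt listing; {context_before}
# mirrors the placeholder shown there.
COMPLEX_TEMPLATE = (
    "Completely rephrase the given paragraph in your own words. Feel free to "
    "incorporate elements from general knowledge to ensure coherence, flow, and "
    "better understanding.\n{context_before}"
)


def build_complex_prompt(paragraph, context_before):
    """Inject the preceding paragraph as context, then append the target text."""
    instruction = COMPLEX_TEMPLATE.format(context_before=context_before)
    return instruction + "\n\nParagraph:\n" + paragraph
```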
      <sec id="sec-2-12">
        <p>All prompts include additional instructions to output only the paraphrased content, avoiding any explanatory text. Special tokens, tailored to each LLM, are used to suppress verbose output. For DeepSeek-R1, a custom &lt;thinking&gt;...&lt;/thinking&gt; block was used to suppress the model’s internal reasoning steps, which would otherwise significantly slow down the generation. It is worth noting that Mistral performed poorly in following prompt instructions. It often produces explanatory content, hallucinated facts, or gets stuck in output loops, an issue reminiscent of neural network architectures before the attention mechanism era [17]. We presume the 7B-parameter model variant is simply too small to paraphrase highly technical texts. In total, the final dataset consists of 78,038 document pairs, divided into training, validation, and test subsets. The training and validation sets are provided to participants, while the test set is kept private for the evaluation phase. The data splits and sizes are given in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <sec id="sec-3-1">
        <p>All systems are submitted and evaluated on the TIRA platform [18]. The participants are tasked with identifying all the paragraphs <italic>p</italic>′ in <italic>d</italic><sub>plg</sub> and aligning each with the corresponding paragraph <italic>p</italic> in <italic>d</italic><sub>src</sub>. The training and validation sets contain all alignments (<italic>p</italic>, <italic>p</italic>′) for each pair of documents (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>), together with the full text of both documents. The evaluation is carried out using the original scripts from the 2015</p>
      </sec>
      <sec id="sec-3-2">
        <p>PAN plagiarism detection task. We used granularity as well as the micro-averaged and macro-averaged variants of plagdet, recall, and precision, for comparability with past plagiarism detection tasks [19]. All of these metrics take into account the exact character spans of the source and the plagiarism and compare the overlap regions with the ground truth. While the micro-averaged variants take the length of plagiarism spans into account, the macro-averaged variants are length independent.</p>
      </sec>
      <sec id="sec-3-3">
        <p>The micro-averaged variants made sense especially for the old task setups at PAN, as earlier iterations infused plagiarism at the sentence and sometimes even sub-sentence level. As our dataset is constructed along paragraph borders, the micro-variants are less indicative for our evaluations. For the sake of completeness, we evaluated all algorithms on both variants.</p>
        <p>The granularity metric counts how often a true case is detected on average. This metric is useful as we want to avoid a single case of plagiarism being detected multiple times. The domain of the granularity metric is [1, |<italic>R</italic>|], where |<italic>R</italic>| is the number of detections for a single document pair. A perfect score of 1 means that every true case of plagiarism is detected at most once by the given algorithm. As a reminder, plagdet is defined via the F<sub>1</sub> score and with respect to the granularity:</p>
        <p>plagdet(<italic>S</italic>, <italic>R</italic>) = F<sub>1</sub>(<italic>S</italic>, <italic>R</italic>) / log<sub>2</sub>(1 + gran(<italic>S</italic>, <italic>R</italic>)),
where <italic>S</italic> indicates the actual cases of plagiarism in the truth data and <italic>R</italic> the detected cases in (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>).
3.1. Baselines</p>
      </sec>
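        <p>The plagdet definition can be checked numerically with a small helper; this only restates the formula (computing F1 and granularity over character spans is handled by the original PAN evaluation scripts):</p>

```python
import math


def plagdet(f1, granularity):
    """plagdet as used at PAN: the F1 score damped by the detection granularity.
    granularity is at least 1; a perfect granularity of 1 leaves F1 unchanged,
    since log2(1 + 1) = 1."""
    return f1 / math.log2(1 + granularity)
```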
      <sec id="sec-3-4">
        <p>
          We implement three new baselines that use semantic similarity with large language models, plus the baseline from the 2012 edition of PAN [20] that uses lexical similarity. For the three large language model baselines, we split <italic>d</italic><sub>src</sub> and <italic>d</italic><sub>plg</sub> into their paragraphs. For each paragraph in <italic>d</italic><sub>plg</sub>, we take the semantically closest paragraph in <italic>d</italic><sub>src</sub> in terms of cosine similarity based on Linq [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Qwen2 7B Instruct [21] (in the following referred to as Qwen2), and Llama-3.3 70B Instruct [13] (in the following referred to as Llama). For each model, we define a cut-off threshold that classifies the closest pairs as plagiarism. Pairs below that threshold are then discarded. The threshold is determined by calculating the ideal cut-offs on the training split of the data. To compare this class of semantic plagiarism detectors to previous lexical approaches, we also include the baseline from the 2012 edition of the plagiarism detection task at PAN. The 2012 baseline tokenizes the text while normalizing white spaces and punctuation, and then detects sequences of overlapping n-grams between <italic>d</italic><sub>src</sub> and <italic>d</italic><sub>plg</sub> as plagiarism cases.
3.2. Team Submissions
        </p>
      </sec>
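        <p>A minimal sketch of the semantic-similarity baselines described above, using plain vectors in place of Linq/Qwen2/Llama embeddings; the threshold value is a placeholder to be fitted on the training split:</p>

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def detect(plg_embs, src_embs, threshold):
    """For each paragraph embedding of the plagiarized document, take the
    closest source paragraph by cosine similarity; keep only pairs at or
    above the cut-off threshold. Returns (plg_index, src_index, similarity)."""
    detections = []
    for i, p in enumerate(plg_embs):
        sims = [cosine(p, s) for s in src_embs]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            detections.append((i, j, sims[j]))
    return detections
```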
      <sec id="sec-3-5">
        <title>Four teams participated in the task by submitting software.</title>
        <p>3.2.1. Team chi-zi-zhi-xin-dui.</p>
      </sec>
      <sec id="sec-3-6">
        <p>Su et al. [22] split the documents of each pair into sentences and aligned the sentences of <italic>d</italic><sub>src</sub> and <italic>d</italic><sub>plg</sub> according to the SBERT, MPNet, TF-IDF, or BERT score, whichever passed a pre-defined threshold, which was also determined based on the training data. After the alignment, they performed a merging logic to combine subsequences of detected sentences into single blocks.
3.2.2. Team foshan-university.</p>
      </sec>
      <sec id="sec-3-7">
        <p>Tang et al. [23] also pre-processed documents by splitting them into sentence chunks and aligned all sentences from <italic>d</italic><sub>plg</sub> with sentences from <italic>d</italic><sub>src</sub> based on E5 embeddings (intfloat/e5-base-v2). Again, the threshold was determined with the training data. They also performed a span aggregation if two spans have been categorized as plagiarism within a distance of 30 characters.
3.2.3. Team jrluo.</p>
      </sec>
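        <p>The span aggregation used by team foshan-university can be sketched as follows; the 30-character distance is from the text, while the exact merging rule is our assumption:</p>

```python
def merge_spans(spans, max_gap=30):
    """Merge detected character spans [(start, end), ...] whenever the gap
    between consecutive spans is at most max_gap characters."""
    merged = []
    for start, end in sorted(spans):
        if not merged or start - merged[-1][1] > max_gap:
            merged.append((start, end))
        else:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
    return merged
```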
      <sec id="sec-3-8">
        <p>Jieren et al. [24] also split the documents into sentences and first aligned pairs by using TF-IDF vector similarities. For each pair, they calculated the word-based Jaccard similarity and discarded all pairs below a given threshold. All remaining sentence pairs were classified as plagiarism or genuine by a BERT classifier fine-tuned on the training data.</p>
      </sec>
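        <p>The word-based Jaccard filter used by team jrluo can be sketched as follows (the tokenization into lowercased whitespace-separated words is our assumption):</p>

```python
def jaccard(sentence_a, sentence_b):
    """Word-based Jaccard similarity: |intersection| / |union| of word sets."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a.intersection(b)) / len(a.union(b))
```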
      <sec id="sec-3-9">
          <p>3.2.4. Team yukino.
Mo et al. [25] also split the data into chunks of sentences. Each sentence gets a vector representation as the averaged vector representation of its tokens based on GloVe (6B model with 300 dimensions).</p>
      </sec>
      <sec id="sec-3-10">
        <p>Afterwards, all sentences are aligned according to their cosine similarities. Like all other teams, Mo et al. also employed a merging strategy for positive detections based on positional proximity, semantic coherence (based on cosine similarity), and a minimum length constraint.</p>
        <p>
          3.3. Discussion and Results
Table 2 shows the evaluation results for all submissions and baselines on our new dataset. The final score is the average of all sub-scores and is reported as the final score in the lab overview paper [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. While Linq seems to outperform most other approaches, the best performers vary in terms of precision and granularity. This is especially surprising as the baselines Linq, Qwen2, and Llama have been deployed with paragraph splitting rather than sentence splitting with subsequent merging techniques. We would assume these baselines have a slight advantage, especially on the granularity score. It should also be noted that Linq was deployed afterwards to investigate the performance of a specialized model aimed at text retrieval tasks. Otherwise, most submissions outperform the baselines, with the exception of team jrluo. Team jrluo has a relatively low recall compared to its high precision scores. We suspect this is related to an aggressive filtering of the initial TF-IDF similarity calculations.
        </p>
        <p>[Tables 2 and 3: results of all submissions and baselines (qwen2, linq, llama, pan12, foshan-university, jrluo, chi-zi-zhi-xin-dui, yukino) on the new dataset and on PAN12.]</p>
        <p>Table 3 shows the same results on the old PAN12 dataset. Unfortunately, team yukino could not be evaluated, as we ran into issues when applying the old datasets. All submissions (except the original PAN12 baseline) face a significant drop in performance. This is not as surprising for the baselines, as the paragraph splitting simply should not have been applied to the old dataset. This is also evident when looking at the high granularity scores. The team submissions perform significantly better in terms of granularity. An outlier is again team jrluo, with very high precision values. It seems the two-stage filtering approach is particularly useful on the older dataset.</p>
      </sec>
      <sec id="sec-3-11">
        <p>Figure 1 shows the results in a heatmap layout. We can see that team yukino performs overall similarly to Linq but loses significantly on recall. It is also noteworthy that the new dataset is significantly easier in terms of granularity, as entire paragraphs have been plagiarized. It is therefore relatively rare that multiple detections cover the same plagiarized paragraph.
3.3.1. Data Subsets.</p>
      </sec>
      <sec id="sec-3-12">
        <p>[Figure 1: performance overview heatmaps on PAN-12 and PAN-25.]</p>
        <p>To investigate the performance on specific subsets of the data, we calculate the recall values on slices of the data. We only calculate the recall metrics of all approaches on the new data, as precision, plagdet, and granularity would require us to rerun all submissions on a pre-filtered dataset. However, the recall values are sufficient to identify trends across all submissions. Figure 2 shows the recall values of all algorithms for the different models (i.e., which model was used to generate the paraphrasing) and obfuscation levels (i.e., which prompt was used to generate the paraphrasing).</p>
        <p>We can see that, overall, Mistral is easier to detect for almost all approaches. A possible explanation is the fact that Mistral, with 7B parameters, is also the smallest of the used models and as such did not provide paraphrasing of as high quality as the other models did. Counterintuitively, the Llama baseline performs significantly worse at detecting paraphrasing generated with Llama. This might be surprising, as classical LLM-detection methods typically perform best when the same model is used for detection as was used for the generation of the texts [26, 27, 28]. However, these results should not be confused with classical LLM-detection approaches, which are often based on logit-value comparisons. The approaches here are merely based on cosine similarities of content embeddings rather than logit values between tokens.</p>
      </sec>
      <sec id="sec-3-13">
        <p>Another trend is visible in the obfuscation level overview. The recall values per obfuscation level confirm a clear difference between prompt types. Almost all approaches find more plagiarism generated with simple prompts. Likewise, all approaches have the lowest detection success rate with complex prompts. While some approaches, such as that of chi-zi-zhi-xin-dui, are more susceptible to model changes, some approaches, such as foshan-university, are relatively stable regardless of prompt or model type.</p>
        <p>Lastly, Figure 3 shows the recall performances on the actual plagiarism cases compared to all altered cases. Detecting an altered case is considered a false positive. We want approaches that minimize these false classifications, as they could be interpreted as potentially harmful false accusations when handling plagiarism detections. Surprisingly, we can identify a clear difference between the participants’ submissions and two of our baselines, even though the underlying approaches are not particularly diverse. We can see that all submissions by participants show a significantly lower recall on altered cases, sometimes up to 20% lower. The baselines of Llama and Qwen2 are particularly noteworthy as opposing approaches, as their recall on altered cases is significantly higher (in the case of Llama, even twice as high) than on actual plagiarized cases. That means an identified case of plagiarism with these approaches is significantly more likely to be a wrong accusation than an actual case of plagiarism. We assume this discrepancy comes from the construction of the dataset, as all pairs (<italic>d</italic><sub>src</sub>, <italic>d</italic><sub>plg</sub>) have been constructed to be semantically close. We can therefore assume a relatively high general similarity across all paragraphs between <italic>d</italic><sub>src</sub> and <italic>d</italic><sub>plg</sub>, even without infused plagiarism. It seems Llama and Qwen2 have particular issues with differentiating these nuances in semantic similarity based on these embeddings.</p>
      </sec>
      <sec id="sec-3-14">
        <p>[Figure 2: micro- and macro-averaged recall per model (DeepSeek-R1, Llama, Mistral) and per obfuscation level (simple, default, complex) for all submissions and baselines.]</p>
        <p>
          In summary, the results mostly underperform our expectations. All submitted approaches and baselines follow a simple detection approach based on cosine similarities of content embeddings and achieve mostly plagdet values below 0.6. In comparison, on the 2014 edition of the text alignment task (https://pan.webis.de/clef14/pan14-web/text-alignment.html), the majority of submissions achieved plagdet scores above 0.8. Unfortunately, it is unclear whether this can be attributed to a more difficult task setup or to the simplicity of the detection approaches. The comparison to the PAN12 dataset indicates that the approaches are not robust against changes in the data. However, this also includes the previous PAN12 baseline, as it outperforms the other methodologies on the PAN12 task but significantly underperforms on the new dataset.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Future Work</title>
      <sec id="sec-4-1">
        <p>The revival of the plagiarism detection task can be summarized as successful. However, a few crucial improvements can be made to render this task more realistic. The main point of criticism is the actual generation of plagiarism in the new dataset. The current pipeline starts with two genuine documents and infuses synthetic plagiarism by replacing a subset of paragraphs with paraphrased versions from another article. Typically, the textual content of scientific articles is not that interchangeable. Likewise, real-world plagiarism typically does not start from an existing publication and add paragraphs from other works to it. To overcome this issue, in future iterations we will start from multiple genuine documents (or a single document) and generate a new article by paraphrasing the content of each source rather than replacing paragraphs within an existing document. This should also promote a larger variety of detection approaches, as all submissions have followed very similar approaches. The new pipeline will also allow us to revive the important retrieval aspect of plagiarism detection tasks, in which participants start from a suspicious document without knowing whether it is genuine or what the sources are. Another shortcoming is the relatively narrow domain of arXiv. As we have seen with the evaluations on the PAN12 dataset, all approaches, including the PAN12 baseline, are not very robust and perform vastly differently on different datasets. This means newer iterations of this task must incorporate a larger variety of types, and possibly domains, of plagiarism.</p>
      </sec>
      <sec id="sec-4-2">
        <p>[Figure 3: micro- and macro-averaged recall on plagiarized vs. altered cases for all submissions and baselines.]</p>
        <p>In the future, we will incorporate especially the medical domain to bring more variety to the dataset.</p>
        <p>Another challenge is the rapid development of LLMs and of plagiarism itself. Recently, Zochi, a
scientific LLM, generated a publication that passed the scrutiny of peer review at a reputable
international conference<sup>8</sup>. This shows that LLMs are capable of generating genuine, new scientific
texts without plagiarizing existing work. Nonetheless, plagiarizing existing work is now easier than
ever for perpetrators. Future iterations of this task must therefore focus more on proper citations and
on the actual case of ideological reuse or the copying of reasoning chains to stay relevant. Proper citation
was only touched on superficially in the creation of this iteration’s dataset and not separately
evaluated. Lastly, this development also deemphasizes the alignment task because, moving forward,
there will be fewer straightforward cases of matching sources to plagiarism. Instead, indicators such as
structural, ideological, or reasoning-chain similarities will have to be utilized to detect plagiarism. We
will therefore reframe future iterations of this task to ensure that the dataset and plagiarism detection
approaches stay relevant regardless of the development of LLMs.</p>
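        <p>The proposed generation pipeline can be illustrated with a minimal sketch. This is a non-authoritative illustration under stated assumptions, not the task's actual implementation: the paraphrase step stands in for an LLM call, and all names (generate_suspicious_document, SourceSpan) are hypothetical. It shows the key difference to the current pipeline: the new article is built entirely from paraphrased source content, with span-level ground truth recorded for evaluation.</p>

```python
# Hypothetical sketch of the proposed generation pipeline: build a new
# article by paraphrasing the content of each genuine source, instead of
# replacing paragraphs inside an existing document excerpt.
from dataclasses import dataclass


@dataclass
class SourceSpan:
    source_id: str  # which genuine document the text came from
    start: int      # character offset in the generated document
    end: int


def paraphrase(paragraph: str) -> str:
    """Placeholder for an LLM-based paraphraser (identity stub here)."""
    return paragraph


def generate_suspicious_document(
    sources: dict[str, list[str]],
) -> tuple[str, list[SourceSpan]]:
    """Interleave paraphrased paragraphs from all sources into one new
    document and return it with span-level ground truth annotations."""
    parts: list[str] = []
    spans: list[SourceSpan] = []
    offset = 0
    # Round-robin over the sources so the result mixes all of them;
    # zip truncates to the shortest source for simplicity.
    for paragraphs in zip(*sources.values()):
        for source_id, paragraph in zip(sources.keys(), paragraphs):
            text = paraphrase(paragraph)
            parts.append(text)
            spans.append(SourceSpan(source_id, offset, offset + len(text)))
            offset += len(text) + 2  # account for the "\n\n" separator
    return "\n\n".join(parts), spans
```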
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 554559555, 564661959, 437179652; the Lower Saxony Ministry of Science and Culture; and the VW Foundation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>