<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Embedding-Based Approach for Identifying LLM-Generated Code in Student Assignments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paulina Gacek</string-name>
          <email>paulina.gacek.pl@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AGH University of Krakow, Faculty of Electrical Engineering, Automatics, IT and Biomedical Engineering</institution>
          ,
          <addr-line>Mickiewicza 30, 30-059 Krakow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The widespread availability of large language models has introduced new challenges to academic integrity in programming courses. This paper presents a lightweight and practical system for detecting AI-assisted code submissions by leveraging code embeddings to compare student submissions against representative LLM-generated solutions. Experimental results from real student assignments in a university-level Algorithms and Data Structures course demonstrate that the system effectively highlights submissions with strong semantic similarity to LLM-generated solutions, even when minor edits were applied. This system can significantly reduce manual inspection workload by flagging suspicious submissions for instructor review, serving as a decision-support tool rather than an automated classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>code similarity</kwd>
        <kwd>large language models</kwd>
        <kwd>AI-assisted plagiarism</kwd>
        <kwd>code embeddings</kwd>
        <kwd>programming education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The emergence of large language models, such as GPT-4, has significantly reshaped computer science
education. These models can solve complex algorithmic problems within seconds, pass technical
interviews on platforms like LeetCode, and often outperform students in programming-related
assessments [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While LLMs present valuable opportunities as educational tools, their use in academic settings
raises serious concerns regarding academic integrity and the authenticity of student learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        True coding proficiency develops through consistent engagement with algorithmic thinking,
debugging, and mastering syntax. However, when students use large language models to generate solutions,
they often bypass this process, hindering the development of fundamental programming skills. Despite
these pedagogical concerns, students commonly employ LLMs for assignments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        This increasing reliance on AI in programming introduces unique detection challenges. While most
current AI detection tools focus on AI-generated text, they largely overlook AI-generated code [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Distinguishing LLM-assisted or AI-generated code from human-written code is particularly complex.
Unlike text detection, which can leverage stylistic features and linguistic patterns, programming tasks
often have a constrained solution space [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Consequently, independently written solutions by different
students might exhibit structural or semantic similarities, making it difficult to reliably differentiate
between authentic and AI-generated code.
      </p>
      <p>Building on the identified need, this paper introduces a lightweight and interpretable system designed
to assist educators in identifying code submissions that may have been generated or significantly
influenced by large language models. Rather than making grading decisions or issuing accusations, the
system computes similarity scores between student submissions and representative LLM-generated
solutions. These scores serve as a decision-support tool for educators, helping to flag submissions
that necessitate a closer examination. The proposed system aims to foster informed, constructive dialogue
between instructors and students, thereby promoting academic integrity in an era where LLMs are
increasingly integrated into the programming process.
</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The goal of the proposed system is to detect potential use of large language models in student
programming assignments by simulating how students might realistically use AI-generated code while
attempting to avoid detection. The approach combines LLM code generation with an embedding-based
similarity analysis to compare student submissions against a range of plausible AI-generated solutions.</p>
      <sec id="sec-2-1">
        <title>2.1. Modeling Student Behavior</title>
        <p>To guide the design of the simulation, an anonymous survey was conducted among 50 Computer
Science students at AGH University of Kraków. The results, summarised in Figure 1, indicated that
94% of students reported using large language models when solving programming tasks. Among these
students, 66.7% stated they deliberately modify the generated code to avoid detection. Specifically,
the most common modifications included removing comments, reported by 78.1% of LLM users who
admitted to modifying LLM-generated code. Additionally, 46.9% of LLM users reported changing variable
names, and 46.9% made manual modifications to the code to make it appear less AI-generated, often
involving structural or stylistic changes such as simplifying advanced constructions.</p>
        <p>Complementing the findings on LLM adoption, the survey also investigated the specific LLMs students
utilize for programming tasks, as illustrated in Figure 2. The results indicate a predominant reliance
on GPT, with 97.9% of students reporting its use. Other frequently employed LLMs include Deepseek
(54.2%) and Gemini (41.7%), while models such as Claude (20.8%), Grok (8.3%), Copilot (2.1%), Llama
(2.1%), and Mistral (2.1%) were used to a lesser extent. This distribution highlights the landscape of
tools students are currently integrating into their programming education.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Code Generation and Augmentation</title>
        <p>The system begins by querying a selected large language model to generate a reference solution for a
given programming assignment. To simulate realistic student behavior, the model is then prompted to
rewrite the code multiple times while preserving its original functionality. Each rewritten version represents a
variation that a student might plausibly submit. This process is illustrated in Figure 3.</p>
        <p>A key architectural advantage of the proposed system is its extensibility, allowing for easy integration of
additional LLM API connections. In the current implementation, based on the prevalent usage observed
in the survey, APIs to GPT-4o (https://openai.com/index/hello-gpt-4o/) and Gemini 2.5 Flash
(https://deepmind.google/models/gemini/flash/) have been incorporated.</p>
        <p>The output of this step is a set of diverse yet semantically equivalent code snippets that reflect the
kinds of edits students might apply to LLM-generated code before submission.</p>
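        <p>For concreteness, the sketch below outlines how this generation-and-rewriting step could be implemented against the OpenAI Python client; the prompts, helper names, and default model identifier are illustrative assumptions rather than the exact ones used by the system.</p>
        <preformat>
from openai import OpenAI

# Illustrative sketch of the generation step; prompts and defaults are assumptions,
# not the exact ones used by the described system.
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def generate_reference_solution(task_description: str, model: str = "gpt-4o") -&gt; str:
    """Ask the model for a reference solution to the assignment."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Solve the following task in Python:\n" + task_description}],
    )
    return response.choices[0].message.content


def generate_rewrites(reference_code: str, n: int, model: str = "gpt-4o") -&gt; list[str]:
    """Ask the model to rewrite the reference solution n times, preserving behaviour."""
    rewrites = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": "Rewrite the following Python code so that it looks "
                                  "different but behaves identically:\n" + reference_code}],
        )
        rewrites.append(response.choices[0].message.content)
    return rewrites
        </preformat>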
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Similarity Detection Using Embeddings</title>
        <p>To assess whether a student submission may be derived from an LLM-generated solution, an
embedding-based similarity detection approach was employed, as illustrated in Figure 4.</p>
          <p>
            The process begins by removing all comments from both student submissions and LLM-generated
code variants to eliminate stylistic noise. The cleaned code snippets are then embedded using
QodoEmbed-1 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], a state-of-the-art embedding model trained on a diverse corpus of programming code,
including Python. This model captures the semantic structure of code and is robust to minor textual
variations, such as changes in variable names or formatting.
          </p>
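          <p>As a minimal illustration of this preprocessing step (not necessarily the system's exact implementation), Python's standard tokenize module can be used to drop # comments while leaving the remaining code untouched:</p>
          <preformat>
import io
import tokenize

# Sketch of the comment-removal step; the real system may preprocess code differently.
def strip_comments(code: str) -&gt; str:
    """Remove '#' comments from a Python snippet before it is embedded."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)
          </preformat>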
          <p>Cosine similarity is then computed between each student submission and each LLM-generated variant.
This yields a quantitative measure of semantic similarity, allowing the system to identify structurally or
algorithmically similar code even when superficial edits are present. High similarity scores may suggest
that a student submission was influenced—either directly or indirectly—by AI-generated content.</p>
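          <p>The embedding and scoring step can be condensed as in the sketch below; it assumes QodoEmbed-1 can be loaded through the sentence-transformers interface, and the Hugging Face model identifier shown here is an assumption to be checked against Qodo's documentation.</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

# The model identifier below is an assumption; consult Qodo's documentation for the exact name.
model = SentenceTransformer("Qodo/Qodo-Embed-1-1.5B")


def similarity_matrix(student_codes: list[str], llm_variants: list[str]):
    """Cosine similarity between every student submission and every LLM-generated variant."""
    student_emb = model.encode(student_codes, convert_to_tensor=True, normalize_embeddings=True)
    variant_emb = model.encode(llm_variants, convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(student_emb, variant_emb)  # shape: (num_students, num_variants)
          </preformat>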
          <p>It is important to emphasize that similarity scores alone do not provide definitive evidence of
misconduct. Due to the narrow solution space of many programming problems, different students
may independently arrive at similar implementations. Consequently, no fixed threshold can guarantee
perfect discrimination between original and AI-assisted submissions. In this study, the similarity cutoff
was empirically set to 0.94, based on manual analysis of the submission dataset and the observed
distribution of similarity scores.</p>
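          <p>In practice, the cutoff is applied to each submission's best match across all generated variants, as in the simplified sketch below (variable and function names are illustrative).</p>
          <preformat>
import torch

THRESHOLD = 0.94  # empirically chosen cutoff used in this study


def flag_suspicious(sim_matrix: torch.Tensor, student_ids: list[str],
                    threshold: float = THRESHOLD) -&gt; list[tuple[str, float]]:
    """Return (student_id, best_similarity) pairs whose best match exceeds the cutoff."""
    best_per_student, _ = sim_matrix.max(dim=1)  # best match over all LLM variants
    return [(sid, float(score))
            for sid, score in zip(student_ids, best_per_student)
            if score &gt;= threshold]
          </preformat>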
          <p>This embedding-based approach significantly reduces the manual workload associated with reviewing
large numbers of student submissions. In scenarios involving over a hundred code submissions, detailed
manual inspection is often impractical. The system efficiently identifies potentially AI-influenced
code, enabling instructors to concentrate their efforts on a small subset of cases that warrant further
investigation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>To assess the effectiveness of the proposed system in detecting AI-assisted student submissions, a series
of experiments was conducted using real-world programming assignments and authentic student
data. The evaluation was performed on a dataset comprising 509 student submissions collected across
multiple assignments from the Algorithms and Data Structures course at AGH University of Krakow.
Each task required students to implement a function that solves a well-defined algorithmic problem,
applying data structures and techniques introduced during the course.</p>
      <p>Evaluating AI-generated code detection is inherently challenging due to the absence of definitive
ground truth labels. In typical academic settings, students do not disclose whether they have used large
language models to assist with their work, making it difficult to construct a reliably labeled dataset.</p>
      <p>While one possible solution would involve instructing students to use LLMs and intentionally disguise
the resulting code, this approach risks creating an artificial environment. When participants are aware
their submissions will be analyzed for AI traces, they may overcompensate in their
modifications—potentially expending as much effort on concealment as they would on solving the problem unaided. As a
result, the behavior captured in such a setup may not accurately reflect real-world scenarios and could
lead to skewed conclusions.</p>
      <p>To address these challenges without introducing artificial bias, we adopt an unsupervised
evaluation approach. The system generates multiple LLM-based variants of each programming task, and
cosine similarity is then computed between student submissions and these generated variants. Rather
than relying on binary labels, the distribution of similarity scores across all submissions is analyzed.
Submissions with high similarity scores are interpreted as likely influenced by LLM-generated content.</p>
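      <p>A simple way to examine this distribution, sketched below under the same illustrative assumptions as before, is to plot a histogram of each submission's best similarity score and inspect where the upper tail begins.</p>
      <preformat>
import matplotlib.pyplot as plt

# Illustrative inspection step; best_scores would hold each submission's highest
# cosine similarity to any LLM-generated variant.
def plot_score_distribution(best_scores: list[float]) -&gt; None:
    """Histogram of per-submission best similarity scores for manual inspection."""
    plt.hist(best_scores, bins=30)
    plt.xlabel("Best cosine similarity to any LLM-generated variant")
    plt.ylabel("Number of submissions")
    plt.title("Distribution of similarity scores")
    plt.show()
      </preformat>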
      <sec id="sec-3-1">
        <title>3.1. Case Study: Example Assignment Analysis</title>
        <p>To further illustrate the effectiveness of the proposed detection system, this section analyzes a
representative programming task used in the evaluation. The full task description is provided below.</p>
        <p>Task description
AGH student city is covered with trees that have an extensive root system. This system is
represented by a graph G, where vertices represent trees and edges represent connections
between their root systems. To study the city's root system, students selected k trees and
inoculated them with k different fungus species, numbered from 0 to k − 1.</p>
        <p>In one unit of time, a fungus can spread from a tree to all directly connected trees whose
roots were not previously infected by any fungus. If two or more fungus species reach an
uninfected tree in the same unit of time, the fungus with the smallest index wins and infects
that tree.</p>
        <p>The task is to implement the function getCountOfInfectedTrees(G: List[List[int]],
infectedTrees: List[int], fungusNr: int) -&gt; int, which determines how many trees will
ultimately be infected by the fungus with the number fungusNr. The function accepts the
following arguments:
• G: The graph represented as an adjacency list.
• infectedTrees: An array containing the numbers of the trees that were initially
inoculated with fungus.</p>
        <p>• fungusNr: The number of the fungus for which we want to count infected trees.</p>
        <p>The function should return the number of trees infected by fungus number fungusNr.</p>
        <p>The majority of submissions exhibit cosine similarity scores in the range of 0.84–0.94, indicating that
many student solutions share structural or semantic similarities with LLM-generated code. A small
number of submissions exceed a similarity of 0.94, suggesting a strong resemblance that may indicate
AI assistance, despite superficial variations. Interestingly, the ten lowest scores (around 0.60) represent
comparisons between LLM-generated answers and a single student submission containing only the
function definition without implementation.</p>
        <p>Listings 1 and 2 present a student submission and its corresponding GPT-4o-generated solution,
which yielded a cosine similarity score of 0.96. Although these implementations differ slightly in control
flow and formatting, both employ a nearly identical breadth-first search strategy to simulate the spread
of fungal infection across the graph.</p>
        <p>Listing 1: Student submission with 96% cosine similarity to a GPT-4o-generated variant.</p>
        <preformat>
def getCountOfInfectedTrees(G: List[List[int]], infectedTrees: List[int],
                            fungusNr: int) -&gt; int:
    n = len(G)
    owner = [-1] * n
    queue = deque()
    for fungusIndex, tree in enumerate(infectedTrees):
        owner[tree] = fungusIndex
        queue.append((0, fungusIndex, tree))
    while queue:
        time, fungusIndex, currentTree = queue.popleft()
        for neighbor in G[currentTree]:
            if owner[neighbor] == -1:
                owner[neighbor] = fungusIndex
                queue.append((time + 1, fungusIndex, neighbor))
    return owner.count(fungusNr)
        </preformat>
        <p>Listing 2: Corresponding GPT-generated solution after the fourth rephrasing (comments removed).</p>
        <p>A notable observation is the reuse of specific variable names—such as owner—which are not
particularly intuitive for the task at hand. The repeated use of such an idiosyncratic identifier in both
versions suggests a high likelihood of copying or direct influence from the LLM output. This example
demonstrates the strength of embedding-based similarity detection in capturing semantic and structural
similarities that go beyond surface-level modifications.</p>
        <p>Importantly, the system does not make automatic accusations or grading decisions. Instead, it
provides similarity metrics to support educators in reviewing potential AI-assisted solutions and
initiating conversations with students if necessary.</p>
        <p>In this particular case, an analysis of the student’s submission history reveals additional evidence
suggesting the use of a large language model. The entire implementation was submitted within a span
of just 10 minutes and triggered two critical errors on the first two submissions: ERROR: name 'deque'
is not defined and ERROR: name 'fungusNr' is not defined. These errors are typical when
copying code from an LLM response without adapting it to the provided function signature or
including necessary imports. Notably, the function argument fungusNr was incorrectly replaced by
fungusNrToCount, further supporting the hypothesis that the submission was pasted from a generic
LLM output without adequate integration into the student’s codebase.</p>
        <p>In contrast, the solution shown in Listing 3 was confirmed to be independently written by a student
without the use of AI assistance. This conclusion was based on direct communication with the student
and a review of their submission history. While the core logic of both this and the suspected AI-assisted
solutions relies on breadth-first search (BFS), notable differences exist in completeness, variable naming,
and implementation structure. These distinctions are sufficient to account for the lower similarity score
observed.</p>
        <preformat>
discovered = [False] * len(G)
queue = deque()
f_counters = [0] * len(infectedTrees)
for i in range(len(infectedTrees)):
    queue.append((infectedTrees[i], i))
    f_counters[i] += 1
    discovered[infectedTrees[i]] = True
while queue:
    node, f_index = queue.popleft()
    for neighbour in G[node]:
        if discovered[neighbour] == False:
            discovered[neighbour] = True
            f_counters[f_index] += 1
            queue.append((neighbour, f_index))
return f_counters[fungusNrToCount]
        </preformat>
        <p>Listing 3: Student-created solution confirmed to be written independently.</p>
          <p>As shown in Figure 6, there is a clear separation in cosine similarity scores between independently
written code and submissions suspected to have been AI-assisted. Original work clustered in the
0.79–0.85 range, while potentially AI-generated submissions exhibited significantly higher similarity,
between 0.90 and 0.96. This supports the validity of the similarity threshold selected and demonstrates
the effectiveness of the embedding-based approach in flagging suspicious cases.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This study presents a practical and scalable approach to detecting LLM-assisted code submissions in
programming assignments. Experimental results on real-world student submissions from a
university-level Algorithms and Data Structures course demonstrate the system’s ability to capture both structural
and semantic similarities, even when minor surface-level changes are introduced.</p>
      <p>The proposed system avoids the need for ground-truth labeling, which is inherently difficult to obtain.
Instead, it provides a data-driven, unsupervised methodology for supporting educators in evaluating
the integrity of programming assignments. Instructors can use this tool to prioritize manual review
of highly similar submissions, engage students in follow-up discussions, and better understand how
LLMs are influencing learning behaviors. For example, the identification of a student who submitted an
entire solution within minutes—along with LLM-typical syntax and missing imports—suggests that
detection systems can serve as valuable starting points for personalized pedagogical intervention.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>In the current approach, only the final version of each student submission is used for similarity
computation. Future iterations will take into account all saved submissions, allowing the detection
method to capture the evolution of problem-solving and style shifts across attempts. In addition,
incorporating other types of similarity is planned in order to improve the robustness of the approach.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Source Code Availability</title>
      <p>The source code and supporting materials used in this study are openly available in a public GitHub
repository: https://github.com/paulinagacek/llm-detection-gecoin.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used GPT-4o and Gemini 2.5 Flash for grammar
and spelling checking. After using these tools, the author reviewed and edited the content as needed and
takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Sparks of artificial general intelligence: Early experiments with gpt-4</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.12712. arXiv:2303.12712.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Qureshi</surname>
          </string-name>
          ,
          <article-title>Chatgpt in computer science curriculum assessment: An analysis of its successes and shortcomings</article-title>
          , in: 2023 9th International Conference on e-Society, e-Learning and e-Technologies (ICSLT 2023), ACM,
          <year>2023</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>13</lpage>
          . URL: http://dx.doi.org/10.1145/3613944.3613946. doi:10.1145/3613944.3613946.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Paustian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Slinger</surname>
          </string-name>
          ,
          <article-title>Students are using large language models and ai detectors can often detect their use</article-title>
          ,
          <source>Frontiers in Education</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ). URL: https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2024.1374889. doi:10.3389/feduc.2024.1374889.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <article-title>Detecting ai-generated code assignments using perplexity of large language models</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>38</volume>
          (
          <year>2024</year>
          )
          <fpage>23155</fpage>
          -
          <lpage>23162</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/30361. doi:10.1609/aaai.v38i21.30361.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Azoulay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hirst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reches</surname>
          </string-name>
          ,
          <article-title>Let's do it ourselves: Ensuring academic integrity in the age of chatgpt and beyond</article-title>
          ,
          <year>2023</year>
          . doi:10.36227/techrxiv.24194874.v1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>Qodo AI</string-name>
          ,
          <source>Qodo documentation</source>
          ,
          <year>2024</year>
          . URL: https://docs.qodo.ai/qodo-documentation, accessed: 2025-06-08.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>