<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3643795</article-id>
      <title-group>
        <article-title>Yet Ready For Vulnerability Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Panebianco</string-name>
          <email>francesco.panebianco@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Isgrò</string-name>
          <email>andrea.isgro@mail.polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Longari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Zanero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Software Security, Vulnerability Detection, Artificial Intelligence</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>35</volume>
      <fpage>03</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>The growing number of reported software vulnerabilities underscores the need for efficient detection methods, especially for resource-limited organizations. While traditional techniques like fuzzing and symbolic execution are effective, they require significant manual effort. Recent advances in Large Language Models (LLMs) show promise for zero-shot learning, leveraging pre-training on diverse datasets to detect vulnerabilities without fine-tuning. This study evaluates quantized models (e.g., Mistral v0.3), code-specialized models (e.g., CodeQwen 1.5), and fine-tuned approaches like PDBERT. Zero-shot models perform poorly, with a precision below 0.46, and even PDBERT's high metrics (precision 0.91, specificity 0.99) are undermined by overfitting. These findings emphasize the limitations of current AI solutions and the necessity for approaches tailored to the specific problem.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        While several works have attempted to fine-tune language models and Natural Language Processing (NLP) classifiers for this
task [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], it was shown that the encouraging results of many of these works may have been the result
of overfitting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Overfitting is a phenomenon that occurs when a trained AI model captures patterns
in the training sample that are not representative of the overall distribution of the population. As a
result, the model performs poorly on unseen data.
      </p>
      <p>An alternative strategy to fine-tuning is zero-shot learning [12]. It employs an LLM’s generalization
capabilities to perform a task not included in the training objective. This study identifies the most
commonly used datasets and LLM architectures in recent publications focused on vulnerability detection.
Furthermore, it evaluates low-cost commercial solutions and quantized open-source models to determine
their suitability for the requirements of small and medium-sized businesses. The evaluation also includes
a recently published methodology named PDBERT [13], as a representative of the performance of
fine-tuning approaches.</p>
      <p>Our analysis shows that quantized general-purpose models, such as Mistral v0.3, fail to reliably
recognize common weaknesses in code, achieving a precision of 0.46. Similarly, code-specialized models
like CodeQwen 1.5 exhibit even lower precision, reaching only 0.38. The same can be said for low-cost
commercial solutions like GPT-4o mini, which reaches a precision of 0.30. PDBERT, on the other hand,
has precision and specificity values of 0.91 and 0.99 respectively. Although the observed metrics may
typically indicate a reliable classifier, our analysis reveals that this performance stems from overfitting.
Specifically, we curated a set of vulnerable examples and their corresponding fixes. Regrettably, PDBERT
failed to recognize any of the vulnerable instances.</p>
      <p>We summarize our contributions as follows:
• We provide an overview of the current architectural and dataset choices in the literature on
LLM-based vulnerability detection.
• We evaluate the capabilities of low-budget LLM models on the task, highlighting performance
differences across common weaknesses.
• Using a curated set of vulnerable functions, we demonstrate that a notable fine-tuned approach
exhibits overfitting to the training dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <p>This section provides an overview of key concepts and techniques pertinent to software security and
large language models (LLMs). Additionally, we review relevant literature in the field and present the
motivation behind our research.</p>
      <sec id="sec-2-1">
        <title>2.1. Vulnerabilities in Software</title>
        <p>Vulnerabilities in software applications are a threat to the security of infrastructures, organizations, and
individuals. These critical bugs result from errors committed by developers within the codebase. If these
bugs don’t trigger critical crashes or evident inconsistencies, it can be challenging for developers and
code reviewers to detect their presence. A software vulnerability is an implementation error that allows
users to perform malicious actions beyond the intended software specifications. Such vulnerabilities
include logical errors, which cause inconsistent system states; authentication and authorization flaws,
which enable unauthorized users to execute restricted actions; and memory corruption in binary
software, potentially leading to Remote Code Execution (RCE). RCE represents the most severe form of
attack, as it grants an attacker the ability to execute arbitrary operations on the target system.</p>
        <p>
          Fuzzing and symbolic (or concolic) execution are among the most prominent techniques for
automating vulnerability detection [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. These methodologies have demonstrated considerable success
in identifying numerous vulnerabilities over the years. However, they exhibit notable limitations.
First, their setup involves substantial manual effort and domain-specific expertise, rendering them
inaccessible to individuals lacking specialized knowledge. Second, these techniques often fail to uncover
vulnerabilities that are only triggered in deep code paths.
        </p>
        <p>
          Categorizing Vulnerabilities. Common Vulnerabilities and Exposures (CVEs) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] represent
standardized identifiers for publicly known cybersecurity vulnerabilities, facilitating effective communication
and remediation across diverse information security systems. Each CVE entry provides a unique
identifier and concise description of a specific software or hardware vulnerability, enabling organizations to
prioritize and address risks systematically. The record for a CVE also includes a quantitative measure of
its severity. In contrast, Common Weakness Enumerations (CWEs) [14] serve as a taxonomy of software and
hardware weaknesses that underlie vulnerabilities, providing a systematic framework for identifying,
categorizing, and mitigating the root causes of security flaws. Unlike CVEs, which address specific
instances of vulnerabilities, CWEs focus on generic patterns of error in design, implementation, or
configuration. These patterns can be captured by machine learning models to perform detection.
        </p>
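        <p>To make the notion of a recurring weakness pattern concrete, the following sketch (our illustration, not part of the original paper) flags calls to notoriously unsafe C functions; the regular expression and function names are hypothetical.</p>

```python
import re

# Hypothetical illustration of a shallow, recurring weakness pattern of the
# kind a CWE describes: calls to gets(), strcpy(), or sprintf() in C code are
# classic indicators of buffer overflows (e.g., CWE-120, CWE-787).
UNSAFE_CALLS = re.compile(r"\b(gets|strcpy|sprintf)\s*\(")

def flag_weakness_patterns(c_source):
    """Return the names of unsafe C calls found in a source snippet."""
    return [m.group(1) for m in UNSAFE_CALLS.finditer(c_source)]

snippet = """
void greet(char *name) {
    char buf[16];
    strcpy(buf, name);  /* no bounds check: possible out-of-bounds write */
}
"""
print(flag_weakness_patterns(snippet))  # ['strcpy']
```

        <p>Real vulnerabilities rarely reduce to such lexical signatures, which is precisely why detectors that can capture deeper, context-dependent patterns are attractive.</p>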
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large Language Models and Training Strategies</title>
        <p>
          Large Language Models (LLMs) are a recent breakthrough in Natural Language Processing (NLP),
powered by advances in deep learning architectures and the availability of large-scale datasets. This advance
was introduced by the Transformer architecture [15], which replaced recurrent and convolutional
models with self-attention mechanisms, enabling unprecedented scalability and contextual
understanding. This innovation paved the way for state-of-the-art models like GPT (Generative Pre-trained
Transformer) [16], culminating in the ChatGPT [17] revolution, which showcased the practicality and
societal impact of generative AI. LLMs demonstrate remarkable generalization capabilities, driven by
their training on vast, diverse datasets [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Zero-Shot Learning. LLMs are already employed by software engineers for tasks such as code
generation and code analysis, playing an integral role in the software development process [18]. Effective
handling of code-related tasks requires both flexibility and robust generalization capabilities, as code
syntax merely serves as a medium to represent the underlying algorithm. The algorithm itself is rooted
in logical and formal reasoning. Algorithm design can therefore be framed as solving novel, often
unseen, tasks that are defined by given formal instructions.</p>
        <p>Zero-shot learning is a machine learning approach where an LLM is applied to perform tasks outside of
its explicit training objectives [19]. In contrast to conventional supervised learning, zero-shot learning
does not involve updating model weights through dataset-specific fitting. Instead, task adaptation
occurs through prompting. Instructions are provided to guide the model toward generating the desired
output format or response. Zero-shot classification is a specific instance of a zero-shot learning task,
in which the LLM is used as a classifier model. The model is instructed to output the classification
label as a result of some “reasoning” performed on the query input. A popular use-case of zero-shot
classification is LLM-as-a-judge [20], which provides a judgment on the given input.</p>
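        <p>Concretely, zero-shot classification of this kind amounts to prompt construction plus output parsing. The sketch below is our own minimal illustration; the prompt wording, CWE list, and function names are assumptions, not the prompt used in this study.</p>

```python
# Minimal sketch of zero-shot vulnerability classification via prompting.
# The prompt template and label format below are illustrative assumptions.
CWE_LIST = ["CWE-119", "CWE-476", "CWE-787"]

def build_zero_shot_prompt(code):
    cwes = ", ".join(CWE_LIST)
    return (
        "You are a code security auditor. Consider the following "
        "weaknesses: " + cwes + ". Answer with exactly one label, "
        "VULNERABLE or NOT_VULNERABLE, for the function below.\n\n" + code
    )

def parse_label(model_output):
    # The instructions steer the model toward a fixed output format,
    # so classification reduces to checking which label was produced.
    return "NOT_VULNERABLE" not in model_output.upper()

prompt = build_zero_shot_prompt("int f(int *p) { return *p; }")
print(parse_label("VULNERABLE"))      # True
print(parse_label("NOT_VULNERABLE"))  # False
```

        <p>No weights are updated: task adaptation lives entirely in the instructions prepended to the query input.</p>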
        <p>Commercial LLMs like GPT-4 and GPT-4 Turbo models have shown impressive capabilities on
code-related tasks [21]. Still, the public release of models like LLAMA [22] has allowed the open-source
community to fine-tune specialized models on different tasks, including code generation. These include
StarCoder2 [23] and CodeQwen [24]. Alongside these specialized models, general-purpose LLMs like
LLAMA 3.1 have improved performance on code tasks over previous iterations and similar open-source
alternatives [22].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Related Work</title>
        <p>
          Wu and Zhang et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] analyze seven use cases for ChatGPT (3.5 and 4) in the field of Software Security,
including vulnerability detection. Code affected by real-world CVEs is tested against the two versions
of the model. They discuss both successful scenarios and failure conditions, for which they provide
likely causes. They find that the newer iteration of ChatGPT (based on GPT 4) is much more likely to
identify the vulnerabilities. Zhou et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] survey the current landscape of LLMs applied to vulnerability
detection and repair. The survey highlights the dominance of encoder-only LLMs for vulnerability
detection. It explores key approaches like fine-tuning, zero-shot, and few-shot prompting, along with
techniques that combine program analysis to improve model performance. The paper also identifies
critical limitations, such as the lack of high-quality datasets, challenges with complex vulnerabilities,
and the narrow focus on function-level detection. Steenhoek et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] present a broad review of
state-of-the-art LLMs applied to the task of vulnerability detection in software. The research provides an
assessment of the detection performance, examining the identification of the type, location, and cause
of vulnerabilities. The study evaluates a range of prompting techniques, including zero-shot, n-shot
in-context, and advanced methods incorporating contrastive pairs and chain-of-thought reasoning from
static analysis and CVE descriptions. Zhang et al. [25] present an extensive empirical study investigating
the efficacy of pre-trained model-based automated software vulnerability repair techniques, focusing
on C/C++ code. The authors evaluate the performance of several pre-trained models against common
vulnerability datasets.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Motivation</title>
        <p>This study aims to assess the current capabilities of AI-based technologies for automatic vulnerability
detection. To achieve this, we begin by reviewing the state-of-the-art in terms of datasets and
models commonly used in the field. This review serves as a foundational reference for researchers
approaching this topic. Additionally, we evaluate the performance of publicly available LLM
solutions on the task of vulnerability detection, providing insights into their effectiveness for this
application. In relation to existing research, this paper seeks to identify current gaps in the use of AI
for automatic vulnerability detection. Furthermore, it offers an empirical evaluation of the potential of
this technology, with a particular focus on its applicability and benefits for small and medium-sized
enterprises (SMEs). These organizations may be required to maintain in-house, often legacy, software
that they cannot afford to have assessed by security professionals. Through this examination, we
aim to contribute to a better understanding of both the challenges and opportunities that AI-driven
vulnerability detection presents for businesses of this scale.</p>
      </sec>
      <sec id="sec-3-review">
        <title>3. Review of Vulnerability Detection Datasets and Models</title>
        <p>Given the challenges identified in AI-based vulnerability detection, understanding the datasets and
models commonly used in this field is crucial for evaluating current approaches. To this end, we
examined datasets and model choices for a collection of recent works from both publications and
preprints. Appendix A covers the details of considered works.</p>
        <p>Datasets. Figure 1 shows the occurrences of a certain evaluation dataset in the considered set. The
vast majority of works rely on existing datasets for their evaluation. The most popular datasets are
BigVul [26] and Reveal [27]. Big-Vul is a large C/C++ code vulnerability dataset collected from open-source
Github projects. Reveal is instead collected from Chrome and Linux Kernel issue trackers. Devign [28]
and CVEfixes [29] follow among the most popular options. It can be observed that these datasets
are predominantly assembled by scraping commits and issue trackers from GitHub repositories of
open-source software. While some works opt for custom self-assembled datasets, samples in their
collection are gathered from similar sources. While this approach is practical given the scarcity of
curated samples for vulnerability detection tasks, it introduces potential experimental biases that must
be addressed. Specifically, there is a significant likelihood that the code from these open-source projects
may already be included in the training or pre-training datasets of the LLMs used. If an overlap exists,
it undermines the reliability of the evaluation metrics.</p>
        <p>
          LLM Models. In recent literature, a diverse set of large language models (LLMs) has been deployed for
software vulnerability detection, with varying levels of adoption and efectiveness. Among these, GPT-4
emerges as the most widely employed (see Figure 1), being a central focus in numerous studies [
          <xref ref-type="bibr" rid="ref8">8, 30,
31, 32, 33, 34, 35, 36, 37, 38</xref>
          ]. Its extensive use can be attributed to its large context length (up to 128k
tokens) and strong performance. GPT-3.5 shows significant prevalence [
          <xref ref-type="bibr" rid="ref8">8, 39, 38, 40, 41, 37, 42, 43</xref>
          ].
This sustained popularity is partly due to its longer availability, which has led to its incorporation into
many experiments and datasets, even though it offers lower performance compared to newer models
like GPT-4.
        </p>
        <p>[Figure 1: occurrences of evaluation datasets (Custom, Big-Vul, Reveal, Devign, Draper, FormAI, Linux Kernel, SARD, VulDeePecker) and models (GPT-4, GPT-3.5, Mixtral-MoE, LLAMA 2, WizardCoder, Mistral, Falcon, CodeBERT, Code LLAMA) in the considered works.]</p>
        <p>
          Other models are also seeing increasing but more selective adoption. For instance, LLAMA,
Mixtral, Mistral, and its variations are used [
          <xref ref-type="bibr" rid="ref8">8, 40, 32, 42</xref>
          ]. These models are chosen for their smaller
parameter sizes and open-source access. Similarly, some works adopt CodeGen, StarCoder, Falcon,
BERT and their variations (CodeBERT, VulBERTa) [
          <xref ref-type="bibr" rid="ref10 ref11">44, 45, 46, 47, 25, 13, 10, 48, 11</xref>
          ]. These models are
task-specific, thus limiting their general application. Figure 1 shows the distribution of the most popular
models used in considered works. In short, GPT-4 and GPT-3.5 dominate the field due to their high
performance and historical availability. Other models like LLAMA, Mistral, StarCoder, and Falcon are
instead adopted as open-source.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Assessment Methodology</title>
      <p>
        We evaluate the performance of LLMs on the task of vulnerability detection in binary software (C/C++),
with a focus on solutions accessible to small and medium enterprises. We select models that are either
open-source and quantized for deployment on low-end hardware or available via affordable API queries.
The evaluation uses code snippets from one of the most widely used vulnerability datasets, which are
input to the LLMs with a specifically crafted prompt. The prompt outlines a vulnerability detection task,
presenting a list of Common Weakness Enumerations (CWEs) that could apply to the code and asks
the model to determine if any of these CWEs are present. If a non-vulnerable sample is erroneously
predicted as vulnerable, the model is also asked to identify the CWE, providing insights into potential
biases toward specific weaknesses. Figure 2 visually shows our assessment pipeline. The paper addresses
three research questions:
RQ1: How effective are zero-shot solutions for vulnerability detection? To answer this, we assess several
popular models, including general-purpose, code-specific, open-source, and commercial solutions.
RQ2: Do current fine-tuned models suffer from overfitting, as previously observed by Risse and
Böhme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]? We evaluate whether more recent and promising fine-tuned models exhibit similar
overfitting patterns.
      </p>
      <p>RQ3: Is there a tendency for models to over-recognize specific CWEs? This question examines whether
certain CWEs are detected more frequently or with higher accuracy, indicating a potential bias in the
model’s ability to generalize across different vulnerabilities.</p>
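      <p>The assessment pipeline of Figure 2 can be sketched as a loop over labeled snippets, with a follow-up query on false positives. This is our own sketch with a stub in place of a real LLM; all function names, the prompt text, and the stub's behavior are assumptions.</p>

```python
# Sketch of the assessment loop described above, with a stub standing in
# for a real LLM (the stub trivially answers "not vulnerable" every time).
def stub_llm(prompt):
    return "NOT_VULNERABLE"

def assess(samples, query=stub_llm):
    """samples: list of (code, is_vulnerable, cwe_id_or_None)."""
    tally = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    fp_cwes = []
    for code, is_vuln, cwe in samples:
        answer = query("Is this code vulnerable? " + code)
        predicted_vuln = "NOT_VULNERABLE" not in answer
        if predicted_vuln and is_vuln:
            tally["TP"] += 1
        elif predicted_vuln and not is_vuln:
            tally["FP"] += 1
            # On a false positive, ask which CWE the model believes applies.
            fp_cwes.append(query("Which CWE-ID describes it? " + code))
        elif not predicted_vuln and is_vuln:
            tally["FN"] += 1
        else:
            tally["TN"] += 1
    return tally, fp_cwes

samples = [("int a;", False, None), ("char b[2]; b[9] = 0;", True, "CWE-787")]
print(assess(samples)[0])  # {'TP': 0, 'FP': 0, 'TN': 1, 'FN': 1}
```

      <p>The follow-up question on false positives is what enables the per-CWE bias analysis of RQ3.</p>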
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Evaluation</title>
      <p>We evaluate the suitability of low-budget LLMs for vulnerability detection in small and
medium-sized enterprises. Candidate models are the most recent open-source models, as well as corresponding
commercial alternatives. We quantize open-source models to 4-bit integers using the BitsAndBytes
Python library [49]. We pick two open-source general-purpose models, LLAMA 3.1 8B [22], Mistral v0.3
7B [50]. To represent code generation models, we choose CodeQwen1.5 7B [24]. As for the commercial
alternative, the representative is GPT-4o mini [51], the current cheapest model from OpenAI available
with API access. While we initially intended to include StarCoder 2 7B [52] as another open-source
code generation model, we were not able to obtain a reliable response structure compatible with the
classification task. Finally, our evaluation also includes PDBERT [13], one of the most promising
fine-tuned models for the vulnerability detection task. PDBERT is described in greater detail in Appendix
A. As the evaluation dataset, we choose Big-Vul [26], being one of the most widely-used datasets
among considered works. As mentioned in section 3, the open-source nature of this dataset makes the
evaluation potentially prone to bias on any tested LLM. We acknowledge this issue and present the
results as being produced by a “favorable testing environment”. Even in this setting, results on most
models are poor and we do not claim their usability in any critical setting. Furthermore, we observe
evidence of overfitting in the fine-tuned model, PDBERT. Due to the composition of this dataset, the
evaluation is focused on vulnerability detection in C/C++ code. We test all models on the same Big-Vul
test split, consisting of 31326 code snippets. 6684 of these samples contain a vulnerability, while the
remaining 24642 do not. The dataset associates a CWE ID to each vulnerable sample. Table 1 shows
a list of CWE IDs present in the dataset and their associated description. To avoid the requirement
of large context windows for used LLMs, we sampled the test split among snippets of no more than
5000 characters. Slight variations of the system prompt were tested at the prompt design stage. These
variations did not yield significantly different results.</p>
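      <p>For reference, 4-bit loading with the BitsAndBytes library typically looks like the configuration sketch below. The exact settings used in this study are not stated in the text; NF4 quantization and bfloat16 compute are common defaults, shown here as assumptions (running this requires a CUDA GPU and the transformers, bitsandbytes, and accelerate packages).</p>

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization configuration (assumed settings, not the paper's own):
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit integers
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantized data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used during matmuls
)

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
```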
      <sec id="sec-4-1">
        <title>5.1. Experimental Results</title>
        <p>RQ1: How effective are zero-shot solutions for vulnerability detection?
Results indicate that LLAMA 3.1 achieves the highest recall among the evaluated models. However,
it also exhibits the lowest precision, with a value of 0.24. Mistral v0.3 shows improved precision,
reaching 0.46, but suffers from significantly reduced recall. Although it achieves a high True Negative
Rate (specificity), reliably classifying non-vulnerable code as clean, the low recall suggests that this
performance arises more from a predisposition to predict “non-vulnerable” rather than from a nuanced
understanding of code. CodeQwen 1.5, despite being trained for code generation, performs poorly in
vulnerability detection, with precision and recall values of 0.38 and 0.56, respectively, demonstrating
unreliability comparable to other models evaluated. GPT-4o mini, the sole commercial model included
in this analysis, reaches a mediocre precision of 0.3 and specificity of 0.46. Ultimately, no model
demonstrates excellence in zero-shot vulnerability detection.</p>
        <p>Answer to RQ1
Zero-shot learning on small general-purpose and code-specialized models achieves poor results on the
task of vulnerability detection. The performance of some of these models is comparable to that of a
random guessing classifier.</p>
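        <p>The metrics quoted for RQ1 derive directly from confusion-matrix counts. The sketch below shows the relationship; the counts are made-up placeholders, not figures from this evaluation.</p>

```python
# How precision, recall, and specificity relate to raw confusion-matrix
# counts (tp/fp/tn/fn values below are invented placeholders).
def metrics(tp, fp, tn, fn):
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return precision, recall, specificity

p, r, s = metrics(tp=30, fp=70, tn=930, fn=70)
print(round(p, 2), round(r, 2), round(s, 2))  # 0.3 0.3 0.93
```

        <p>Note how a model biased toward answering "non-vulnerable" inflates specificity while depressing recall, which is exactly the pattern observed for Mistral v0.3.</p>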
        <sec id="sec-4-1-1">
          <title>RQ2: Do current fine-tuned models suffer from overfitting?</title>
          <p>Among the selected sample of models, PDBERT is the only one that underwent fine-tuning for the
detection task. The model yields high precision and specificity. These exceptionally high values raise
concerns about potential overfitting. Nonetheless, achieving such performance may suggest that the
pre-training objective based on program dependencies effectively enhances the model’s ability to fit
the training data [13]. We assess whether PDBERT learns meaningful patterns in vulnerable code by
testing it on a curated sample of code snippets, both vulnerable and non-vulnerable. Despite including
evident vulnerabilities, all samples were labeled as non-vulnerable. Detailed information on the samples
is provided in Appendix B. These results suggest that PDBERT’s performance on the Big-Vul test split
likely results from overfitting to the code style or other irrelevant patterns within the dataset.</p>
          <p>Answer to RQ2
PDBERT achieves high precision and specificity; however, further experiments raise concerns about
potential overfitting, rendering these metrics unreliable.</p>
          <p>RQ3: Is there a tendency for models to over-recognize specific CWEs?
When dealing with different types and distributions of CWE, it is important to consider the reliability
of the prediction for each code weakness. We evaluate which CWEs are more likely to be recognized
as vulnerable by the model. Additionally, when a false positive occurs, we follow up on the initial
prompt by asking the model which CWE-ID can describe the vulnerability. This was done on all models
except PDBERT, on which the evaluation is not applicable. Table 3 shows the detailed results in terms
of True Positive Rate (TPR), False Negative Rate (FNR), and False Positive Rate (FPR) for each model
and CWE-ID.</p>
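          <p>The per-CWE rates of Table 3 can be computed by grouping predictions by CWE-ID, as in the following sketch (our own; the records are invented placeholders, and we assume false positives carry the CWE-ID the model attributed in the follow-up question).</p>

```python
from collections import defaultdict

# Per-CWE TPR/FNR/FPR from (cwe_id, ground_truth, prediction) records.
def per_cwe_rates(records):
    counts = defaultdict(lambda: {"TP": 0, "FN": 0, "FP": 0, "TN": 0})
    for cwe, truth, pred in records:
        key = ("TP" if pred else "FN") if truth else ("FP" if pred else "TN")
        counts[cwe][key] += 1
    rates = {}
    for cwe, c in counts.items():
        pos = c["TP"] + c["FN"]   # vulnerable samples for this CWE
        neg = c["FP"] + c["TN"]   # clean samples attributed to this CWE
        rates[cwe] = {
            "TPR": c["TP"] / pos if pos else None,
            "FNR": c["FN"] / pos if pos else None,
            "FPR": c["FP"] / neg if neg else None,
        }
    return rates

records = [("CWE-787", True, True), ("CWE-787", True, False),
           ("CWE-254", True, False), ("CWE-254", False, False)]
print(per_cwe_rates(records)["CWE-787"]["TPR"])  # 0.5
```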
          <p>LLAMA 3.1. Results show that the model maintains a generally balanced TPR across CWEs. The
low overall precision is likely determined by the spike in False Positives caused by CWE-119 (Buffer
Overflow). Since this is a popular type of weakness in code, it is likely the model is providing this
answer as a result of statistical imbalance. Another less frequent False Positive is CWE-190 (Integer
Overflow), covering 8% of False Positives. The model classifies 61% of clean code samples as vulnerable,
indicating an overly cautious or alarmist behavior on the task.</p>
          <p>Mistral v0.3. Its best performance is observed on CWE-787 (Out-of-bounds write), which is correctly
identified in only one out of four instances. The most frequently missed is CWE-254 (Seven Pernicious Kingdoms),
with a FNR of 0.95. The elevated error rate for this category may stem from its ambiguous definition,
as it includes a broad range of more specific weaknesses related to access control, passwords, and
cryptography. FPR is not a concern in this case, as the model tends to over-classify as non-vulnerable.
CodeQwen 1.5. CodeQwen does not excel at recognizing any particular CWE, nor does it show any bias toward frequent
weaknesses.</p>
          <p>PDBERT. The TPR and FNR appear balanced across CWEs, which is consistent with expectations given
that the model was trained on a subset of the Big-Vul dataset’s training split, which includes these
common weaknesses. For this model, the FPR metric is not available, as follow-up questions to classify
CWEs were outside the training objective.</p>
          <p>GPT-4o mini. It shows a notably low detection rate for most common weaknesses, with CWE-476
(NULL Pointer Dereference) being a marginal exception (0.41), though its performance on this CWE
is also suboptimal. This CWE also shows a slight tendency toward false positives. False negatives are
common but particularly frequent for CWE-254 (Seven Pernicious Kingdoms), with an FNR of 0.71. The
reason behind this high FNR is likely the same as for Mistral v0.3. Four CWEs are never recognized, though
they are mostly “complex” weaknesses to detect.</p>
          <p>Answer to RQ3
Some models show bias toward specific CWEs, but none are consistently easy to classify. CWE-254 (Seven
Pernicious Kingdoms) causes false negatives in two models, though this is not seen in others.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>Automated vulnerability detection remains an open research challenge with the potential to provide
cost-effective solutions for small and medium-sized organizations. Current approaches leveraging Large
Language Models (LLMs) focus on zero-shot learning or fine-tuning to address this task. We reviewed
the most common choices in datasets and models as a reference for researchers approaching the topic.
We evaluated zero-shot classification for C/C++ vulnerability detection using popular small-sized LLMs
to assess their out-of-the-box utility for code vulnerability analysis. Additionally, we extended our
evaluation to include PDBERT, a state-of-the-art fine-tuned model. While PDBERT achieves impressive
evaluation metrics on the Big-Vul test split, our secondary tests on curated samples reveal that these
results are indicative of overfitting rather than genuine learning. The experimental evaluation highlights
the limitations of the current LLM solutions in understanding vulnerable code.</p>
      <p>Limitations and Future Work. This study evaluates cost-effective LLM solutions, focusing on small
quantized models and low-cost commercial alternatives. This is compatible with the aims of the work,
which are to support small- and medium-sized organizations that may lack the resources for more
expensive solutions. Overfitting in fine-tuned models was investigated using a limited sample set, which
nonetheless revealed that the models failed to detect obvious patterns, thereby highlighting the unreliability
of the test metrics. To validate these findings, comprehensive evaluations using larger datasets are
required. However, a critical limitation is the absence of evaluation datasets that are disjoint from the
training data of any tested model. Such separation is essential for accurately assessing downstream
performance. Yet achieving this remains challenging due to the continuous updates of LLM
training datasets sourced from internet scraping. Finally, future research should prioritize exploring
alternative model architectures, including Retrieval Augmented Generation (RAG). Current LLMs may
lack the necessary feature space to effectively represent and detect vulnerabilities, which are often
highly context-dependent and intricately associated with corrupted machine memory states.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Google.org Impact Challenge - Tech for Social Good Research
Grant (Tides Foundation). It was also partially supported by Project SETA (PNRR M4.C2.1.1 PRIN 2022
PNRR, Cod. P202233M9Z, CUP F53D23009120001, Avviso D.D 1409 14.09.2022) and Project FARE (PNRR
M4.C2.1.1 PRIN 2022, Cod. 202225BZJC, CUP D53D23008380006, Avviso D.D 104 02.02.2022). Both
projects are under the Italian NRRP MUR program funded by the European Union - NextGenerationEU.
Finally, the work was also partially supported by project SERICS (PE00000014) under the MUR National
Recovery and Resilience Plan funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>While preparing this work, the authors used GPT-4o, Gemini 2.0 Flash, and GPT-4o-mini for sentence
polishing and rephrasing. After using these services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      <p>[27] S. Chakraborty, R. Krishna, Y. Ding, B. Ray, Deep learning based vulnerability detection: Are we
there yet?, IEEE Transactions on Software Engineering 48 (2022) 3280–3296. doi: 10.1109/TSE.
2021.3087402.
[28] Y. Zhou, S. Liu, J. Siow, X. Du, Y. Liu, Devign: Effective vulnerability identification by learning
comprehensive program semantics via graph neural networks, in: H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing
Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper_
files/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf.
[29] G. Bhandari, A. Naseer, L. Moonen, Cvefixes: automated collection of vulnerabilities and their fixes
from open-source software, in: Proceedings of the 17th International Conference on Predictive
Models and Data Analytics in Software Engineering, PROMISE 2021, Association for Computing
Machinery, New York, NY, USA, 2021, p. 30–39. URL: https://doi.org/10.1145/3475960.3475985.
doi:10.1145/3475960.3475985.
[30] N. Tihanyi, T. Bisztray, R. Jain, M. A. Ferrag, L. C. Cordeiro, V. Mavroeidis, The FormAI Dataset:
Generative AI in Software Security through the Lens of Formal Verification, in: Proceedings of the
19th International Conference on Predictive Models and Data Analytics in Software Engineering,
PROMISE 2023, Association for Computing Machinery, New York, NY, USA, 2023, pp. 33–43.
URL: https://doi.org/10.1145/3617555.3617874. doi:10.1145/3617555.3617874, event-place: San
Francisco, CA, USA.
[31] G. Lu, X. Ju, X. Chen, W. Pei, Z. Cai, GRACE: Empowering LLM-based software vulnerability
detection with graph structure and in-context learning, Journal of Systems and Software 212 (2024)
112031. URL: https://www.sciencedirect.com/science/article/pii/S0164121224000748. doi:10.1016/
j.jss.2024.112031.
[32] Y. Sun, D. Wu, Y. Xue, H. Liu, W. Ma, L. Zhang, M. Shi, Y. Liu, LLM4Vuln: A Unified Evaluation
Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning, 2024. URL: https://arxiv.org/abs/2401.16185, _eprint: 2401.16185.
[33] R. Meng, M. Mirchev, M. Böhme, A. Roychoudhury, Large language model guided protocol fuzzing,
in: Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS),
2024.
[34] P. Liu, C. Sun, Y. Zheng, X. Feng, C. Qin, Y. Wang, Z. Li, L. Sun, Harnessing the Power of LLM to</p>
      <p>Support Binary Taint Analysis, 2023. URL: https://arxiv.org/abs/2310.08275, _eprint: 2310.08275.
[35] N. S. Mathews, Y. Brus, Y. Aafer, M. Nagappan, S. McIntosh, LLbezpeky: Leveraging Large
Language Models for Vulnerability Detection, 2024. URL: https://arxiv.org/abs/2401.01269, _eprint:
2401.01269.
[36] H. Li, Y. Hao, Y. Zhai, Z. Qian, Enhancing Static Analysis for Practical Bug Detection: An
LLM-Integrated Approach, Proc. ACM Program. Lang. 8 (2024). URL: https://doi.org/10.1145/
3649828. doi:10.1145/3649828, place: New York, NY, USA Publisher: Association for Computing
Machinery.
[37] B. Berabi, A. Gronskiy, V. Raychev, G. Sivanrupan, V. Chibotaru, M. Vechev, DeepCode AI Fix:
Fixing Security Vulnerabilities with Large Language Models, 2024. URL: http://arxiv.org/abs/2402.
13291. doi:10.48550/arXiv.2402.13291, arXiv:2402.13291 [cs].
[38] H. Li, Y. Hao, Y. Zhai, Z. Qian, Assisting Static Analysis with Large Language Models: A ChatGPT
Experiment, in: Proceedings of the 31st ACM Joint European Software Engineering Conference
and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, Association for
Computing Machinery, New York, NY, USA, 2023, pp. 2107–2111. URL: https://doi.org/10.1145/
3611643.3613078. doi:10.1145/3611643.3613078, event-place: San Francisco, CA, USA.
[39] T. K. Le, S. Alimadadi, S. Y. Ko, A Study of Vulnerability Repair in JavaScript Programs with
Large Language Models, in: Companion Proceedings of the ACM on Web Conference 2024,
WWW ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 666–669. URL:
https://doi.org/10.1145/3589335.3651463. doi:10.1145/3589335.3651463, event-place: Singapore,
Singapore.
[40] Y. Nong, M. Aldeen, L. Cheng, H. Hu, F. Chen, H. Cai, Chain-of-Thought Prompting of Large
Language Models for Discovering and Fixing Software Vulnerabilities, 2024. URL: http://arxiv.org/
abs/2402.17230. doi:10.48550/arXiv.2402.17230, arXiv:2402.17230 [cs].
[41] D. Hidvégi, K. Etemadi, S. Bobadilla, M. Monperrus, CigaR: Cost-efficient Program
Repair with LLMs, 2024. URL: http://arxiv.org/abs/2402.06598. doi:10.48550/arXiv.2402.06598,
arXiv:2402.06598 [cs].
[42] R. Fang, R. Bindu, A. Gupta, D. Kang, Llm agents can autonomously exploit one-day vulnerabilities,
arXiv preprint arXiv:2404.08144 (2024).
[43] I. Bouzenia, P. Devanbu, M. Pradel, RepairAgent: An Autonomous, LLM-Based Agent for
Program Repair, 2024. URL: http://arxiv.org/abs/2403.17134. doi:10.48550/arXiv.2403.17134,
arXiv:2403.17134 [cs].
[44] N. T. Islam, M. B. Karkevandi, P. Najafirad, Code Security Vulnerability Repair Using Reinforcement
Learning with Large Language Models, 2024. URL: http://arxiv.org/abs/2401.07031. doi:10.48550/
arXiv.2401.07031, arXiv:2401.07031 [cs].
[45] J. Wang, L. Cao, X. Luo, Z. Zhou, J. Xie, A. Jatowt, Y. Cai, Enhancing Large Language Models
for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation, 2023. URL:
http://arxiv.org/abs/2310.16263. doi:10.48550/arXiv.2310.16263, arXiv:2310.16263 [cs].
[46] A. Shestov, R. Levichev, R. Mussabayev, E. Maslov, A. Cheshkov, P. Zadorozhny, Finetuning
Large Language Models for Vulnerability Detection, 2024. URL: http://arxiv.org/abs/2401.17010.
doi:10.48550/arXiv.2401.17010, arXiv:2401.17010 [cs].
[47] N. T. Islam, J. Khoury, A. Seong, M. B. Karkevandi, G. D. L. T. Parra, E. Bou-Harb, P. Najafirad,
LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward,
2024. URL: http://arxiv.org/abs/2401.03374. doi:10.48550/arXiv.2401.03374, arXiv:2401.03374
[cs].
[48] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Debbah, T. Lestable, N. S. Thandi,
Revolutionizing Cyber Threat Detection With Large Language Models: A Privacy-Preserving
BERT-Based Lightweight Model for IoT/IIoT Devices, IEEE Access 12
(2024) 23733–23750. doi:10.1109/ACCESS.2024.3363469.
[49] Bitsandbytes Foundation, Bitsandbytes, 2024. URL: https://github.com/bitsandbytes-foundation/bitsandbytes,
Python library for quantizing LLMs.
[50] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F.
Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao,
T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825.
arXiv:2310.06825.
[51] OpenAI, GPT-4o mini, 2024. URL: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/,
large language model.
[52] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei,
T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li,
W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain,
Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone,
C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley,
H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh,
Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, H. de Vries,
Starcoder 2 and the stack v2: The next generation, 2024. URL: https://arxiv.org/abs/2402.19173.
arXiv:2402.19173.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Analyzed Sample of Works</title>
      <p>
        This section details the body of work analyzed to determine the most popular datasets and LLM models
employed in current literature. Table 4 associates each work with the models and datasets it uses.
• Steenhoek et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: Review of state-of-the-art LLMs for software vulnerability detection,
evaluating various prompting techniques.
• Tihanyi et al. [30]: Introduce ESBMC-AI, integrating LLMs with Bounded Model Checking (BMC) for vulnerability detection
and repair, leveraging the FormAI dataset.
• Le et al. [39]: Evaluate ChatGPT and Bard for JavaScript vulnerability repair using zero-shot
prompting across 20 vulnerabilities.
• Nong et al. [40]: Propose VSP, using Chain-of-Thought prompting for analyzing and patching
      </p>
      <p>
        C/C++ vulnerabilities.
• Li et al. [36]: Introduce LLift, integrating LLMs with static analysis for Use Before Initialization
detection in Linux kernel code.
• Shestov et al. [46]: Explore fine-tuning LLMs like WizardCoder for Java vulnerability detection
using efficient strategies like batch packing.
• Lu et al. [31]: Present GRACE, enhancing LLMs with graph structures (AST, PDG, CFG) for
improved C/C++ vulnerability detection.
• Mathews et al. [35]: Explore GPT-4 with Retrieval Augmented Generation for Android app
vulnerabilities using the Ghera benchmark.
• Fang et al. [42]: Claim that GPT-4 exploits 87% of one-day vulnerabilities when given CVE descriptions,
outperforming GPT-3.5 and vulnerability scanners.
• Sun et al. [32]: Introduce LLM4Vuln, a framework for evaluating LLMs’ vulnerability reasoning,
isolating external aids.
• Liu et al. [13]: Introduce PDBERT for C/C++ vulnerability detection, leveraging pre-training on
2.28M functions with extracted data and control dependencies.
• Ferrag et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: Present SecureFalcon, a lightweight model for C/C++ vulnerabilities, with
      </p>
      <p>
        FalconVulnDB to enhance training.
• Risse and Böhme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: Investigate ML4VD limitations like overfitting, proposing VulnPatchPairs
for improved C vulnerability evaluation.
• Hanif and Maffeis [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: Introduce VulBERTa, a RoBERTa-based model pre-trained and fine-tuned
for C/C++ vulnerability detection.
• Wu and Zhang [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: Examine ChatGPT’s strengths and weaknesses in seven software security
tasks across different versions.
      </p>
    </sec>
    <sec id="sec-9">
      <title>B. PDBERT Secondary Analysis</title>
      <p>In this section, we detail the curated set of code samples used for the secondary evaluation
of PDBERT. These functions are designed to test whether the model actually recognizes patterns
associated with vulnerable code. All of the tested samples (both vulnerable and non-vulnerable) were
predicted to be non-vulnerable by the model. Table 6 provides more details on each of the curated
samples.</p>
      <sec id="sec-9-1">
        <title>B.1. Code Samples</title>
        <p>We present a selection of code snippets provided to the model to demonstrate the simplicity of the
patterns it failed to identify. This inclusion also aims to support reproducibility and enable a deeper
analysis of the model’s behavior.</p>
        <p>Table 4, reconstructed from the flattened two-column layout. The row-to-work alignment was partly lost in extraction, so each row lists the models used, then the datasets:
• GPT-3.5, GPT-4, Gemini 1.0 Pro, WizardCoder, Code LLAMA, Mixtral-MoE, Mistral, StarCoder, LLAMA 2, StarChat-β, MagiCoder; datasets: SVEN
• GPT-4o; datasets: FormAI
• GPT-3.5; datasets: Custom
• GPT-3.5, GPT-4; datasets: Linux Kernel
• GPT-3.5, LLAMA 2, Falcon; datasets: SARD, Big-Vul
• CodeGeeX, WizardCoder, CodeGen, ContraBERT; datasets: CVEFixes, VCMatch, Custom
• GPT-4; datasets: Reveal, Big-Vul, FFmpeg, QEMU
• GPT-4; datasets: Ghera
• GPT-4, GPT-3.5, OpenHermes 2.5 Mistral, LLAMA 2, Mixtral-MoE, Mistral, Nous Hermes 2 Yi, OpenChat 3.5; datasets: Custom
• GPT-4, Mixtral-MoE, Code LLAMA; datasets: (not recoverable from extraction)
• CodeBERT; datasets: VulDeePecker, Draper, Reveal, muVuldeepecker, Devign, D2A
• Falcon; datasets: Custom
• Wu and Zhang: GPT-3.5, GPT-4; datasets: (not recoverable from extraction)</p>
        <p>B.1.1. Sample 1 - Simple Heap Overflow
void process_client_request(char *input) {
    char *heap_buffer = (char *)malloc(256);
    if (heap_buffer == NULL) {
        printf("Memory allocation failed\n");
        return;
    }
    strcpy(heap_buffer, input);
    printf("Data received: %s\n", heap_buffer);
    free(heap_buffer);
}
B.1.2. Sample 4 - Simple Stack Buffer Overflow
void toast(char *input) {
    char buffer[16];
    int offset = 0;
    offset = get_offset();
    buffer += offset;
    strncpy(buffer, input, 16);
    printf("Data received: %s\n", buffer);
}
B.1.3. Sample 6 - Simple Use-After-Free
char* copy(char *input) {
    char *heap_buffer = (char *)malloc(256);
    if (heap_buffer == NULL) {
        printf("Memory allocation failed\n");
        return NULL;
    }
    strncpy(heap_buffer, input, 256);
    printf("Data received: %s\n", heap_buffer);
    free(heap_buffer);
    return heap_buffer;
}</p>
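<p>For reference, the sketch below (ours; this fixed variant is not part of the curated set) shows how the heap overflow in Sample 1 can be removed: bounding the copy with an explicit length check makes the strcpy provably in bounds.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Our sketch of a patched Sample 1 (hypothetical, not from the paper's
 * sample set): rejecting oversized inputs removes the heap overflow
 * while preserving the function's behavior for valid inputs. */
static int process_client_request_fixed(const char *input) {
    char *heap_buffer = malloc(256);
    if (heap_buffer == NULL) {
        printf("Memory allocation failed\n");
        return -1;
    }
    if (strlen(input) >= 256) {   /* input would not fit: reject it */
        free(heap_buffer);
        return -1;
    }
    strcpy(heap_buffer, input);   /* provably in bounds after the check */
    printf("Data received: %s\n", heap_buffer);
    free(heap_buffer);
    return 0;
}
```

A fixed variant like this mirrors the role of samples 5, 7, and 10, which are patched versions of the vulnerable snippets.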
        <sec id="sec-9-1-1">
          <title>B.1.4. Sample 8 - Suspicious Code</title>
          <p>void unsafe_handling(char *input) {
    char buffer[16];
    memcpy(buffer, input, 16);
    printf("Data received: %s\n", buffer);
}</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>B.1.5. Sample 9 - Format String</title>
          <p>B.1.6. Sample 11 - Integer Overflow
void execute_transaction(Account *account, int num_items, int price_per_item) {
    if (num_items &lt; 0 || price_per_item &lt; 0) {
        printf("Invalid input\n");
        return;
    }
    int total = num_items * price_per_item;
    printf("Total price: %d\n", total);
    account->balance -= total;
}</p>
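<p>To illustrate what a CWE-190 fix for Sample 11 would look like, here is a hedged sketch of ours (not part of the curated set; the Account struct definition is not shown in the paper, so the minimal stand-in below is an assumption): checking the product against INT_MAX before multiplying prevents the signed wrap.</p>

```c
#include <limits.h>
#include <stdio.h>

/* Minimal stand-in for the paper's Account struct (field layout is an
 * assumption for illustration only). */
typedef struct { long balance; } Account;

/* Our sketch of an overflow-safe variant of Sample 11: validating the
 * product before multiplying prevents the CWE-190 wrap present in the
 * original execute_transaction(). */
static int execute_transaction_fixed(Account *account, int num_items, int price_per_item) {
    if (num_items < 0 || price_per_item < 0) {
        printf("Invalid input\n");
        return -1;
    }
    if (price_per_item != 0 && num_items > INT_MAX / price_per_item)
        return -1;              /* num_items * price_per_item would overflow */
    int total = num_items * price_per_item;
    printf("Total price: %d\n", total);
    account->balance -= total;
    return 0;
}
```

The division-based guard is the standard idiom for pre-checking signed multiplication without triggering undefined behavior.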
          <p>Table 6, reconstructed from the flattened layout (Index, Ref, CWE-ID, Description):
1. Simple Heap Overflow (CWE-119): a very recognizable heap overflow (10 lines of code).
2. WebP Heap Buffer Overflow, CVE-2023-4863 (Not Vulnerable): the committed fix for the infamous libwebp heap overflow in Huffman table handling. It was included to evaluate whether the model recognizes the code from the CVE rather than identifying actual vulnerabilities; a correctly functioning model should classify the fixed code as non-vulnerable.
3. SECCON 2024 /pwn challenge "free-free-free" (Mixed): a recent CTF challenge containing multiple weaknesses that make it vulnerable.
4. Simple Stack Buffer Overflow (CWE-119): a very recognizable stack buffer overflow (8 lines of code).
5. (Not Vulnerable): a fixed version of sample 4.
6. Simple Use-After-Free (CWE-416): a very recognizable use-after-free (11 lines of code).
7. (Not Vulnerable): a fixed version of sample 6.
8. Suspicious Code (Not Vulnerable): a non-vulnerable code snippet that uses names and keywords (e.g., "unsafe") that may trick the LLM into flagging the function. Its purpose is to verify whether the LLM associates function and variable names with vulnerabilities instead of actual code weaknesses.
9. Format String (CWE-134): a very recognizable format string vulnerability (6 lines of code).
10. (Not Vulnerable): a fixed version of sample 9.
11. Integer Overflow (CWE-190): a very recognizable integer overflow vulnerability (9 lines of code).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>NIST</surname>
          </string-name>
          , National vulnerability database dashboard,
          <year>2025</year>
          . URL: https://nvd.nist.gov/general/nvd-dashboard.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Digregorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bertolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Panebianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polino</surname>
          </string-name>
          ,
          <article-title>Poster: libdebug, build your own debugger for a better (hello) world</article-title>
          ,
          <source>in: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security</source>
          , CCS '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>4976</fpage>
          -
          <lpage>4978</lpage>
          . URL: https://doi.org/10.1145/3658644.3691391. doi:10.1145/3658644.3691391.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>An empirical study of vulnerability discovery methods over the past ten years</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>120</volume>
          (
          <year>2022</year>
          )
          102817
          . URL: https://www.sciencedirect.com/science/article/pii/S0167404822002115. doi:10.1016/j.cose.2022.102817.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond</article-title>
          ,
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>18</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3649506. doi:10.1145/3649506. Place: New York, NY, USA. Publisher: Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nabeshima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Presser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leahy</surname>
          </string-name>
          , The Pile:
          <article-title>An 800gb dataset of diverse text for language modeling</article-title>
          ,
          <source>arXiv preprint arXiv:2101.00027</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , et al.,
          <article-title>Exploring the limits of chatgpt in software security applications</article-title>
          ,
          <source>arXiv preprint arXiv:2312.05275</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>Large Language Model for Vulnerability Detection and Repair: Literature Review</article-title>
          and Roadmap,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2404.02525, arXiv:2404.02525 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Steenhoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.17218. doi:10.48550/arXiv.2403.17218, arXiv:2403.17218 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hanif</surname>
          </string-name>
          , S. Maffeis,
          <article-title>VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection</article-title>
          , in: 2022
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: http://arxiv.org/abs/2205.12424. doi:10.1109/IJCNN55064.2022.9892280, arXiv:2205.12424 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ferrag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Battah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tihanyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Debbah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lestable</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Cordeiro</surname>
          </string-name>
          ,
          <source>SecureFalcon: The Next Cyber Reasoning System for Cyber Security</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2307.06616. doi:10.48550/arXiv.2307.06616, arXiv:2307.06616 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Risse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Böhme</surname>
          </string-name>
          ,
          <article-title>Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection</article-title>
          ,
          <source>in: USENIX Security Symposium</source>
          ,
          <year>2024</year>
          , p.
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>