                         Predicting Source Code Vulnerabilities Using Deep
                         Learning: A Fair Comparison on Real Data
                         Vincenzo Carletti1,* , Pasquale Foggia1,* , Alessia Saggese1 and Mario Vento1
                         1
                          Dept. of Computer Engineering, Electrical Engineering and Applied Mathematics, University of Salerno, Via Giovanni Paolo II,
                         132, 84084, Fisciano (SA), Italy


                                      Abstract
                                      In the context of software development, the detection of vulnerabilities within source code is a paramount concern,
                                      especially for programming languages like C and C++ that are widely used in mission-critical applications,
                                      operating systems and embedded software. Traditional approaches to detecting vulnerabilities in source code
                                      often struggle due to their reliance on hand-crafted rules and pattern matching, which can lead to high rates of
                                      false positives and require a considerable effort by human experts. Additionally, the evolving nature of software
                                      development practices and the increasing sophistication of cyber threats constantly challenge traditional systems,
                                      making them less and less useful over time. In this paper we explore the effectiveness of state-of-the-art deep
                                      learning methods in identifying vulnerabilities within C/C++ source code of real-world software projects. We
                                      have conducted a comprehensive analysis comparing basic deep learning methods used for text processing
                                      against more advanced architectures, including Transformers and Graph Neural Networks (GNNs), aiming to
                                      provide a reliable benchmark for evaluating vulnerability detection approaches. To this purpose we have prepared
                                      a large dataset, combining and normalizing data from several publicly available code datasets extracted from
                                      well-known open-source software projects, namely Big-Vul, DiverseVul, Devign and ReVeal. The results of the
                                      analysis provide insights about the complexity of the task at hand when faced in a realistic setup and suggest
                                      some challenges and promising research directions to use the most recent deep learning models.

                                      Keywords
                                      Vulnerability Detection, Deep Learning, Graph Neural Networks




                         1. Introduction
                         Software vulnerabilities are, together with misconfigurations, among the most relevant issues in
                         cybersecurity due to the potential impact they can have on the security of systems and networks, leading
                         to significant financial losses, reputation damage, and also threats to physical security when the system
                         under attack is mission critical.
                            Statistics from the National Vulnerability Database (NVD) [1] indicate that in the period from 2021
                         to 2022, there were over 33,000 publicly disclosed vulnerabilities; of these, over 7,000 were rated as
                         having High or Critical severity according to the Common Vulnerability Scoring System (CVSS).
                            The search for vulnerabilities is commonly performed by domain experts, often with the help of
                         tools that perform a static or dynamic analysis of the program. Static analysis tools examine the source
                         code (or, less often, the binary code) of the program without executing it, looking for dangerous code
                         patterns [2, 3, 4, 5]. Dynamic analysis tools, instead, execute the program in a controlled environment,
                         detecting dangerous behaviors. Both static and dynamic analysis tools are traditionally based on the
                         use of hand-crafted rules, devised by domain experts, and may need a careful tuning to balance missed
                         detections and false positives. Hybrid approaches, that combine static and dynamic analysis, are also
                         available but they often suffer from the same drawbacks [6, 7].
                            In the current cybersecurity scenario, where malicious users incessantly devise new ways to exploit
                         known or previously undiscovered vulnerabilities (the so-called zero day vulnerabilities), companies
                         are strongly interested in the research and development of effective and accurate tools for software

                          ITASEC 2024: The Italian Conference on CyberSecurity
                         *
                           Corresponding author.
 vcarletti@unisa.it (V. Carletti); pfoggia@unisa.it (P. Foggia); asaggese@unisa.it (A. Saggese); mvento@unisa.it (M. Vento)
                           0000-0002-9130-5533 (V. Carletti); 0000-0002-7096-1902 (P. Foggia); 0000-0003-4687-7994 (A. Saggese);
 0000-0002-2948-741X (M. Vento)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


vulnerability detection. For instance, big software manufacturers periodically promote bug hunting
campaigns where experts are invited, in exchange for a reward, to search for software bugs and vulnerabilities.
Additionally, with software growing in complexity, it is becoming more challenging to find, manage
and patch bugs, making it difficult to keep pace with the rapid rise of new vulnerabilities [8, 9].
   While rule-based approaches struggle to cope with such a challenging context, machine learning
methodologies have been shown to achieve remarkable performance [10]. Nevertheless, to accomplish
this goal they still need the involvement of domain experts to design a suitable representation of the
program in terms of measurable features, that can be fed as input to a machine learning algorithm.
This process, commonly named feature engineering, is time-consuming and its effectiveness strongly
relies on the knowledge of the expert on both the specific problem and the adopted machine learning
method [11, 12, 13]. For this reason, the scientific community has recently moved the focus to deep
learning approaches, capable of autonomously learning the best representation from the data. Such
methods reduce the need for domain expert involvement, and limit it mainly to data preparation and
validation phases, since their performance heavily depends on the quantity and quality of the data
and on the adopted training strategy [14]. In general, deep learning approaches have proven
to be more effective and accurate, as well as potentially capable of generalizing to new, unseen kinds of
vulnerabilities [15, 16, 10, 17].
   Among the deep learning models, several recent approaches are based on Graph Neural Networks
(GNNs) [18, 19, 20, 21, 22, 23], i.e. neural networks designed to process data structured as graphs instead
of vectors. The idea behind these methods lies in the fact that source code can be naturally represented
as a graph, for instance using Abstract Syntax Trees (ASTs) or Control Flow Graphs (CFGs), and that
the adoption of graph-based representations makes it possible to take into account both semantic and syntactic
structures [24].
   The current literature lacks standard benchmarks and datasets that are representative of the problem at
hand, as highlighted in a recent work by Chakraborty et al. [20]. Specifically, many datasets are designed
for educational or demonstrative purposes, so they are composed of artificially created vulnerable
functions. On the other hand, those datasets that contain real code are often noisy, biased and inaccurate
since the labelling is performed by using existing tools for static analysis without manually reviewing
their output, so they lead to machine learning models that are biased and excessively prone to false
positives. Furthermore, data are naturally unbalanced, with a large amount of negative samples (i.e.
non-vulnerable code), thus causing the model to shift towards false negatives.
   When the desired output is not limited to the presence or absence of a vulnerability, but includes
the determination of the kind of vulnerability, all these issues are amplified due to both a strongly
unbalanced distribution of the classes (some kinds of vulnerabilities are much rarer than others) and an
inherent degree of ambiguity in the classification. These problems bring others with them, first of all
the difficulty of having a fair and clear assessment and comparison of deep learning methods for the
problem at hand. Therefore, the main purpose of this paper is to provide an analysis aimed at assessing
if deep learning approaches can be effectively exploited to detect vulnerabilities in C/C++ source code
extracted from real software projects. Previous analyses of some state-of-the-art deep learning methods
have been proposed in [18, 20].
   Differently from them, we have built a significantly larger set of data composed of 396,130 samples,
by thoroughly revising and cleaning the data belonging to four public datasets of source code extracted
from open-source projects: Big-Vul [26], DiverseVul [27], Devign [18] and ReVeal [20]. In our analysis,
we have selected and compared basic deep learning methods with more complex neural network
architectures such as Transformers and GNNs, using a demanding and realistic experimental setup. In
particular, we evaluated these models in scenarios where the test set samples originated from software
projects whose code was not in the training set.
   This paper is organized as follows. In Section 2 we will provide an overview of state-of-the-art
approaches and representations proposed for source code analysis and vulnerability detection tasks. In
Section 3 we will present the experimental setup describing the dataset and introducing the methods
considered in the analysis. In Section 4 we will present and discuss the results obtained. Finally, in
Section 5 we will provide some conclusions and future directions.
2. Vulnerability Prediction Through Deep Learning
A vulnerability is an occurrence of a weakness in a piece of software, where a weakness is a mistake
(usually in the implementation or in the design of the code) that may make a security threat possible.
For example, accessing an array with an index outside of the allowed bounds is a weakness; the presence
of this weakness inside a particular subroutine of a program is a vulnerability. In order to facilitate the
communication among security practitioners, known weaknesses are listed in a community-maintained
catalog, the Common Weakness Enumeration (CWE) where each weakness has a unique identifier and
a detailed description. Also, the CVE Program (Common Vulnerabilities and Exposures) maintains a list
of publicly disclosed vulnerabilities, that are made available for consultation through several services;
among them, the National Vulnerability Database (NVD) [1].
   The problem of detecting a vulnerability can be formulated in different ways on the basis of the
desired result: it can be considered as a binary classification task if we only need to know whether
a piece of code contains a vulnerability or not; it becomes a multi-label classification if, in addition,
we are interested in which particular weakness causes the vulnerability (for instance, reporting the
corresponding CWE identifier). In this Section, we will introduce the most recent approaches and
representations to address the task by considering both those derived from the Natural Language
Processing (NLP) field and those relying on graph-based methods.

2.1. Natural Language Processing Approaches
Traditionally, code analysis is addressed as a Natural Language Processing (NLP) problem where code
snippets are transformed into sequences of words, a.k.a. tokens. Each word is represented as a vector in a high-
dimensional space through word embedding methods [28]; thus, a code snippet corresponds to a
sequence of embedding vectors. In several methods, these sequences are then processed by a Recurrent
sequence of embedding vectors. In several methods, these sequences are then processed by a Recurrent
Neural Network (RNN) to learn dependencies among the elements in the sequence and to encode them
in an enriched vector representation; the latter is finally exploited by a neural classifier (for instance, a
Multi-Layer Perceptron) to predict if the original code snippet is vulnerable, and possibly to assign a
label representing the CWE identifier. In more recent methods, the first part of this process is performed
by replacing RNNs with Transformers [25], advanced deep neural architectures for sequence processing
based on an attention mechanism [29], which allows the network to learn which parts of the input
sequence are most relevant for determining the output. µVulDeePecker, introduced by
Zou et al. [30], is a notable and recent example of a method based on Long Short-Term Memory (LSTM),
a modified version of an RNN. The LSTM is used to merge and relate global and local features extracted
from the source code of a program. The latter is initially decomposed into code gadgets, i.e. semantically
related statements, not necessarily consecutive, that satisfy data and control dependency relations, to
capture global features. In addition to gadgets, the authors also introduce the concept of code attention,
inspired by the concept of regions of interest in image processing, that aims to capture regions of
code that can be relevant to detect a specific vulnerability, like API function calls, control statements
or statements referring to libraries. More recently, the authors of [17] have proposed VulBERTa, a
transformer-based approach designed to learn a deep representation of C/C++ code, capturing both
syntactic and semantic elements. The neural network is built upon RoBERTa, a language model proposed
in [31] as an extension of the well-known BERT [25]. The proposed architecture is trained in two phases:
pre-training, where the network learns a first representation from the unlabeled code examples, using
RoBERTa; and fine-tuning, that refines this representation to perform the classification task jointly with
a Multilayer Perceptron (MLP) or a Text Convolutional Neural Network (TextCNN); the two versions
are named VulBERTa-MLP and VulBERTa-TextCNN, respectively.
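As a purely illustrative sketch of this pipeline (our own example, not code from the cited papers), the snippet below tokenizes a C function with a naive regular-expression lexer and trains a small word2vec model with gensim, so that the function becomes a sequence of embedding vectors ready to be fed to an RNN or a Transformer; the tokenizer and all hyper-parameters are assumptions.

import re
import numpy as np
from gensim.models import Word2Vec

# Naive lexer: identifiers/keywords, numbers, single punctuation symbols.
# A real system would rely on a proper C/C++ tokenizer; this is only an assumption.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code):
    return TOKEN_RE.findall(code)

functions = [
    "int foo(int x) { int y; if (x > 0) { y = x * 2; } else { y = -x; } return y; }",
    'void bar(char *p) { strcpy(p, "hi"); }',
]
corpus = [tokenize(f) for f in functions]

# Toy word2vec model trained on the token sequences (illustrative hyper-parameters).
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, epochs=20)

# A code snippet then becomes a sequence of embedding vectors.
embedded = np.stack([w2v.wv[t] for t in corpus[0]])
print(embedded.shape)   # (number_of_tokens, 64)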

2.2. Graph-Based Approaches
Although NLP approaches have been demonstrated to achieve remarkable performance in processing
software programs as text, they can neglect syntactic structures and relations among different parts
int foo(int x) {
    int y;
    if (x > 0) {
        y = x * 2;
    } else {
        y = -x;
    }
    return y;
}




Figure 1: Example of a CPG extracted from the source code on the left. The graph contains AST edges in
black, CFG edges in red and PDG edges in blue. Data dependencies are represented with dashed lines, while
control dependencies with solid lines.


of the source code. Differently, graph-based representations can be more effective in capturing both
semantic and syntactic structures [24]. This is motivated by the fact that a software program can be
naturally represented as a graph; actually, several graph-based representations can be used, depending
on the level of abstraction on which we want to focus. Indeed, each entity in the source code, from
single keywords and individual statements to entire subroutines, can be represented as a node, and
every kind of relationship between such entities can be represented as an edge. How we represent the
source code depends on the specific purpose and affects the way we interpret entities and relationships in
the graph; therefore, different graph-based representations of the source code have been proposed in the
last decade for the purpose of automatic learning of code properties; among them, the most commonly
adopted are Abstract Syntax Trees (ASTs) [32], Control Flow Graphs (CFGs) [33] and Program Dependency
Graphs (PDGs) [34]. Since each of these representations has limits in its expressiveness, which can lead
to an inaccurate characterization of the vulnerability in complex scenarios, in [7] the authors propose
to merge all of them into the Code Property Graph (CPG), so as to provide a comprehensive graph to
analyze the source code at multiple levels of abstraction simultaneously. In Figure 1 we show how a
code snippet is represented using a CPG.
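To make the structure of Figure 1 more concrete, the toy sketch below (our own illustration, not the output format of any specific tool) builds a small multigraph for the function foo with networkx, labelling each edge as AST, CFG or PDG; the statement-level granularity of the nodes is a simplifying assumption.

import networkx as nx

# Toy CPG for: int foo(int x) { int y; if (x > 0) { y = x * 2; } else { y = -x; } return y; }
cpg = nx.MultiDiGraph()
cpg.add_nodes_from(["entry", "decl_y", "if_x_gt_0", "y_eq_2x", "y_eq_neg_x", "return_y"])

# AST edges (black in Figure 1): syntactic containment.
for child in ["decl_y", "if_x_gt_0", "return_y"]:
    cpg.add_edge("entry", child, kind="AST")
cpg.add_edge("if_x_gt_0", "y_eq_2x", kind="AST")
cpg.add_edge("if_x_gt_0", "y_eq_neg_x", kind="AST")

# CFG edges (red): possible execution order.
cpg.add_edge("decl_y", "if_x_gt_0", kind="CFG")
cpg.add_edge("if_x_gt_0", "y_eq_2x", kind="CFG")
cpg.add_edge("if_x_gt_0", "y_eq_neg_x", kind="CFG")
cpg.add_edge("y_eq_2x", "return_y", kind="CFG")
cpg.add_edge("y_eq_neg_x", "return_y", kind="CFG")

# PDG edges (blue): data dependencies on y, control dependencies on the if condition.
cpg.add_edge("y_eq_2x", "return_y", kind="PDG_data")
cpg.add_edge("y_eq_neg_x", "return_y", kind="PDG_data")
cpg.add_edge("if_x_gt_0", "y_eq_2x", kind="PDG_control")
cpg.add_edge("if_x_gt_0", "y_eq_neg_x", kind="PDG_control")

print(cpg.number_of_nodes(), cpg.number_of_edges())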
   Over the last decades, several methods have been proposed to apply machine learning to
graphs [35, 36, 37]. Nowadays, state-of-the-art approaches are based on GNNs [38], which are neural
networks able to work directly on graph data, without the need to previously construct a feature
vector summarizing the information contained in the graph. GNNs can be trained to produce a vector
representation associated with each node, each edge and/or the entire graph (i.e. a node, edge or graph
embedding), where the vectors encode both the semantic and the structural information contained in
the graph (e.g. the neighbors of a node); these embeddings can then be used by another neural network
to perform different tasks, such as edge and node prediction or graph classification. The latter is the most
common formalization for the vulnerability prediction problem.
   The first method to address the task at hand using a GNN was Devign [18]. The architecture of
this GNN is composed of three layers: a code embedding layer that turns the source code into a graph,
followed by a sequence of Graph RNN layers (a.k.a. Gated Graph Recurrent Layers), used to learn a new
vector representation for the nodes, and a final convolutional layer that processes the matrix containing
the node vectors produced by the previous layer.
   Subsequently, Chakraborty et al. [20] proposed ReVeal. It first transforms the source code
into a CPG; the code snippets contained in the nodes of the CPGs are then embedded in vectors
using word2vec [28], a neural network designed for word embedding. For each node, the embeddings
of the corresponding tokens are concatenated, getting the initial node representation. Finally, the
obtained graph is fed into a GNN to get a vector representation of the entire graph, that is used for the
classification.

2.3. Hybrid Approaches
A further approach that can be found in the recent literature is to combine Transformers and GNNs. A
first noteworthy method, namely VELVET, has been proposed by Ding et al. [21], with the aim to take
into account both local and global contexts of statements in C/C++ function source code. VELVET
firstly builds a CPG and then extracts vector representations for the nodes using word2vec, similarly
to ReVeal. Node embeddings are processed simultaneously by a GNN and a Transformer to update
node representations, and their outputs are independently processed by a classifier. The predictions are
combined in a final stage, named Embedding and Ranking, that assigns the label to the code.
   In [22] Hin et al. propose LineVD, an approach similar to VELVET. The overall architecture mostly
differs in the final stage: the outputs of the Transformer and the GNN are processed by a
neural network that also provides the final prediction. LineVD is based on CodeBERT [39], a Transformer
trained to embed statements and functions as vectors, and on a Graph Attention Network (GAT), a GNN
that exploits the attention mechanism during the aggregation of a node's neighborhood to weight
each neighbor differently on the basis of its relevance.
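The sketch below gives a minimal, simplified illustration of this kind of combination (inspired by LineVD, but not the authors' released code): each statement is encoded with a pre-trained CodeBERT model and the embeddings are refined with a Graph Attention layer; the model name, the dimensions and the toy statement graph are our own assumptions.

import torch
from transformers import AutoModel, AutoTokenizer
from torch_geometric.nn import GATConv

# Pre-trained code Transformer used as a statement encoder (assumption: microsoft/codebert-base).
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
bert = AutoModel.from_pretrained("microsoft/codebert-base")

statements = ["int y;", "if (x > 0)", "y = x * 2;", "y = -x;", "return y;"]
with torch.no_grad():
    enc = tok(statements, return_tensors="pt", padding=True, truncation=True)
    x = bert(**enc).last_hidden_state[:, 0, :]   # first-token embedding per statement (768-dim)

# Toy statement graph: edges stand for control/data dependencies between statements.
edge_index = torch.tensor([[0, 1, 1, 2, 3],
                           [1, 2, 3, 4, 4]], dtype=torch.long)

gat = GATConv(in_channels=768, out_channels=128, heads=4, concat=False)
classifier = torch.nn.Linear(128, 2)     # vulnerable / non-vulnerable per statement

h = gat(x, edge_index)                   # attention-weighted neighbourhood aggregation
logits = classifier(h)
print(logits.shape)                      # (num_statements, 2)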


3. Experimental Framework
As introduced in Section 1, the main purpose of this work is to evaluate whether deep learning methods
can be considered an effective tool to predict source code vulnerabilities; indeed, the analysis of the
state of the art reveals significant gaps that do not allow us to infer a clear answer. In this section we
describe how we have prepared the dataset and how we have used it to train and compare the selected
methods.

3.1. Dataset Preparation
Since the most relevant issue is the data, we first focused our effort on preparing a dataset that can
provide significant and comparable results, starting from those that are publicly available. We excluded
datasets containing synthetic or semi-synthetic functions, as well as those labeled exclusively
through automatic tools; the final choice has been to consider only samples belonging to Devign [18],
Big-Vul [26], DiverseVul [27], and ReVeal [20]. We named this new dataset the MIVIA Vulnerable
Functions Source Code (MVFSC); its composition is detailed in Table 1.
   Devign is composed of functions from large-scale, real-world open-source projects, such as FFmpeg,
QEMU, Linux Kernel, and Wireshark. Each function has been manually labeled by a team of three
security experts over seven years to reduce false positives and negatives coming from automated
static analysis. However, only a small subset of Devign, containing samples of the Linux Kernel
and the FFmpeg projects, is publicly available. Big-Vul contains a large and highly heterogeneous amount
of samples extracted from 348 different software projects, including Chromium, Linux Kernel, and
Android. Differently from Devign, which has only a binary label (vulnerable or not), Big-Vul also includes,
for a subset of the samples, the CWE classification. Furthermore, since the vulnerabilities in Big-Vul
are documented in the CVE database, their authenticity is effectively confirmed. DiverseVul has been
built similarly to Big-Vul, mostly containing source code from real, large, open-source projects and
attaching a CWE to each vulnerable sample. The labelling has been performed through the analysis of security
issue websites and commits related to vulnerability fixes. According to the authors, DiverseVul includes
high-quality commits from various open-source projects, making it an effective tool for training models
capable of generalizing on unseen data. Finally, ReVeal has been prepared by extracting historical
vulnerabilities from two large and relevant open-source projects, namely the Debian and Chromium
projects. The dataset contains 22,734 samples, covering several types of weaknesses
such as buffer overflows, format string issues, and integer overflows.

Table 1
Composition of the datasets selected to build the one used for the experiments; the last row reports the
composition of the resulting MVFSC.
                    Dataset      Vulnerable Functions   Neutral Functions    Labelling
                    Devign              12,460               14,858           Binary
                    BigVul              11,823              253,096          Multi-class
                   DiverseVul           18,945              330,492          Multi-class
                     Reveal             2,240                20,494           Binary
                    MVFSC               31,985              364,145          Multi-class

   The preparation of this new dataset required meticulous data processing. Since some datasets
may contain samples extracted from the same software projects, we first had to remove duplicated
functions to avoid redundancies or inconsistent labelling. The detection of duplicated functions has
been performed by computing a hash for each of them, so functions with duplicate hashes have been made
unique under the assumption that duplicates originated from the same commit or project. In case of
duplicates with different labels, we have removed these samples from the dataset to avoid ambiguities.
Subsequently, with the aim of preparing the data to be properly processed by word embedding neural
networks, we have cleaned the functions by removing carriage returns, line feeds, tabs and excess
spaces. This operation has been done very carefully to keep the original syntax and semantics of
the source code unchanged, while maintaining the structural and functional integrity of the code. To
prevent the models from being biased by useless information, we also removed the comments within the code;
they may contain clarifications for humans as well as subjective, irrelevant, or potentially misleading
information. Since we also have to extract graph-based representations from the source code using
automatic code parsing tools, the last step has been to remove the C/C++ functions that were not
properly recognized by these tools. For the sake of clarity, the training set is composed of the data
belonging to Devign, BigVul and DiverseVul, while ReVeal has been used as test set only.
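The sketch below summarizes this kind of preprocessing (a simplified re-implementation under our own assumptions, not the exact pipeline used for MVFSC): functions are hashed to detect duplicates, samples with conflicting labels are dropped, and comments and redundant whitespace are removed.

import hashlib
import re

def clean(code):
    """Remove comments and collapse whitespace, keeping the code itself unchanged."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)   # block comments
    code = re.sub(r"//[^\n]*", " ", code)                # line comments
    return re.sub(r"\s+", " ", code).strip()             # tabs, newlines, extra spaces

def deduplicate(samples):
    """samples: iterable of (code, label); returns unique, consistently labelled functions."""
    seen = {}            # hash -> (cleaned code, label)
    conflicting = set()
    for code, label in samples:
        c = clean(code)
        h = hashlib.sha256(c.encode("utf-8")).hexdigest()
        if h in seen and seen[h][1] != label:
            conflicting.add(h)           # same function, inconsistent labels: drop it
        seen.setdefault(h, (c, label))
    return [v for h, v in seen.items() if h not in conflicting]

samples = [("int f() { return 0; } // ok", 0),
           ("int  f() {\n return 0; }", 0),      # duplicate after normalization
           ("void g(char *p) { gets(p); }", 1)]
print(len(deduplicate(samples)))                  # 2 unique functions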
   At the end of the data preprocessing, the training set consists of 396,130 samples, of which 364,145
belong to the neutral class and 31,985 to the vulnerable class. Among the vulnerable samples,
15,656 are labeled with a specific CWE. The test set consists of a total of 19,762 samples, of which
17,786 are labeled as neutral and 1,916 as vulnerable. It is worth pointing out that this dataset is highly
unbalanced; however, being prepared from real software projects, it reasonably represents the real a-priori
distribution of the data, a fundamental requirement to properly train a machine learning model.
   Then, a graph-based representation has been extracted from the source code samples using Joern [40],
a well-known open-source framework for static code analysis designed to extract graphs from large
software projects. It processes the source code by decomposing it into an AST, where each node is a
block of code; the representation is then extended with additional information, such as dependencies
between variables and execution paths, to finally obtain a CPG.
3.2. Deep Learning Approaches
As previously introduced, in the proposed analysis we have considered different kinds of machine
learning approaches, from basic recurrent neural networks, like LSTMs, to more complex architectures
such as Transformers and GNNs. On the one hand, we have taken into account basic methods to
assess the complexity of the problem at hand, so as to understand whether common baseline approaches
are sufficiently effective; on the other hand, we aimed at understanding whether attention mechanisms and
structural approaches, like GNNs, can provide benefits when dealing with such a challenging problem.
   More specifically, we have used Word2Vec [28] (w2v) as the source code embedding method for all the
considered approaches, since all of them require turning the text into vectors in order to process the
source code. Therefore, we started from a pre-trained w2v model and fine-tuned it for
the task at hand, so as to get more significant vector representations of the tokens that commonly occur in
the C/C++ language.
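A possible way to realize this step with gensim is sketched below (the toolchain and hyper-parameters are our own assumptions, since the exact implementation is not prescribed here): a pre-trained Word2Vec model is extended with the C/C++ tokens of the corpus and its training is continued on the code.

from gensim.models import Word2Vec

# Stand-in for a pre-trained word2vec model; in practice it would be loaded with Word2Vec.load().
base_corpus = [["the", "program", "reads", "the", "input", "file"]]
w2v = Word2Vec(sentences=base_corpus, vector_size=64, min_count=1, epochs=5)

code_corpus = [
    ["int", "foo", "(", "int", "x", ")", "{", "return", "x", "*", "2", ";", "}"],
    ["void", "bar", "(", "char", "*", "p", ")", "{", "strcpy", "(", "p", ",", "s", ")", ";", "}"],
]

# Extend the vocabulary with tokens that commonly occur in C/C++ and continue training.
w2v.build_vocab(code_corpus, update=True)
w2v.train(code_corpus, total_examples=len(code_corpus), epochs=5)

print(w2v.wv["strcpy"].shape)   # fine-tuned embedding for a C library call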
   Once we have a sequence of vectors corresponding to the sequence of tokens in the source code,
we have built a neural network based on an LSTM followed by a fully connected network as output stage to
predict whether a given function is vulnerable. As an alternative to this architecture, we have also used a
Bidirectional LSTM (BiLSTM) to evaluate whether taking into account the dependencies among the tokens
in both directions helps to obtain a more discriminative representation to be fed to the fully
connected layer. Both approaches have been trained end-to-end together with w2v.
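A minimal PyTorch sketch of this architecture is given below (layer sizes and dimensions are illustrative assumptions): an embedding layer, which in the real pipeline would be initialized from the fine-tuned w2v model, a bidirectional LSTM, and a fully connected output stage.

import torch
import torch.nn as nn

class LSTMVulnClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128, bidirectional=True):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # initialized from w2v in practice
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.fc = nn.Linear(out_dim, 2)                      # vulnerable / non-vulnerable

    def forward(self, token_ids):                            # token_ids: (batch, seq_len)
        h = self.emb(token_ids)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1, :])                        # logits from the last time step

model = LSTMVulnClassifier(vocab_size=30000)
logits = model(torch.randint(0, 30000, (4, 120)))            # a batch of 4 token sequences
print(logits.shape)                                          # torch.Size([4, 2])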
   Another baseline approach that can be easily adapted to the problem at hand is TextCNN, a convolu-
tional neural network (CNN) proposed in [41] for sentence classification. This approach, initially
designed for sentiment analysis, has recently been applied also to malware code classification in [43].
It uses w2v to build a vector representation of the words in a sentence; the obtained vectors are then
arranged in a matrix and processed by a CNN, similarly to what is done for images; finally, a fully
connected layer provides the label.
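The following sketch shows a possible TextCNN over the matrix of w2v vectors (filter sizes and counts are assumptions, loosely following [41]).

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """1-D convolutions over the sequence of word vectors, max-pooled and classified."""
    def __init__(self, emb_dim=64, num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 2)

    def forward(self, x):                        # x: (batch, seq_len, emb_dim) from w2v
        x = x.transpose(1, 2)                    # -> (batch, emb_dim, seq_len)
        feats = []
        for conv in self.convs:
            h = F.relu(conv(x))                  # (batch, num_filters, seq_len - k + 1)
            feats.append(h.max(dim=2).values)    # global max pooling over time
        return self.fc(torch.cat(feats, dim=1))

model = TextCNN()
logits = model(torch.randn(4, 120, 64))          # a batch of 4 embedded functions
print(logits.shape)                              # torch.Size([4, 2])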
   In addition to the previous models, we have analyzed how GNNs perform on the
task at hand. To this purpose, we have considered a state-of-the-art message passing GNN,
namely GraphSAGE [42], to process the CPG graphs extracted from the source code. It is worth noting
that in these graphs each node contains source code as text, which cannot be processed by a GNN directly.
Therefore, we used w2v to extract a vector representation for all the words in the code snippet related
to a node; then, the feature vectors have been summarized into a single vector adopting two approaches:
an average pooling layer and a BiLSTM. The node embeddings produced by GraphSAGE have then been
processed by a pooling layer to obtain a vector representation of the whole CPG graph, finally
fed to a fully connected stage to provide the classification.
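A minimal PyTorch Geometric sketch of the average-pooling variant is given below (layer sizes are assumptions): each CPG node carries a w2v-derived feature vector, two SAGEConv layers update the node representations, a global pooling layer summarizes the graph, and a fully connected layer classifies it.

import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class SageVulnClassifier(nn.Module):
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.fc = nn.Linear(hidden, 2)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        g = global_mean_pool(h, batch)          # one vector per CPG
        return self.fc(g)

# Toy CPG: 5 nodes whose features are the (average-pooled) w2v embeddings of their tokens.
x = torch.randn(5, 64)
edge_index = torch.tensor([[0, 1, 1, 2, 3], [1, 2, 3, 4, 4]], dtype=torch.long)
batch = torch.zeros(5, dtype=torch.long)        # all nodes belong to graph 0

model = SageVulnClassifier()
print(model(x, edge_index, batch).shape)        # torch.Size([1, 2])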
   Finally, we have fine-tuned VulBERTa [17] using the proposed dataset, to obtain results comparable
with the other approaches and to evaluate whether such a complex architecture can really provide benefits. In
the analysis we have taken into account both VulBERTa-MLP and VulBERTa-TextCNN. In addition,
inspired by the ReVeal architecture proposed in [20], we have also explored the possibility of using the
VulBERTa Transformer as an embedding network for the nodes of the CPG graph. The obtained graphs have
been processed by a Gated Graph Neural Network (GGNN), a variant of GNNs inspired by the gating
methods used to enhance the stability and learning capabilities of recurrent neural networks (RNNs),
such as LSTMs. The adopted architecture has been designed with two GGNN layers separated by a TopK
pooling layer. The produced graph embedding is then processed by a fully connected layer.
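The sketch below approximates the described architecture with PyTorch Geometric (the embedding dimension, the number of propagation steps and the pooling ratio are assumptions): two GatedGraphConv layers separated by TopK pooling, followed by global pooling and a fully connected classifier.

import torch
import torch.nn as nn
from torch_geometric.nn import GatedGraphConv, TopKPooling, global_mean_pool

class GGNNVulnClassifier(nn.Module):
    def __init__(self, hidden=768, ratio=0.5):
        super().__init__()
        # GatedGraphConv applies GRU-style gating while aggregating neighbour messages.
        self.ggnn1 = GatedGraphConv(out_channels=hidden, num_layers=3)
        self.pool = TopKPooling(hidden, ratio=ratio)     # keep only the most relevant nodes
        self.ggnn2 = GatedGraphConv(out_channels=hidden, num_layers=3)
        self.fc = nn.Linear(hidden, 2)

    def forward(self, x, edge_index, batch):
        h = self.ggnn1(x, edge_index)
        h, edge_index, _, batch, _, _ = self.pool(h, edge_index, batch=batch)
        h = self.ggnn2(h, edge_index)
        g = global_mean_pool(h, batch)                   # graph-level embedding
        return self.fc(g)

# Node features produced by the Transformer encoder (768-dim, as in BERT-like models).
x = torch.randn(6, 768)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]], dtype=torch.long)
batch = torch.zeros(6, dtype=torch.long)

model = GGNNVulnClassifier()
print(model(x, edge_index, batch).shape)                 # torch.Size([1, 2])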
   For all the neural networks, the output of the fully connected layer is passed to a softmax function,
and we have used a binary cross-entropy loss function.


4. Results
In Table 2 we report the results of the experimental analysis considering different well-known metrics.
In addition to accuracy (Eq. 1), defined as the number of correct predictions with respect to the total
amount of samples processed, we have also reported precision (Eq. 2), recall (Eq. 3), and F1-Score (Eq. 4).
Table 2
Results of the experimental analysis on the test set.
                Method                            Precision   Recall   F1-Score   Accuracy
                w2v + LSTM                        0.16         0.53      0.25       0.69
                w2v + BiLSTM                      0.17         0.59      0.27       0.68
                w2v + TextCNN                     0.24         0.53      0.33       0.83
                w2v + GraphSAGE                   0.55         0.82      0.65       0.93
                w2v + BiLSTM + GraphSAGE          0.27         0.84      0.40       0.76
                VulBERTa-TextCNN                  0.19         0.68      0.30       0.73
                VulBERTa-MLP                      0.57         0.67      0.55       0.70
                VulBERTa-GGNN                     0.60         0.75      0.59       0.72


The latter are relevant to the problem at hand since the dataset is highly unbalanced, so using
only the accuracy can lead to misleading interpretations of the actual performance.
   For the sake of clarity, we consider as true positives (TPs) and true negatives (TNs) the
vulnerable and non-vulnerable samples classified as such, respectively, while false positives (FPs) are
non-vulnerable instances misclassified as vulnerable and false negatives (FNs) are vulnerable
samples labelled as non-vulnerable. In the following equations we report how the performance metrics
have been computed from the numbers of TPs, TNs, FPs, and FNs.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{F1-Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$
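For completeness, the snippet below shows how these metrics can be computed from the confusion counts; the values used here are purely illustrative and are not those of Table 2.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts on an imbalanced test set (not the actual experimental results).
acc, prec, rec, f1 = metrics(tp=1500, tn=16000, fp=1786, fn=416)
print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f} F1={f1:.2f}")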
   The complexity of the task at hand when performed on real-world data is confirmed by the fact that
all the baseline deep learning approaches commonly adopted for text processing have achieved
a very low performance. Among these, the best one in terms of accuracy and F1-Score is TextCNN
with the word embedding obtained through w2v. From the results, it emerges that using the VulBERTa
Transformer instead of w2v does not provide any advantage to TextCNN, since there is a drop in both
metrics. This ineffectiveness of VulBERTa with respect to a simpler model like w2v can be related
to the limited amount of vulnerable samples. On the other hand, when using a simpler output
network like the MLP, VulBERTa obtained the most balanced performance between precision
and recall among all the considered non-structural approaches, suggesting that the Transformer is able
to learn a good representation of the code.
   The main evidence from the results in Table 2 is that using a graph-based representation of the source
code significantly improves the performance on the task. In all the cases it is possible to note
an increase of the recall, suggesting that the model is less prone to false negatives. In particular, in
the case of the methods w2v + GraphSAGE and VulBERTa-GGNN this improvement is not achieved
by affecting the precision, and thus without notably increasing the number of false positives. Also in
the case of graph-based representations, the model achieving the best performance is the one using the
simpler word embedding layer, namely w2v + GraphSAGE, thus providing additional evidence that
the limited availability of samples can be the problem.
   In conclusion, processing the source code as text makes it possible to specialize most of the recent NLP
approaches; however, differently from other domains, there is a relevant scarcity of data as well as a greater
complexity in producing good quality datasets. This problem impacts the possibility of properly
training large models. In our analysis, even using a dataset that is considerably larger than those used
in some recent analyses, we have not been able to obtain a noteworthy performance improvement from
state-of-the-art Transformers when working on real data. Nevertheless, the structural information provided
by graph-based representations can help to mitigate this problem by allowing more effective
deep learning models to be obtained, but further analyses are required.


5. Conclusions
In this paper, we compare different deep-learning methods for predicting if a C/C++ function is
vulnerable. We have meticulously prepared a dataset of source code snippets by merging four publicly
available datasets of functions from well-known open-source projects. Due to the different labeling and
collection methods of each dataset, we cleaned and post-processed the samples to eliminate biases and
achieve a uniform labeling. We used the samples of three datasets to train the models, reserving those
belonging to the ReVeal dataset for testing, to ensure comparable and significant results in a realistic
system setup of vulnerability detection.
   This is the first evaluation of deep-learning techniques on such a large dataset of source code derived
from real software projects. The analysis confirmed the complexity of the task by highlighting the
ineffectiveness of baseline methods and showed that using structural representations of source code
can lead to more accurate models.
   Further in-depth analyses are necessary to assess the robustness, reliability and
accuracy of deep-learning methods. This will involve exploring different approaches and expanding the
current dataset to make it more representative of the problem. In addition, future analyses will have to
evaluate the capability of deep learning methods to classify the CWEs.


References
 [1] National Institute of Standards and Technology (NIST), Common Vulnerabilities and Exposures
     (CVE), https://nvd.nist.gov/, accessed 2023.
 [2] D. A. Wheeler, Flawfinder, DWheeler, 2009. URL: https://www.dwheeler.com/flawfinder/, [Online;
     accessed 19-October-2023].
 [3] P. Arteau, Spotbugs, SpotBugs, 2023. URL: https://spotbugs.github.io, [Online; accessed 19 october
     2023].
 [4] T. Kamiya, S. Kusumoto, K. Inoue, Ccfinder: a multilinguistic token-based code clone detection
     system for large scale source code, IEEE Transactions on Software Engineering 28 (2002) 654–670.
 [5] S. Woo, S. Kim, H. Lee, H. Oh, Vuddy: A scalable approach for vulnerable code clone discovery, in:
     IEEE Symposium on Security and Privacy (SP), IEEE, San Jose, CA, USA, 2017, pp. 595–614.
 [6] S. M. Ghaffarian, H. R. Shahriari, Software vulnerability analysis and discovery using machine-
     learning and data-mining techniques: A survey, ACM Comput. Surv. 50 (2017) 1–36.
 [7] F. Yamaguchi, N. Golde, D. Arp, K. Rieck, Modeling and discovering vulnerabilities with code
     property graphs, in: 2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 590–604.
 [8] K. A. Farris, A. Shah, G. Cybenko, R. Ganesan, S. Jajodia, Vulcon: A system for vulnerability
     prioritization, mitigation, and management, ACM Transactions on Privacy and Security 21 (2018)
     1–28. doi:10.1145/3196884.
 [9] N. Alexopoulos, S. M. Habib, S. Schulz, M. Mühlhäuser, The tip of the iceberg: On the merits of
     finding security bugs, ACM Transactions on Privacy and Security 24 (2020) 1–33. doi:10.1145/
     3406112.
[10] R. Croft, D. Newlands, Z. Chen, An empirical study of rule-based and learning-based approaches
     for static application security testing, in: Association for Computing Machinery, New York, NY,
     USA, 2021, pp. 1–12.
[11] G. Lin, S. Wen, Q.-L. Han, J. Zhang, Y. Xiang, Software vulnerability detection using deep neural
     networks: A survey, Proceedings of the IEEE 108 (2020) 1825–1848.
[12] J. Wang, M. Huang, Y. Nie, J. Li, Static analysis of source code vulnerability using machine learning
     techniques: A survey, in: 2021 4th International Conference on Artificial Intelligence and Big Data
     (ICAIBD), IEEE, 2021, pp. 76–86.
[13] Z. Li, et al., Vuldeepecker: A deep learning-based system for vulnerability detection, in: Proc.
     NDSS, 2018, pp. 1–15.
[14] H. Wang, G. Ye, Z. Tang, S. Tan, S. Huang, D. Fang, Y. Feng, L. Bian, Z. Wang, Combining graph-
     based learning with automated data collection for code vulnerability detection, IEEE Transactions
     on Information Forensics and Security 16 (2021) 1943–1958.
[15] C. D. Sestili, W. S. Snavely, N. M. VanHoudnos, Towards security defect prediction with ai, https:
     //apps.dtic.mil/sti/pdfs/AD1090852.pdf, 2018. [Online; accessed 20 november 2023].
[16] D. Votipka, R. Stevens, E. Redmiles, J. Hu, M. Mazurek, Hackers vs. testers: A comparison of
     software vulnerability discovery processes, in: Proc. IEEE Symp. Secur. Privacy (SP), IEEE, 2018,
     pp. 374–391.
[17] H. Hanif, S. Maffeis, VulBERTa: Simplified source code pre-training for vulnerability detection, in:
     Proceedings of the International Joint Conference on Neural Networks, 2022.
[18] Y. Zhou, S. Liu, J. Siow, X. Du, Y. Liu, Devign: Effective vulnerability identification by learning
     comprehensive program semantics via graph neural networks, in: Advances in Neural Information
     Processing Systems, 2019, pp. 10197–10207.
[19] M. Allamanis, Graph Neural Networks on Program Analysis, Graph Neural Networks: Foundations,
     Frontiers, and Applications, 2021.
[20] S. Chakraborty, R. Krishna, Y. Ding, B. Ray, Deep learning based vulnerability detection: Are we
     there yet?, IEEE Transactions on Software Engineering 48 (2022) 3280–3296.
[21] Y. Ding, S. Suneja, Y. Zheng, J. Laredo, A. Morari, G. Kaiser, B. Ray, VELVET: a noVel Ensemble
     learning approach to automatically locate VulnErable sTatements, in: 2022 IEEE International
     Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, pp. 959–970.
[22] D. Hin, A. Kan, H. Chen, M. A. Babar, Linevd: Statement-level vulnerability detection using graph
     neural networks, in: Proceedings - 2022 Mining Software Repositories Conference, MSR 2022,
     2022, pp. 596–607.
[23] L. Li, S. H. H. Ding, Y. Tian, B. C. M. Fung, P. Charland, W. Ou, L. Song, C. Chen, VulANa-
     lyzeR: Explainable binary vulnerability detection with multi-task learning and attentional graph
     convolution, ACM Trans. Priv. Secur. 26 (2023).
[24] M. Allamanis, M. Brockschmidt, M. Khademi, Learning to represent programs with graphs, in: 6th
     International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings,
     2018.
[25] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional Transformers
     for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[26] J. Fan, Y. Li, S. Wang, T. N. Nguyen, A c/c++ code vulnerability dataset with code changes and cve
     summaries, in: 2020 IEEE/ACM 17th International Conference on Mining Software Repositories
     (MSR), 2020, pp. 508–512. doi:10.1145/3379597.3387501.
[27] Y. Chen, Z. Ding, L. Alowain, X. Chen, D. Wagner, DiverseVul: A new vulnerable source code
     dataset for deep learning based vulnerability detection, in: Proceedings of the 26th International
     Symposium on Research in Attacks, Intrusions and Defenses, RAID ’23, Association for Computing
     Machinery, New York, NY, USA, 2023, p. 654–668.
[28] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector
     space, 2013. arXiv:1301.3781.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, K. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
     Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
     wanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30,
     Curran Associates, Inc., 2017.
[30] D. Zou, S. Wang, S. Xu, Z. Li, H. Jin, µVulDeePecker: A deep learning-based system for multiclass
     vulnerability detection, IEEE Transactions on Dependable and Secure Computing 18 (2021)
     2224–2236.
[31] Y. Liu, M. Ott, N. Goyal, J. D., M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
     RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[32] B. Fluri, M. Wursch, M. PInzger, H. Gall, Change distilling:tree differencing for fine-grained
     source code change extraction, IEEE Transactions on Software Engineering 33 (2007) 725–743.
     doi:10.1109/tse.2007.70731.
[33] F. Allen, Control flow analysis, in: Proceedings of a Symposium on Compiler Optimization,
     Association for Computing Machinery, New York, NY, USA, 1970, p. 1–19.
[34] J. Ferrante, K. J. Ottenstein, J. D. Warren, The program dependence graph and its use in optimization,
     ACM Transactions on Programming Languages and Systems 9 (1987) 319–349.
[35] D. Conte, P. Foggia, C. Sansone, M. Vento, Thirty years of graph matching in pattern recognition,
     International Journal of Pattern Recognition and Artificial Intelligence 18 (2004) 265–298. doi:10.
     1142/s0218001404003228.
[36] P. Foggia, G. Percannella, M. Vento, Graph matching and learning in pattern recognition in the
     last 10 years, International Journal of Pattern Recognition and Artificial Intelligence 28 (2014)
     1450001. doi:10.1142/s0218001414500013.
[37] M. Vento, A long trip in the charming world of graphs for pattern recognition, Pattern Recognition
     48 (2014) 291–301. doi:10.1016/j.patcog.2014.01.002.
[38] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural
     networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 4–24. doi:10.
     1109/tnnls.2020.2978386.
[39] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou,
     Codebert: A pre-trained model for programming and natural languages, in: Findings of the
     Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547.
[40] Joern Team, Joern: The bug hunter’s workbench, 2023. URL: https://joern.io/, [Online; accessed 30
     december 2023].
[41] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882
     (2014).
[42] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: I. Guyon,
     U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in
     Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[43] Q. Wang, Q. Qian, Malicious code classification based on opcode sequences and textcnn network,
     Journal of Information Security and Applications 67 (2022) 103151. doi:10.1016/j.jisa.2022.
     103151.