Identifying Vulnerable Functions from Source Code using Vulnerability Reports

Rabaya Sultana Mim, Toukir Ahammed and Kazi Sakib
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh

Abstract
A software vulnerability is a flaw within a software product that can be exploited to make the system violate its security. In large and evolving software systems, developers find it challenging to identify the vulnerable functions when a new vulnerability is reported. Existing studies have underutilized vulnerability reports, which can be a good source of contextual information for identifying vulnerable functions in source code. This study proposes an information retrieval based method called Vulnerable Functions Detector (VFDetector) for identifying vulnerable functions from source code and vulnerability reports. VFDetector ranks functions based on the textual similarity between the vulnerability report corpus and the source code corpora. The ranking is produced by a tweaked Vector Space Model (tVSM), which modifies the conventional Vector Space Model to take the size of a function into account. As an initial study, the approach has been evaluated by analysing 10 vulnerability reports from six popular open-source projects. The results show that VFDetector ranks the actual vulnerable function at the first position in 40% of cases. Moreover, it ranks the actual vulnerable function within the top 5 in 90% of cases and within the top 7 for all analysed reports. Developers can therefore use these rankings to implement patches for vulnerable functions more quickly.

Keywords
vulnerability identification, vulnerable function, vulnerability report, source code, vector space model

QuASoQ 2023: 11th International Workshop on Quantitative Approaches to Software Quality, December 04, 2023, Seoul, Korea
msse1730@iit.du.ac.bd (R. S. Mim); toukir@iit.du.ac.bd (T. Ahammed); sakib@iit.du.ac.bd (K. Sakib)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



1. Introduction

A software vulnerability is a flaw, weakness, or error in a computer program or system that can be exploited by malicious attackers to compromise its integrity, availability, or confidentiality [1]. Software vulnerabilities make software systems increasingly vulnerable to attack and damage, which raises security concerns [2].

Developers need to spend a lot of time identifying vulnerable functions in a large code space when a new vulnerability is reported. Identifying vulnerable functions effectively is a prerequisite for writing a patch for the reported vulnerability, and doing so at the earliest time is essential for enhancing software security and mitigating potential risks and threats.

Existing studies have focused on detecting software vulnerabilities using text-based [3, 4, 5] or graph-based [6, 7] approaches. These approaches either treat source code as plain text or apply graph analysis by representing the source code as a graph. Prior text-based studies treat source code as plain text and apply static program analysis or natural language processing; however, their performance is limited because they disregard source code semantics. Graph-based approaches, on the other hand, conduct program analysis that represents the source code semantics as a graph and then apply graph analysis methods such as Graph Neural Networks (GNN) [8] to identify vulnerabilities. Although these graph-based approaches are more effective at identifying vulnerabilities because they take into account the semantic relationships among lines of source code, their scalability is substantially lower than that of text-based approaches.

However, existing studies have underutilized vulnerability reports, which can be a good source of contextual information for detecting vulnerabilities in source code. In this context, the current study aims to verify the role of vulnerability reports in identifying vulnerable functions. A vulnerability report can contain contextual information about a vulnerability which may be used to identify vulnerable functions: when a function is vulnerable in a given scenario, some keywords should be shared between that function and the vulnerability report. This motivates the authors to study whether vulnerable functions can be identified by analysing the source code together with the vulnerability report.

For this purpose, this study proposes VFDetector, a technique for automatic identification of vulnerable functions. It takes all source code files of a system as input. First, it extracts all source code functions of that system. Then static analysis is performed to extract the contents of those functions. Several text pre-processing steps such as tokenization, stopword removal, multiword splitting, semantic meaning extraction and lemmatization are applied to the source code along with the vulnerability report to produce code and
report corpora. In addition, programming language specific keywords are removed when generating the code corpora. Finally, to rank the vulnerable functions, similarity scores between the code corpus of each function and the report corpus are measured by a modified version of the Vector Space Model (tVSM), in which larger functions get more weight during ranking.

In the experiments, as an initial study, ten Common Vulnerabilities and Exposures (CVE) reports were chosen randomly from six open-source GitHub repositories. Based on the commit link available in each report, we crawled the corresponding project as it was before the vulnerability was patched. The result analysis shows that VFDetector ranks the vulnerable function at the first position in 40% of cases, within the top 5 in 90% of cases, and within the top 7 in 100% of cases.

It is evident from the results that VFDetector performs promisingly in detecting vulnerable functions against a vulnerability report in large-scale software systems. It is also observed that, in the Top 5 and Top 7 rankings, the functions ranked above the actual vulnerable function are functions related to that vulnerability which acquire higher similarity. This guides a developer to patch those related functions too in order to mitigate the vulnerability in the system.

The remainder of this paper is structured as follows: Section 2 gives an overview of previous studies on vulnerability detection at the file or function level. Section 3 describes our methodology for detecting vulnerable functions in a project. Section 4 reports our experimental findings and the analysis thereof. Section 5 discusses the threats to the validity of our work. Section 6 motivates future research directions and concludes this paper.

2. Related Work

In recent years, the research community has directed significant attention toward the issue of vulnerability detection, primarily due to the complex challenges it presents. The existing body of literature has introduced numerous methodologies in response to these challenges. These methods can be classified into three distinct categories based on the degree of automation: manual, semi-automatic, and fully automatic techniques.

Manual techniques rely on human experts to create vulnerability patterns. However, not all patterns can be generated manually, which leads to reduced detection efficiency in practical scenarios. In contrast, semi-automatic techniques involve human experts in the extraction of specific features such as API symbols [9] and function calls [10], which are then fed into traditional machine learning models for vulnerability detection. Fully automatic techniques utilize Deep Learning (DL) to automatically extract features and construct vulnerability patterns without manual expert intervention. Recently, DL-based techniques [11, 12, 13] have gained extensive use in detecting source code vulnerabilities due to their ability to automatically extract features from source code. DL-based methods can be categorized into text-based and graph-based methods.

Text based methods: The text-based approach in vulnerability detection treats a program's source code as text and employs natural language processing techniques to identify vulnerabilities. Russell et al. [3] introduced the TokenCNN model, which utilizes lexical analysis to acquire source code tokens and employs a Convolutional Neural Network (CNN) to detect vulnerabilities.

Li et al. [4] proposed VulDeePecker, a method that collects code gadgets by slicing programs and transforms them into vector representations, training a Bidirectional Long Short Term Memory (BLSTM) model for vulnerability recognition.

Zou et al. [5] introduced µVulDeePecker, which enhances VulDeePecker by incorporating code attention with control dependence to detect multi-class vulnerabilities. However, the performance of these text-based approaches is limited because they rely solely on static source code analysis and do not account for the program semantics.

Graph based methods: To address the limitations of text-based methods, researchers have turned to dynamic program analysis to convert a program's source code semantics into a graph representation, facilitating vulnerability detection through graph analysis. Zhou et al. [6] introduced Devign, which employs a graph neural network for vulnerability identification. This approach includes a convolutional module that efficiently extracts critical features for graph-level classification from the learned node representations. By pooling the nodes, a comprehensive representation for graph-level classification is achieved.

Cheng et al. [7] introduced a different approach named DeepWukong, which divides the program dependency graph into various subgraphs after distilling the program semantics based on program points of interest. These subgraphs are then used to train a vulnerability detector through a graph neural network. While these graph-based techniques prove more effective in identifying vulnerabilities, their scalability is worse than that of text-based methods due to the large number of graph nodes in complex programs.

Exploring the existing literature, it is evident that text-based methods lack program semantics, while graph-based methods achieve high accuracy by considering source code semantics but have scalability issues in complex scenarios. Moreover, due to the underutilization of contextual information such as vulnerability reports alongside source code, existing methods fail to detect complicated vulnerabilities in real-world projects, because whenever
a new vulnerability is reported in a system, it is hard to determine in which function the vulnerability exists, as the system consists of a huge volume of functions. Before using vulnerability reports as a source of contextual information in existing methods, it is important to verify whether vulnerable functions can be identified effectively using these reports. Moreover, identifying vulnerable functions using vulnerability reports can play an effective role in minimizing the search space of existing methods.

3. Methodology

This study proposes an approach that detects vulnerable functions among the huge volume of files of a large software system using vulnerability reports. The overall process consists of three distinct steps: Source Code Corpora Generation, Vulnerability Report Corpora Generation, and Ranking Vulnerable Functions. Each of these steps encompasses a series of tasks, as illustrated in Figure 1. First, all files and their corresponding functions are extracted from a particular version of a software system. This source code is then processed to create the code corpora. Similarly, the vulnerability report is processed to produce the report corpus. Finally, the similarity between the report and code corpora is measured using the tweaked Vector Space Model (tVSM) to rank the source code functions.

3.1. Dataset

We used the benchmark dataset Big-Vul¹ developed by Fan et al. [14]. This dataset comprises reliable and comprehensive code vulnerabilities which are directly linked to the publicly accessible CVE database. Notably, the creation of this dataset involved a significant investment of manual effort to ensure its high quality. Furthermore, the dataset is noteworthy for its substantial scale, being one of the most extensive vulnerability datasets available. It is derived from a collection of 348 open-source GitHub projects, encompassing a time span from 2002 to 2019, and covers 91 distinct Common Weakness Enumeration (CWE) categories. It comprises approximately 188,600 C/C++ functions, with 5.6% of them identified as vulnerable (equivalent to 10,500 vulnerable functions). The dataset provides granular ground-truth information at the function level, specifying which functions within a codebase are susceptible to vulnerabilities.

3.2. Source Code Corpora Generation

Source code corpora consist of the source code terms used to assess similarity with the vulnerability report corpus. Therefore, the precision of code corpora generation directly impacts the precision of matching and, consequently, the accuracy of vulnerability localization. In this step, all folders of a system are extracted with their corresponding C/C++ files. From each of these files, all functions are extracted automatically into individual C files, which ensures function-level analysis. For example, CVE-2014-2038 of Linux version 3.13.5 involves 15,675 files with a total of 229,682 functions.

This stage generates a vector of lexical tokens by performing lexical analysis on every source code file. Source code contains unnecessary tokens that do not carry any vulnerability-related information; these are discarded, such as programming language specific keywords (e.g., int, if, float, switch, case, struct) and stop words (e.g., all, and, an, the). Many terms in the source code are compounds of multiple words. For example, the term "addRequest" consists of the keywords "add" and "Request". Such multiwords are separated using a multiword identifier. Furthermore, statements are split according to certain syntax-specific separators such as ',', '=', '(', ')', '{', '}', '/', and so on. WordNet² is used to derive each word's semantic meanings, because a term might have more than one synonym. In some cases, developers and Quality Assurance (QA) personnel may employ different terminology even though they are referring to the same scenario with equivalent meanings. For example, the term 'finalize' may have synonyms such as 'conclude' or 'complete'. When describing a situation, if a developer uses 'finalize' but QA opts for 'conclude', it is challenging for the system to identify this variance without considering the semantic meanings of these words. Therefore, the extraction of semantic meaning is crucial for achieving accurate rankings of vulnerable functions.

The final stage of code corpora generation applies WordNet lemmatization, a technique that normalizes words to their base or dictionary form. WordNet lemmatization utilizes the comprehensive WordNet lexical database, which organizes words into synonymous sets called synsets. This method identifies word lemmas based on the word's part of speech and context within WordNet, offering a more context-aware approach to lemmatization. As a result, it considers a word's meaning and contextual usage, allowing for precise reduction of words. For instance, it transforms "running" to "run" and "better" to "good" based on their meanings and parts of speech, unlike standard approaches that rely purely on suffix removal.

¹ https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset
² https://wordnet.princeton.edu/
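To make the corpus-generation steps above concrete, the following is a minimal sketch of the pre-processing pipeline (tokenization on syntax-specific separators, multiword splitting, stopword and keyword removal, and WordNet lemmatization), assuming NLTK as named in Section 4.1. The keyword list, separator pattern, and the function name to_corpus are illustrative choices and not taken from the VFDetector implementation; synonym expansion via WordNet synsets is omitted for brevity.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK data packages "stopwords" and "wordnet" to be downloaded once.
lemmatizer = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))

# Illustrative subset of language-specific keywords to discard (not the authors' list).
CPP_KEYWORDS = {"int", "if", "float", "switch", "case", "struct",
                "void", "return", "for", "while", "else", "char"}

# Syntax-specific separators (',', '=', '(', ')', '{', '}', '/', ...) and whitespace.
SEPARATORS = re.compile(r"[,=(){}/;:.\[\]<>*&|+\-!'\"\s]+")
# Zero-width split points for multiword identifiers: "addRequest" -> "add", "Request".
MULTIWORD = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|_")


def to_corpus(text: str) -> list[str]:
    """Turn a function body (or a report description) into a list of corpus terms."""
    terms = []
    for token in SEPARATORS.split(text):
        for word in MULTIWORD.split(token):
            word = word.lower()
            if not word or word in STOPWORDS or word in CPP_KEYWORDS:
                continue
            terms.append(lemmatizer.lemmatize(word))  # WordNet lemmatization
    return terms
```

The same routine can be applied to a vulnerability report's description to obtain the report corpus described in Section 3.3, since that step likewise relies on stopword removal and WordNet lemmatization.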
Figure 1: Overview of VFDetector

3.3. Vulnerability Report Corpora

A software vulnerability report contains information such as a description of the vulnerability, a severity rating, a vulnerability identifier (CVE-ID), and references to additional sources, which give valuable insights about a software vulnerability issue. However, these reports can also include irrelevant terms such as stop words and words in various tenses (present, past, or future). To refine vulnerability reports, pre-processing is necessary. In the initial stage of vulnerability report corpus creation, stop words are eliminated. We then apply the WordNet Lemmatizer, as in source code corpora generation, to generate a refined report corpus containing only relevant terms.

3.4. Ranking Vulnerable Functions

In this step, relevant vulnerable functions are ranked based on the textual similarity between the query (report corpus) and each function in the code corpus. Functions are ranked by applying tVSM, which modifies the Vector Space Model (VSM) by emphasising large functions. In traditional VSM, the cosine similarity is used to measure the ranking score between the vector representations of a report corpus (r) and a function (f), according to Equation 1.

$$Similarity(r, f) = \cos(r, f) = \frac{\vec{V}_r \cdot \vec{V}_f}{|\vec{V}_r| \cdot |\vec{V}_f|} \qquad (1)$$

Here, $\vec{V}_r$ and $\vec{V}_f$ are the term vectors of the vulnerability report (r) corpus and the function (f) corpus respectively. Over the years, numerous adaptations of the $tf(t, d)$ function have been introduced with the aim of enhancing the VSM model's effectiveness. These encompass logarithmic, augmented, and Boolean modifications of the traditional VSM [15]. It has been noted that the logarithmic version can yield improved performance, as indicated by prior studies [16, 17, 18]. From that point of view, tVSM modifies Equation 1 and uses the logarithm of the term frequency (tf) and the inverse function frequency (iff) to give more importance to rare terms in the functions.
Thus tf and iff are calculated using Equations 2 and 3 respectively.

$$tf(t, f) = 1 + \log f_{tf} \qquad (2)$$

$$iff = \log\left(\frac{\#functions}{n_t}\right) \qquad (3)$$

Here, $f_{tf}$ represents the frequency of a term $t$ appearing in a function $f$, $\#functions$ denotes the total count of functions within the search space, and $n_t$ signifies the number of functions that include the term $t$. Thus, in Equation 4, each term weight is calculated as follows:

$$weight_{t \in f} = tf(t, f) \times iff(t) = (1 + \log f_{tf}) \times \log\left(\frac{\#functions}{n_t}\right) \qquad (4)$$

The VSM score is calculated using Equation 5:

$$\cos(r, f) = \sum_{t \in r \cap f} (1 + \log f_{tr}) \times (1 + \log f_{tf}) \times iff^2 \times \frac{1}{\sqrt{\sum_t \left((1 + \log f_{tr}) \times iff\right)^2}} \times \frac{1}{\sqrt{\sum_t \left((1 + \log f_{tf}) \times iff\right)^2}} \qquad (5)$$

Traditional VSM tends to prefer smaller functions when ranking, which can be problematic because large functions may receive lower similarity scores. Past research [19, 20, 21] has indicated that larger source code files are more likely to contain vulnerabilities. Therefore, in the context of vulnerability localization, it is crucial to prioritize larger functions in our ranking. To address this issue, we introduce a function denoted as x (as shown in Equation 6) within the tVSM model to account for the function's length.

$$x(terms) = 1 - e^{-Normalize(\#terms)} \qquad (6)$$

Equation 6 represents a logistic function, specifically an inverse logit function, designed to ensure that larger functions receive higher rankings. We employ Equation 6 to calculate the length value for each source code function based on the number of terms contained within the function. Here we apply the normalized value of #terms as the argument of the exponential function $e^{-x}$. The normalization process is defined in Equation 7.

Let z denote a set of data, with $z_{max}$ and $z_{min}$ representing the maximum and minimum values of z respectively. The normalized value of a term z is determined as:

$$Normalize(z) = \frac{z - z_{min}}{z_{max} - z_{min}} \qquad (7)$$

Considering the above analysis, the tVSM score is calculated by multiplying the weight of each function, denoted x(terms), with the cosine similarity score cos(r, f), as described in Equation 8:

$$tVSM(r, f) = x(terms) \times \cos(r, f) \qquad (8)$$

Once the tVSM score of each function has been computed, the list of functions is arranged in descending order of scores. The function with the highest similarity score is positioned at the top of the ranked list.

4. Experiment and Result Analysis

This section describes the implementation, the evaluation criteria, and the experimental result analysis of this study.

4.1. Implementation

The proposed method is implemented in Python (version 3.11.5). The experiment was conducted on a Windows server equipped with an Intel(R) Core(TM) i5-10300H CPU @ 3.0 GHz and 64 GB of RAM. The implementation uses various Python libraries, including NLTK (Natural Language Toolkit), for text processing and feature extraction. It takes function files as input and provides a ranking of suspicious vulnerable functions as output.
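As a concrete illustration of Equations 2–8, the following is a minimal sketch of the tVSM scoring, assuming every function and the report have already been reduced to term lists by the pre-processing shown in Section 3.2. The names (tvsm_rank, weights, cosine) are illustrative rather than taken from the authors' code, and the min-max normalisation of Equation 7 is applied to function lengths as described above; this is a sketch of the ranking logic, not the VFDetector implementation itself.

```python
import math
from collections import Counter


def tvsm_rank(report_terms, functions):
    """Rank functions (dict: name -> term list) against a report term list
    using the tVSM score of Eq. 8: x(terms) * cos(r, f)."""
    n_funcs = len(functions)

    # iff (Eq. 3): log(#functions / n_t), where n_t = #functions containing term t.
    df = Counter()
    for terms in functions.values():
        df.update(set(terms))
    iff = {t: math.log(n_funcs / n) for t, n in df.items()}

    def weights(terms):
        # Eqs. 2 and 4: (1 + log f_t) * iff(t) for each term of a corpus.
        tf = Counter(terms)
        return {t: (1 + math.log(f)) * iff.get(t, 0.0) for t, f in tf.items()}

    def cosine(wr, wf):
        # Eq. 5: cosine similarity over the shared terms of report and function.
        num = sum(wr[t] * wf[t] for t in wr.keys() & wf.keys())
        den = (math.sqrt(sum(w * w for w in wr.values()))
               * math.sqrt(sum(w * w for w in wf.values())))
        return num / den if den else 0.0

    wr = weights(report_terms)
    sizes = {name: len(terms) for name, terms in functions.items()}
    zmin, zmax = min(sizes.values()), max(sizes.values())

    scores = {}
    for name, terms in functions.items():
        # Eq. 7: min-max normalisation of the function length (#terms).
        norm = (sizes[name] - zmin) / (zmax - zmin) if zmax > zmin else 0.0
        x = 1 - math.exp(-norm)                        # Eq. 6: length weight
        scores[name] = x * cosine(wr, weights(terms))  # Eq. 8: tVSM score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The descending list returned here is what the Top N Rank evaluation in Section 4.2 inspects.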

Table 1
List of Analyzed Open Source Projects

  #   Project Name   CVE ID             Source Code
 1    Chrome         CVE-2011-3916      github.com/chromium/chromium/tree/f1a59e0513d63758588298e98500cac82ddccb67
 2    Radare2        CVE-2017-16359     github.com/radareorg/radare2/tree/1f5050868eedabcbf2eda510a05c93577e1c2cd5
 3    Linux          CVE-2013-6763      github.com/torvalds/linux/tree/f9ec2e6f7991e748e75e324ed05ca2a7ec360ebb
 4    Linux          CVE-2013-2094      github.com/torvalds/linux/tree/41ef2d5678d83af030125550329b6ae8b74618fa
 5    Linux          CVE-2014-2038      github.com/torvalds/linux/tree/a9ab5e840669b19aca2974e2c771a77df2876434
 6    ImageMagick    CVE-2017-15033     github.com/ImageMagick/ImageMagick/tree/c29d15c70d0eda9d7ffe26a0ccc181f4f0a07ca5
 7    Tcpdump        CVE-2017-13000     github.com/the-tcpdump-group/tcpdump/tree/a7e5f58f402e6919ec444a57946bade7dfd6b184
 8    Tcpdump        CVE-2018-14470     github.com/the-tcpdump-group/tcpdump/tree/aa3e54f594385ce7e1e319b0c84999e51192578b
 9    FFmpeg         CVE-2016-10190     github.com/FFmpeg/FFmpeg/tree/51020adcecf4004c1586a708d96acc6cbddd050a
 10   FFmpeg         CVE-2019-11339     github.com/FFmpeg/FFmpeg/tree/3f086a2f665f9906e0f6197cddbfacc2f4b093a1




Table 2
Summary of Tested Projects

   CVE ID            Total Commits Total Files Total Functions           Vulnerable Functions          VFDetector Ranking
   CVE-2017-13000        4,466          180          638              extract_header_length()                  1
   CVE-2018-14470        4,548          185          640                   babel_print_V2()                    1
   CVE-2017-15033       12,558          586         4,694                   ReadYUVImage()                     1
   CVE-2017-16359       16,362          965         9,197         store_versioninfo_gnu_verdef()               1
   CVE-2011-3916        93,104         4,929        15,042                   WebGLObject()                     2
   CVE-2019-11339       93,322         2,572        16,724          mpeg4_decode_studio_block()                2
   CVE-2016-10190       82,768         2,286        14,713                 http_buf_read()                     3
   CVE-2014-2038        413,259       15,674       229,682             nfs_can_extend_write()                  4
   CVE-2013-2094        362,534       18,358       257,550               perf_swevent_init()                   5
   CVE-2013-6763        401,141       19,260       273,898               uio_mmap_physical()                   7
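For reference, the Top N Rank figures reported in Section 4.2 below can be tallied directly from the per-report ranks in the last column of Table 2; a minimal sketch (the rank values are those listed above):

```python
# Ranks of the actual vulnerable function for the ten CVE reports in Table 2.
ranks = [1, 1, 1, 1, 2, 2, 3, 4, 5, 7]


def top_n(ranks, n):
    """Share of reports whose vulnerable function appears within the top n."""
    return sum(r <= n for r in ranks) / len(ranks)


for n in (1, 5, 7):
    print(f"Top {n}: {top_n(ranks, n):.0%}")  # Top 1: 40%, Top 5: 90%, Top 7: 100%
```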



4.2. Evaluation

To conduct this research we used the extensive Big-Vul dataset, which contains large-scale vulnerability reports for C/C++ code from open-source GitHub projects; other C/C++ datasets could also be used. Based on the highest number of reported vulnerabilities, we chose the top six well-known projects from this dataset: Chrome, Linux, Radare2, ImageMagick, Tcpdump and FFmpeg, as shown in Table 1. The selected projects are open-source and hosted on GitHub, which serves as the primary platform for storing their code and managing version control and allows us to extract all commits essential for our analysis. Additional information about the repositories can be found in Table 2. As an initial study, VFDetector was evaluated using ten vulnerability reports chosen randomly from these six open-source projects. Table 1 lists the analysed project name, the CVE ID of each report, and the source code link.

To measure the effectiveness of the proposed vulnerability detection method, we use the Top N Rank metric. This metric counts the vulnerable functions ranked in the top N (where N is 1, 5, or 7) of the obtained results. When assessing a reported vulnerability, if the top N query results include at least one function that corresponds to the location where the vulnerability needs to be addressed, we consider the vulnerable function successfully detected. Table 2 lists the ten vulnerability reports from the six open-source projects with their number of commits, total files, total functions, the actual vulnerable function's name, and finally VFDetector's ranking of that function. The results in Table 2 show that among the ten CVE reports, VFDetector ranks the actual vulnerable function at the 1st position for four (40%) reports, namely CVE IDs #13000, #14470, #15033 and #16359. For five reports (50%) with CVE IDs #3916, #2094, #2038, #10190 and #11339, it ranks the vulnerable function within the top 5. This indicates that nine (90%) of the reports are ranked within the top 5. For one report, CVE-2013-6763 of Linux kernel version 3.12.1, it ranks the vulnerable function within the top 7, i.e., in 7th position out of a total of 273,898 functions. Upon manual inspection, we observed that the six functions preceding the vulnerable function exhibit a higher similarity score than the actual vulnerable function. The reason for this can be the interconnectedness of these six functions with the vulnerable function through function calls. It is also noticeable that projects with fewer functions rank the vulnerable function in 1st position, while for projects with a large number of functions the ranking decreases slightly. The reason is that larger projects might contain more associated functions which need to be fixed in order to address a particular vulnerability.

In summary, the experimental results show that VFDetector can detect vulnerable functions from a huge volume of functions and can also suggest to developers the related functions with the highest similarity scores, which might also need to be patched to address the reported vulnerability. Moreover, to the best of our knowledge, we are the first to incorporate vulnerability reports in software vulnerability detection, based on the observation that a vulnerability report's description contains conceptual information about the reported vulnerability. Based on the promising results of this initial evaluation, future work can analyse more vulnerability reports from diverse projects to make the approach comparable and generalizable.

5. Threats to Validity

In this section, we discuss the potential threats which may affect the validity of this study.

Threats to external validity: The generalizability of the acquired results poses a threat to external validity. The dataset used in our research was gathered from open-source projects, which may contain data that differs from those created by software companies with sound management practices. Six open-source projects are examined in this study; more projects from other systems need to be evaluated for generalisation. To overcome this threat, large-scale
diversified projects with a long change history should be chosen.

Threats to internal validity: One limitation of our approach is its reliance on sound programming practices when naming variables, methods, and classes. If a developer uses non-meaningful names, it could have an adverse impact on the effectiveness of vulnerability detection. Moreover, the analysed functions may not fully represent the characteristics of the whole program. Additionally, our model is evaluated on C/C++ functions and may encounter challenges in detecting vulnerabilities in other programming languages.

Threats to construct validity: We used the WordNet database and the lemmatizer of the NLTK library as essential components of text pre-processing to extract word semantics and reduce words to their base forms. Since these resources are well known for their usefulness in NLP, we relied on their accuracy. Moreover, vulnerability reports offer essential information that developers rely on to address and patch vulnerable functions, and a bad vulnerability report delays the fixing process. Our approach is therefore dependent on the quality of these reports: if a vulnerability report lacks sufficient information or contains misleading details, it can have a detrimental impact on the performance of VFDetector.

6. Conclusion

Once a new vulnerability is reported, developers need to know which files and, in particular, which functions should be modified to fix the issue. This can be especially challenging in large software projects, where examining numerous source code files can be time-consuming and costly.

In this paper, a software vulnerability detection technique named VFDetector has been proposed for detecting relevant vulnerable functions based on vulnerability reports. Since detecting vulnerable functions from a vulnerability report is an information retrieval process, we apply static analysis to both source code and vulnerability reports to create code and report corpora. Finally, VFDetector leverages a tweaked Vector Space Model (tVSM) to rank the source code functions based on similarity. Our evaluation on six real-world open-source projects shows that VFDetector ranks the actual vulnerable function at the 1st position in 40% of cases and within the top 7 in all cases.

In future, VFDetector can be applied to industrial projects to assess the generalization of the results in practice. Besides, dynamic analysis can be incorporated into this approach to improve detection performance. Moreover, minimizing the search space within a function and pinpointing statement-level vulnerabilities is also a potential future direction.

References

[1] J. Han, D. Gao, R. H. Deng, On the effectiveness of software diversity: A systematic study on real-world vulnerabilities, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, 2009, pp. 127–146.
[2] H. Alves, B. Fonseca, N. Antunes, Software metrics and security vulnerabilities: dataset and exploratory study, in: 2016 12th European Dependable Computing Conference (EDCC), IEEE, 2016, pp. 37–44.
[3] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, M. McConley, Automated vulnerability detection in source code using deep representation learning, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 757–762.
[4] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, Y. Zhong, VulDeePecker: A deep learning-based system for vulnerability detection, in: Proceedings of the 25th Annual Network and Distributed System Security Symposium, 2018.
[5] D. Zou, S. Wang, S. Xu, Z. Li, H. Jin, µVulDeePecker: A deep learning-based system for multiclass vulnerability detection, IEEE Transactions on Dependable and Secure Computing 18 (2019) 2224–2236.
[6] Y. Zhou, S. Liu, J. Siow, X. Du, Y. Liu, Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks, Advances in Neural Information Processing Systems 32 (2019).
[7] X. Cheng, H. Wang, J. Hua, G. Xu, Y. Sui, DeepWukong: Statically detecting software vulnerabilities using deep graph neural network, ACM Transactions on Software Engineering and Methodology (TOSEM) 30 (2021) 1–33.
[8] F. Yamaguchi, N. Golde, D. Arp, K. Rieck, Modeling and discovering vulnerabilities with code property graphs, in: 2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 590–604.
[9] F. Yamaguchi, M. Lottmann, K. Rieck, Generalized vulnerability extrapolation using abstract syntax trees, in: Proceedings of the 28th Annual Computer Security Applications Conference, 2012, pp. 359–368.
[10] S. Neuhaus, T. Zimmermann, C. Holler, A. Zeller, Predicting vulnerable software components, in: Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007, pp. 529–540.
[11] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Z. Chen, SySeVR: A framework for using deep learning to detect software vulnerabilities, IEEE Transactions on Dependable and Secure Computing 19 (2021) 2244–2258.
[12] Y. Wu, D. Zou, S. Dou, W. Yang, D. Xu, H. Jin, VulCNN: An image-inspired scalable vulnerability detection system, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 2365–2376.
[13] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Y. Zhang, Z. Chen, D. Li, VulDeeLocator: A deep learning-based system for detecting and locating software vulnerabilities, IEEE Transactions on Dependable and Secure Computing (2021).
[14] J. Fan, Y. Li, S. Wang, T. N. Nguyen, A C/C++ code vulnerability dataset with code changes and CVE summaries, in: Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 508–512.
[15] H. Schütze, C. D. Manning, P. Raghavan, Introduction to Information Retrieval, volume 39, Cambridge University Press, Cambridge, 2008.
[16] W. B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, volume 520, Addison-Wesley, Reading, 2010.
[17] S. Rahman, K. Sakib, An appropriate method ranking approach for localizing bugs using minimized search space, in: ENASE, 2016, pp. 303–309.
[18] S. Rahman, M. M. Rahman, K. Sakib, A statement level bug localization technique using statement dependency graph, in: ENASE, 2017, pp. 171–178.
[19] N. E. Fenton, N. Ohlsson, Quantitative analysis of faults and failures in a complex software system, IEEE Transactions on Software Engineering 26 (2000) 797–814.
[20] T. J. Ostrand, E. J. Weyuker, R. M. Bell, Predicting the location and number of faults in large software systems, IEEE Transactions on Software Engineering 31 (2005) 340–355.
[21] H. Zhang, An investigation of the relationships between lines of code and defects, in: 2009 IEEE International Conference on Software Maintenance, IEEE, 2009, pp. 274–283.



