Identifying Vulnerable Functions from Source Code using Vulnerability Reports

Rabaya Sultana Mim, Toukir Ahammed and Kazi Sakib
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh

Abstract
A software vulnerability is a flaw within a software product that can be exploited to make the system violate its security. In large and evolving software systems, developers find it challenging to identify vulnerable functions effectively when a new vulnerability is reported. Existing studies have underutilized vulnerability reports, which can be a good source of contextual information for identifying vulnerable functions in source code. This study proposes an information retrieval based method called Vulnerable Functions Detector (VFDetector) for identifying vulnerable functions from source code and vulnerability reports. VFDetector ranks vulnerable functions based on the textual similarity between the vulnerability report corpus and the source code corpora. The ranking is produced by modifying the conventional Vector Space Model to incorporate the size of a function, which we call the tweaked Vector Space Model (tVSM). As an initial study, the approach has been evaluated by analysing 10 vulnerability reports from six popular open-source projects. The results show that VFDetector ranks the actual vulnerable function at the first position in 40% of cases. Moreover, it ranks the actual vulnerable function within the top 5 in 90% of cases and within the top 7 for all analysed reports. Developers can therefore use these rankings to implement patches for vulnerable functions more quickly.

Keywords: vulnerability identification, vulnerable function, vulnerability report, source code, vector space model

1. Introduction
A software vulnerability is a flaw, weakness, or error in a computer program or system that can be exploited by malicious attackers to compromise its integrity, availability, or confidentiality [1]. Software vulnerabilities make software systems increasingly open to attack and damage, which raises security concerns [2].

Developers need to spend a lot of time identifying a vulnerable function in a large codebase when a new vulnerability is reported. Identifying vulnerable functions effectively is a prerequisite for writing a patch for the reported vulnerability, and doing so at the earliest possible time is essential for enhancing software security and mitigating potential risks and threats.

Existing studies have focused on detecting software vulnerabilities using text-based [3, 4, 5] or graph-based [6, 7] approaches. These approaches either treat source code as plain text or apply graph analysis by representing the source code as a graph. In practice, prior text-based studies treat source code as plain text and apply static program analysis or natural language processing; however, their performance is not optimal because they disregard source code semantics. Graph-based approaches, on the other hand, conduct program analysis that represents the source code semantics as a graph and then apply graph analysis methods such as Graph Neural Networks (GNN) [8] to identify vulnerabilities. Although these graph-based approaches are more effective at identifying vulnerabilities because they take into account the semantic relationships among lines of source code, their scalability is substantially lower than that of text-based approaches.

However, existing studies have underutilized vulnerability reports, which can be a good source of contextual information for detecting vulnerabilities in source code. In this context, the current study aims to verify the role of vulnerability reports in identifying vulnerable functions. A vulnerability report can contain contextual information about a vulnerability which may be used to identify vulnerable functions: when a function is vulnerable in a given scenario, some keywords should be shared between that function and the vulnerability report. This motivates the authors to study whether vulnerable functions can be identified by analysing the source code together with the vulnerability report.

For this purpose, this study proposes a technique for automatic identification of vulnerable functions, named VFDetector. It takes all source code files of a system as input. First, it extracts all source code functions of that system. Then static analysis is performed to extract the contents of those functions. Several text pre-processing steps, such as tokenization, stopword removal, multiword splitting, semantic meaning extraction and lemmatization, are applied to the source code along with the vulnerability report to produce the code and report corpora. In addition, programming-language-specific keywords are removed when generating the code corpora. Finally, to rank the vulnerable functions, similarity scores are measured between the code corpus of each function and the report corpus using a modified version of the Vector Space Model (tVSM), in which larger functions receive more weight during ranking.

In the experiments, as an initial study, ten Common Vulnerabilities and Exposures (CVE) reports were chosen randomly from six open-source GitHub repositories. Based on the commit link available in each report, we retrieved the corresponding project version from before the vulnerability was patched. The result analysis shows that VFDetector ranks the vulnerable function at the first position in 40% of cases, within the top 5 in 90% of cases, and within the top 7 in 100% of cases.

It is evident from the results that VFDetector performs promisingly in detecting vulnerable functions against a vulnerability report in large-scale software systems. It is also observed that, in the top-5 and top-7 rankings, the functions ranked above the actual vulnerable function are functions related to that vulnerability which acquire higher similarity. This can guide a developer to patch those related functions too in order to mitigate the vulnerability in the system.

The remainder of this paper is structured as follows: Section 2 gives an overview of previous studies on vulnerability detection at the file or function level. Section 3 describes our methodology for detecting vulnerable functions in a project. Section 4 reports our experimental findings and their analysis. Section 5 discusses the threats to the validity of our work. Section 6 motivates future research directions and concludes the paper.

QuASoQ 2023: 11th International Workshop on Quantitative Approaches to Software Quality, December 04, 2023, Seoul, Korea. msse1730@iit.du.ac.bd (R. S. Mim); toukir@iit.du.ac.bd (T. Ahammed); sakib@iit.du.ac.bd (K. Sakib). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Related Work

In recent years, the research community has directed significant attention toward vulnerability detection, primarily due to the complex challenges it presents, and the existing literature has introduced numerous methodologies in response. These methods can be classified into three categories based on the degree of automation: manual, semi-automatic, and fully automatic techniques.

Manual techniques rely on human experts to create vulnerability patterns. However, not all patterns can be generated manually, which leads to reduced detection efficiency in practical scenarios. In contrast, semi-automatic techniques involve human experts in the extraction of specific features such as API symbols [9] and function calls [10], which are then fed into traditional machine learning models for vulnerability detection. Fully automatic techniques utilize Deep Learning (DL) to extract features and construct vulnerability patterns without manual expert intervention. Recently, DL-based techniques [11, 12, 13] have gained extensive use in detecting source code vulnerabilities due to their ability to automatically extract features from source code. DL-based methods can be categorized into text-based and graph-based methods.

Text-based methods: The text-based approach to vulnerability detection treats a program's source code as text and employs natural language processing techniques to identify vulnerabilities. Russell et al. [3] introduced the TokenCNN model, which utilizes lexical analysis to acquire source code tokens and employs a Convolutional Neural Network (CNN) to detect vulnerabilities. Li et al. [4] proposed VulDeePecker, a method that collects code gadgets by slicing programs, transforms them into vector representations, and trains a Bidirectional Long Short-Term Memory (BLSTM) model for vulnerability recognition. Zou et al. [5] introduced µVulDeePecker, which enhances VulDeePecker by incorporating code attention with control dependence to detect multi-class vulnerabilities. However, the performance of these text-based approaches is limited because they rely solely on static source code analysis and do not account for program semantics.

Graph-based methods: To address the limitations of text-based methods, researchers have turned to program analysis that converts a program's source code semantics into a graph representation, facilitating vulnerability detection through graph analysis. Zhou et al. [6] introduced Devign, which employs a graph neural network for vulnerability identification. This approach includes a convolutional module that efficiently extracts critical features for graph-level classification from the learned node representations; by pooling the nodes, a comprehensive representation for graph-level classification is achieved. Cheng et al. [7] introduced a different approach named DeepWukong, which divides the program dependence graph into subgraphs after distilling the program semantics based on program points of interest; these subgraphs are then used to train a vulnerability detector through a graph neural network. While these graph-based techniques prove more effective in identifying vulnerabilities, their scalability is worse than that of text-based methods due to the large number of graph nodes in complex programs.

Exploring the existing literature, it is evident that text-based methods lack program semantics, while graph-based methods achieve high accuracy by considering source code semantics but have scalability issues in complex scenarios. Moreover, due to the underutilization of contextual information such as vulnerability reports alongside source code, existing methods fail to detect complicated vulnerabilities in real-world projects: whenever a new vulnerability is reported in a system, it is hard to determine in which function the vulnerability exists, as the system consists of a huge volume of functions. Before using vulnerability reports as a source of contextual information in existing methods, it is important to verify whether vulnerable functions can be identified effectively using these reports. Moreover, identifying vulnerable functions using vulnerability reports can play an effective role in minimizing the search space of existing methods.

3. Methodology

This study proposes an approach that detects vulnerable functions among the huge volume of files of a large software system using vulnerability reports. The overall process consists of three distinct steps: Source Code Corpora Generation, Vulnerability Report Corpora Generation, and Ranking Vulnerable Functions. Each of these steps encompasses a series of tasks, as illustrated in Figure 1. At first, all files and their corresponding functions are extracted from a particular version of a software system. The source code is then processed to create the code corpora. Similarly, the vulnerability report is processed to produce the report corpus. Finally, the similarity between the report and code corpora is measured using the tweaked Vector Space Model (tVSM) to rank the vulnerable source code functions.

Figure 1: Overview of VFDetector

3.1. Dataset

We used the benchmark dataset Big-Vul(1) developed by Fan et al. [14]. This dataset comprises reliable and comprehensive code vulnerabilities directly linked to the publicly accessible CVE database. Notably, the creation of this dataset involved a significant investment of manual resources to ensure its high quality, and it is one of the most extensive vulnerability datasets available. It is derived from a collection of 348 open-source GitHub projects, spans the years 2002 to 2019, and covers 91 distinct Common Weakness Enumeration (CWE) categories. The dataset comprises approximately 188,600 C/C++ functions, with 5.6% of them identified as vulnerable (roughly 10,500 vulnerable functions). It provides granular ground-truth information at the function level, specifying which functions within a codebase are susceptible to vulnerabilities.

(1) https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset

3.2. Source Code Corpora Generation

Source code corpora consist of the source code terms used to assess similarity with the vulnerability report corpus. Therefore, the precision of code corpora generation directly impacts the precision of matching, and consequently the accuracy of vulnerability localization. In this step, all folders of a system are traversed with their corresponding C/C++ files, and from each file all functions are extracted automatically into individual C files, which ensures function-level analysis. For example, CVE-2014-2038 of Linux version 3.13.5 involves 15,675 files containing a total of 229,682 functions.

This stage generates a vector of lexical tokens by performing lexical analysis on every source code file. Source code contains unnecessary tokens that do not carry any vulnerability-related information; these are discarded, such as programming-language-specific keywords (e.g., int, if, float, switch, case, struct) and stop words (e.g., all, and, an, the). Many terms in source code are compounds of multiple words. For example, the term "addRequest" consists of the keywords "add" and "Request". Such multiwords are separated using a multiword identifier. Furthermore, statements are divided according to syntax-specific separators such as ',', '=', '(', ')', '{', '}', '/', and so on. WordNet(2) is used to derive each word's semantic meanings, because a term may have more than one synonym. In specific cases, developers and Quality Assurance (QA) personnel may employ different terminology even though they refer to the same scenario with equivalent meanings. For example, the term 'finalize' may have synonyms such as 'conclude' or 'complete'. If a developer uses 'finalize' while QA opts for 'conclude' when describing a situation, it is challenging for the system to match these variants without considering the semantic meanings of the words. Therefore, semantic meaning extraction is crucial for achieving accurate rankings of vulnerable functions.

The final stage of code corpora generation applies WordNet lemmatization, a technique that normalizes words to their base or dictionary form. WordNet lemmatization utilizes the comprehensive WordNet lexical database, which organizes words into synonymous sets called synsets. This method identifies word lemmas based on the word's part of speech and context within WordNet, offering a more context-aware approach to lemmatization. As a result, it considers a word's meaning and contextual usage, allowing precise reduction of words. For instance, it transforms "running" to "run" and "better" to "good" based on their meanings and parts of speech, unlike standard stemming, which typically relies on suffix removal.

(2) https://wordnet.princeton.edu/
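The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not VFDetector's actual implementation: the keyword and stop-word lists are small illustrative subsets, and the lemmatization step simply falls back to the raw word when NLTK's WordNet corpus is not available.

```python
import re

# Illustrative (not exhaustive) sets of tokens to discard; the paper does
# not publish VFDetector's exact keyword and stop-word lists.
C_KEYWORDS = {"int", "if", "float", "switch", "case", "struct", "void", "return"}
STOP_WORDS = {"all", "and", "an", "the", "a", "of", "to", "in", "is"}

def split_multiword(token):
    # Multiword splitting: "addRequest" -> ["add", "request"]; camelCase
    # and snake_case identifiers are broken into their constituent words.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [part.lower() for part in spaced.split()]

def lemmatize(word):
    # WordNet lemmatization as in Section 3.2; falls back to the raw word
    # when NLTK or its WordNet corpus is unavailable in the environment.
    try:
        from nltk.stem import WordNetLemmatizer
        return WordNetLemmatizer().lemmatize(word)
    except Exception:
        return word

def build_corpus(code):
    # Tokenize on syntax-specific separators, split multiwords, then drop
    # language keywords, stop words and purely numeric tokens.
    tokens = []
    for raw in re.split(r"[^A-Za-z0-9_]+", code):
        for word in split_multiword(raw):
            if word and not word.isdigit() \
               and word not in C_KEYWORDS and word not in STOP_WORDS:
                tokens.append(lemmatize(word))
    return tokens
```

For instance, `build_corpus("int addRequest(struct request *req)")` yields `["add", "request", "request", "req"]`: the keywords `int` and `struct` are discarded, and `addRequest` is split into its constituent words.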
3.3. Vulnerability Report Corpora

A software vulnerability report contains information such as a description of the vulnerability, a severity rating, a vulnerability identifier (CVE-ID), and references to additional sources, which together give valuable insight into a software vulnerability issue. However, these reports can also include irrelevant terms such as stop words and words in various tenses (present, past, or future). To refine vulnerability reports, pre-processing is necessary. In the initial stage of report corpus creation, stop words are eliminated. We then apply the WordNet lemmatizer, as in source code corpora generation, to produce a refined report corpus containing only relevant terms.

3.4. Ranking Vulnerable Functions

In this step, relevant vulnerable functions are ranked based on the textual similarity between the query (the report corpus) and each function in the code corpus. Vulnerable functions are ranked by applying tVSM, which modifies the Vector Space Model (VSM) by emphasising large functions. In traditional VSM, cosine similarity is used to measure the ranking score between the vector representations of a report corpus (r) and a function (f), according to Equation 1:

Similarity(r, f) = cos(r, f) = (V_r · V_f) / (|V_r| × |V_f|)   (1)

Here, V_r and V_f are the term vectors for the vulnerability report (r) corpus and the function (f) corpus, respectively. Over the years, numerous adaptations of the tf(t, d) function have been introduced to enhance the VSM model's effectiveness, including logarithmic, augmented, and Boolean modifications of the traditional VSM [15]. It has been noted that the logarithmic version can yield improved performance, as indicated by prior studies [16, 17, 18]. From that point of view, tVSM modifies Equation 1 and uses the logarithm of the term frequency (tf) together with the inverse function frequency (iff) to give more importance to rare terms in the functions. tf and iff are calculated using Equations 2 and 3, respectively:

tf(t, f) = 1 + log f_tf   (2)

iff = log(#functions / n_t)   (3)

Here, f_tf represents the frequency of term t in function f, #functions denotes the total count of functions in the search space, and n_t is the number of functions that contain term t. Thus, in Equation 4, each term weight is calculated as:

weight(t, f) = tf(t, f) × iff(t) = (1 + log f_tf) × log(#functions / n_t)   (4)

The VSM score is calculated using Equation 5:

cos(r, f) = [ Σ_{t ∈ r∩f} (1 + log f_tr) × (1 + log f_tf) × iff² ] / [ sqrt(Σ_t ((1 + log f_tr) × iff)²) × sqrt(Σ_t ((1 + log f_tf) × iff)²) ]   (5)

Traditional VSM tends to favour smaller functions when ranking them, which can be problematic for large functions because they may receive lower similarity scores. Past research [19, 20, 21] has indicated that larger source code files are more likely to contain vulnerabilities. Therefore, in the context of vulnerability localization, it is crucial to prioritize larger functions in the ranking. To address this issue, we introduce a function denoted x (shown in Equation 6) within the tVSM model to account for the function's length:

x(terms) = 1 − e^(−Normalize(#terms))   (6)

Equation 6 is a monotonically increasing, saturating function of the normalized term count, designed to ensure that larger functions receive higher weights. We employ Equation 6 to calculate a length value for each source code function based on the number of terms it contains, applying the normalized value of #terms as the argument of the exponential e^(−x). The normalization process is defined in Equation 7. Let z denote a set of data, with z_max and z_min representing its maximum and minimum values, respectively. The normalized value of z is determined as:

Normalize(z) = (z − z_min) / (z_max − z_min)   (7)

Considering the above analysis, the tVSM score is calculated by multiplying the weight of each function, x(terms), with the cosine similarity score cos(r, f), as described in Equation 8:

tVSM(r, f) = x(terms) × cos(r, f)   (8)

Once the tVSM score for each function has been computed, the list of functions is sorted in descending order of score. The function with the highest similarity score is positioned at the top of the ranked list.

4. Experiment and Result Analysis

This section provides information on the practical implementation, the criteria used for evaluation, and the experimental result analysis of this study.

4.1. Implementation

The proposed method is implemented in Python (version 3.11.5). The experiments were conducted on a Windows machine equipped with an Intel(R) Core(TM) i5-10300H CPU @ 3.0 GHz and 64 GB of RAM. The implementation involves various Python libraries, including NLTK (Natural Language Toolkit) for text processing and feature extraction. It takes function files as input and provides a ranking of suspicious vulnerable functions as output.

Table 1: List of Analyzed Open Source Projects

# | Project Name | CVE ID | Source Code
1 | Chrome | CVE-2011-3916 | github.com/chromium/chromium/tree/f1a59e0513d63758588298e98500cac82ddccb67
2 | Radare2 | CVE-2017-16359 | github.com/radareorg/radare2/tree/1f5050868eedabcbf2eda510a05c93577e1c2cd5
3 | Linux | CVE-2013-6763 | github.com/torvalds/linux/tree/f9ec2e6f7991e748e75e324ed05ca2a7ec360ebb
4 | Linux | CVE-2013-2094 | github.com/torvalds/linux/tree/41ef2d5678d83af030125550329b6ae8b74618fa
5 | Linux | CVE-2014-2038 | github.com/torvalds/linux/tree/a9ab5e840669b19aca2974e2c771a77df2876434
6 | ImageMagick | CVE-2017-15033 | github.com/ImageMagick/ImageMagick/tree/c29d15c70d0eda9d7ffe26a0ccc181f4f0a07ca5
7 | Tcpdump | CVE-2017-13000 | github.com/the-tcpdump-group/tcpdump/tree/a7e5f58f402e6919ec444a57946bade7dfd6b184
8 | Tcpdump | CVE-2018-14470 | github.com/the-tcpdump-group/tcpdump/tree/aa3e54f594385ce7e1e319b0c84999e51192578b
9 | FFmpeg | CVE-2016-10190 | github.com/FFmpeg/FFmpeg/tree/51020adcecf4004c1586a708d96acc6cbddd050a
10 | FFmpeg | CVE-2019-11339 | github.com/FFmpeg/FFmpeg/tree/3f086a2f665f9906e0f6197cddbfacc2f4b093a1
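The tVSM scoring of Section 3.4 (Equations 2 to 8) can be sketched in pure Python. This is a minimal illustration under the paper's definitions, not the authors' implementation; the report tokens and function token lists used when calling it are illustrative.

```python
import math

def term_weights(tokens, iff):
    # weight(t) = (1 + log f_t) * iff(t), per Equations 2-4
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {t: (1 + math.log(c)) * iff.get(t, 0.0) for t, c in counts.items()}

def cosine(wr, wf):
    # Equation 5: cosine similarity of the weighted term vectors
    shared = set(wr) & set(wf)
    num = sum(wr[t] * wf[t] for t in shared)
    den = math.sqrt(sum(w * w for w in wr.values())) * \
          math.sqrt(sum(w * w for w in wf.values()))
    return num / den if den else 0.0

def rank(report_tokens, functions):
    # functions maps a function name to its token list (its code corpus)
    n = len(functions)
    df = {}
    for toks in functions.values():
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    # iff = log(#functions / n_t), per Equation 3
    iff = {t: math.log(n / d) for t, d in df.items()}
    sizes = {name: len(toks) for name, toks in functions.items()}
    lo, hi = min(sizes.values()), max(sizes.values())
    wr = term_weights(report_tokens, iff)
    scores = {}
    for name, toks in functions.items():
        norm = (sizes[name] - lo) / (hi - lo) if hi > lo else 0.0  # Equation 7
        x = 1 - math.exp(-norm)                                    # Equation 6
        scores[name] = x * cosine(wr, term_weights(toks, iff))     # Equation 8
    # Descending tVSM score: most suspicious function first
    return sorted(scores, key=scores.get, reverse=True)
```

Given a tokenized report and a map of function corpora, `rank` returns the function names ordered by tVSM score; the length factor x(terms) boosts larger functions, as the model intends.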
Table 2: Summary of Tested Projects

CVE ID | Total Commits | Total Files | Total Functions | Vulnerable Function | VFDetector Ranking
CVE-2017-13000 | 4,466 | 180 | 638 | extract_header_length() | 1
CVE-2018-14470 | 4,548 | 185 | 640 | babel_print_V2() | 1
CVE-2017-15033 | 12,558 | 586 | 4,694 | ReadYUVImage() | 1
CVE-2017-16359 | 16,362 | 965 | 9,197 | store_versioninfo_gnu_verdef() | 1
CVE-2011-3916 | 93,104 | 4,929 | 15,042 | WebGLObject() | 2
CVE-2019-11339 | 93,322 | 2,572 | 16,724 | mpeg4_decode_studio_block() | 2
CVE-2016-10190 | 82,768 | 2,286 | 14,713 | http_buf_read() | 3
CVE-2014-2038 | 413,259 | 15,674 | 229,682 | nfs_can_extend_write() | 4
CVE-2013-2094 | 362,534 | 18,358 | 257,550 | perf_swevent_init() | 5
CVE-2013-6763 | 401,141 | 19,260 | 273,898 | uio_mmap_physical() | 7

4.2. Evaluation

To conduct this research we used the extensive Big-Vul dataset, which contains large-scale vulnerability reports for C/C++ code from open-source GitHub projects; other C/C++ datasets could also be used. Based on the highest number of reported vulnerabilities, we chose the top six well-known projects from this dataset: Chrome, Linux, Radare2, ImageMagick, Tcpdump and FFmpeg, as shown in Table 1. Since the selected projects are open-source and hosted on GitHub, the primary platform for storing code and managing version control, we were able to extract all the commits essential for our analysis. Additional information about the repositories can be found in Table 2. As an initial study, VFDetector was evaluated using ten vulnerability reports from these six open-source projects, chosen randomly from the dataset. Table 1 lists each analysed project name, the CVE ID of the report, and the source code link.

To measure the effectiveness of the proposed vulnerability detection method, we use the Top-N Rank metric. This metric signifies the count of vulnerable functions ranked in the top N (where N can be 1, 5, or 7) of the obtained results. When assessing a reported vulnerability, if the top N query results include at least one function corresponding to the location where the vulnerability needs to be addressed, we consider the vulnerable function successfully detected. Table 2 lists the ten vulnerability reports from the six open-source projects with their number of commits, total files, total functions, the actual vulnerable function's name, and finally VFDetector's rank for that function.

The results in Table 2 show that among the ten CVE reports, VFDetector ranks the actual vulnerable function at the 1st position for four reports (40%): CVE IDs #13000, #14470, #15033 and #16359. For five reports (50%), with CVE IDs #3916, #2094, #2038, #10190 and #11339, it ranks the vulnerable function within the top 5; hence nine reports (90%) are ranked in the top 5. For one report, CVE-2013-6763 of Linux kernel version 3.12.1, it ranks the vulnerable function in the top 7, i.e., in 7th position out of 273,898 functions in total. Upon manual inspection, we observed that the six functions preceding the vulnerable function exhibit a higher similarity score than the actual vulnerable function; the reason behind this can be the interconnectedness of these six functions with the vulnerable function through function calls. It is also noticeable that for projects with fewer functions the vulnerable function is ranked in 1st position, while for projects with a large number of functions the ranking decreases slightly. The reason is that larger projects may contain more associated functions that also need to be fixed in order to address a particular vulnerability.

In summary, the experimental results show that VFDetector can detect vulnerable functions among a huge volume of functions and can also point developers to the related functions with the highest similarity scores, which might need to be patched to address the reported vulnerability. Moreover, to the best of our knowledge, we are the first to incorporate vulnerability reports into software vulnerability detection, based on the idea that a vulnerability report's description contains conceptual information about the reported vulnerability. Building on the promising results of this initial evaluation, future work can analyse more vulnerability reports from diverse projects to make the approach comparable and generalizable.

5. Threats to Validity

In this section, we discuss the potential threats that may affect the validity of this study.

Threats to external validity: The generalizability of the acquired results poses a threat to external validity. The dataset used in our research was gathered from open source. Open-source projects may contain data that differs from software created by companies with sound management practices. Only six open-source projects are examined in this study; more projects from other systems need to be evaluated for generalization. To mitigate this threat, large-scale, diversified projects with long change histories should be chosen.

Threats to internal validity: One limitation of our approach is its reliance on sound programming practices when naming variables, methods, and classes. If a developer uses non-meaningful names, it could have an adverse impact on the effectiveness of vulnerability detection. The analysed functions may also not fully represent the characteristics of the whole program. Additionally, our model is evaluated on C/C++ functions and may encounter challenges in detecting vulnerabilities in other programming languages.

Threats to construct validity: We used the WordNet database and the lemmatizer of the NLTK library as essential components of text pre-processing to extract word semantics and reduce words to their base forms. Since these resources are well known for their usefulness in NLP, we relied on their accuracy. Moreover, vulnerability reports offer essential information that developers rely on to address and patch vulnerable functions, and a bad vulnerability report delays the fixing process. It is worth noting that our approach depends on the quality of these reports: if a vulnerability report lacks sufficient information or contains misleading details, it can have a detrimental impact on the performance of VFDetector.

6. Conclusion

Once a new vulnerability is reported, developers need to know which files, and in particular which functions, should be modified to fix the issue. This can be especially challenging in large software projects, where examining numerous source code files can be time-consuming and costly.

In this paper, a software vulnerability detection technique named VFDetector has been proposed for detecting relevant vulnerable functions based on vulnerability reports. Since detecting vulnerable functions from a vulnerability report is an information retrieval process, we apply static analysis to both the source code and the vulnerability report to create the code and report corpora. VFDetector then leverages a tweaked Vector Space Model (tVSM) to rank the source code functions based on similarity. Our evaluation on six real-world open-source projects shows that VFDetector ranks the vulnerable function at the 1st position in most cases.

In future, VFDetector can be applied to industrial projects to assess the generalization of the results in practice. Besides, dynamic analysis can be incorporated into the approach to improve detection performance. Moreover, minimizing the search space within a function and pinpointing statement-level vulnerabilities is also a potential future scope.

References

[1] J. Han, D. Gao, R. H. Deng, On the effectiveness of software diversity: A systematic study on real-world vulnerabilities, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, 2009, pp. 127–146.
[2] H. Alves, B. Fonseca, N. Antunes, Software metrics and security vulnerabilities: dataset and exploratory study, in: 2016 12th European Dependable Computing Conference (EDCC), IEEE, 2016, pp. 37–44.
[3] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, M. McConley, Automated vulnerability detection in source code using deep representation learning, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 757–762.
[4] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, Y. Zhong, VulDeePecker: A deep learning-based system for vulnerability detection, in: Proceedings of the 25th Annual Network and Distributed System Security Symposium, 2018.
[5] D. Zou, S. Wang, S. Xu, Z. Li, H. Jin, µVulDeePecker: A deep learning-based system for multiclass vulnerability detection, IEEE Transactions on Dependable and Secure Computing 18 (2019) 2224–2236.
[6] Y. Zhou, S. Liu, J. Siow, X. Du, Y. Liu, Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks, Advances in Neural Information Processing Systems 32 (2019).
[7] X. Cheng, H. Wang, J. Hua, G. Xu, Y. Sui, DeepWukong: Statically detecting software vulnerabilities using deep graph neural network, ACM Transactions on Software Engineering and Methodology (TOSEM) 30 (2021) 1–33.
[8] F. Yamaguchi, N. Golde, D. Arp, K. Rieck, Modeling and discovering vulnerabilities with code property graphs, in: 2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 590–604.
[9] F. Yamaguchi, M. Lottmann, K. Rieck, Generalized vulnerability extrapolation using abstract syntax trees, in: Proceedings of the 28th Annual Computer Security Applications Conference, 2012, pp. 359–368.
[10] S. Neuhaus, T. Zimmermann, C. Holler, A. Zeller, Predicting vulnerable software components, in: Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007, pp. 529–540.
[11] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Z. Chen, SySeVR: A framework for using deep learning to detect software vulnerabilities, IEEE Transactions on Dependable and Secure Computing 19 (2021) 2244–2258.
[12] Y. Wu, D. Zou, S. Dou, W. Yang, D. Xu, H. Jin, VulCNN: An image-inspired scalable vulnerability detection system, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 2365–2376.
[13] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Y. Zhang, Z. Chen, D. Li, VulDeeLocator: A deep learning-based system for detecting and locating software vulnerabilities, IEEE Transactions on Dependable and Secure Computing (2021).
[14] J. Fan, Y. Li, S. Wang, T. N. Nguyen, A C/C++ code vulnerability dataset with code changes and CVE summaries, in: Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 508–512.
[15] H. Schütze, C. D. Manning, P. Raghavan, Introduction to Information Retrieval, Cambridge University Press, 2008.
[16] W. B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
[17] S. Rahman, K. Sakib, An appropriate method ranking approach for localizing bugs using minimized search space, in: ENASE, 2016, pp. 303–309.
[18] S. Rahman, M. M. Rahman, K. Sakib, A statement level bug localization technique using statement dependency graph, in: ENASE, 2017, pp. 171–178.
[19] N. E. Fenton, N. Ohlsson, Quantitative analysis of faults and failures in a complex software system, IEEE Transactions on Software Engineering 26 (2000) 797–814.
[20] T. J. Ostrand, E. J. Weyuker, R. M. Bell, Predicting the location and number of faults in large software systems, IEEE Transactions on Software Engineering 31 (2005) 340–355.
[21] H. Zhang, An investigation of the relationships between lines of code and defects, in: 2009 IEEE International Conference on Software Maintenance, IEEE, 2009, pp. 274–283.