Benchmarking the vulnerability detection capabilities of software analysis tools

Elena Baninemeh1, Slinger Jansen1,2
1 Utrecht University, Utrecht, The Netherlands
2 Lappeenranta University, Lappeenranta, Finland

Abstract

Code cloning, the copying and pasting of code fragments, is common practice in software engineering. If a cloned code segment contains security vulnerabilities, those vulnerabilities may spread to the software that reuses it, potentially leading to security incidents. Code similarity is one effective approach to detecting vulnerabilities hidden in software projects. However, due to the complexity, size, and diversity of source code, current methods suffer from low accuracy and poor performance. Moreover, most existing clone detection techniques support only a limited set of programming languages. We propose to address these problems with SearchSECO, a software analysis tool that detects vulnerabilities in multiple programming languages.

Keywords

Software vulnerability, code clone detection, software security, open-source software

1. Introduction

The rapidly growing demand for software has increased the popularity of code reuse, including the reuse of existing code templates and components. Open-source software (OSS) has become one of the best ways to improve both the efficiency and the quality of development while reducing cost. However, the considerable number of vulnerabilities in OSS programs naturally leads to many software vulnerabilities caused by code cloning, which poses a severe threat to system security [1]. In fact, OSS has increased the rate of vulnerabilities because, as the name implies, the code is open and available to everyone. Most software developers copy code from other software systems and reuse it without significant modification. This type of code reuse is called code cloning [2].
Code cloning is expected to rise, especially with tools such as GitHub Copilot, which uses code templates and auto-completion features to support software engineers. Information about known vulnerabilities is published through resources such as the National Vulnerability Database (NVD) in the form of Common Vulnerabilities and Exposures (CVE) entries.

BENEVOL'22: The 21st Belgium-Netherlands Software Evolution Workshop, Mons, 12-13 September 2022
e.baninemeh@uu.nl (E. Baninemeh); slinger.jansen@uu.nl (S. Jansen)
https://www.slingerjansen.nl/ (S. Jansen)
ORCID: 0000-0002-5201-1321 (E. Baninemeh); 0000-0003-3752-2868 (S. Jansen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Existing techniques for vulnerable code clone detection fall into two categories: code similarity and functional similarity. In code similarity approaches, the target source code is compared against a set of known vulnerable code samples and determined to be vulnerable if a threshold of similarity is met. Code similarity approaches are typically classified by the type of clone they cover: type-1 (identical), type-2 (syntactically equivalent), type-3 (syntactically similar), and type-4 (semantically similar) [3]. Functional similarity approaches, on the other hand, seek to generate abstract functional patterns of code that model vulnerable behavior. However, due to the complexity of building such patterns, these techniques are typically specialized to a small class of vulnerabilities or to a particular source code project, rendering them ineffective as general-purpose vulnerable code clone detection techniques [4].
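The threshold-based comparison used by code similarity approaches can be sketched in a few lines of Python. This is only an illustration under our own assumptions and not the method of any tool cited here: the token-set representation, the Jaccard measure, the threshold value, and the names tokens, jaccard, and is_flagged are all hypothetical choices made for the example.

```python
import re


def tokens(code):
    """Split a code fragment into a set of identifiers, numbers, and symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_flagged(target, known_vulnerable, threshold=0.7):
    """Flag the target if it is similar enough to any known vulnerable sample.

    The threshold is an assumed value chosen for this example only.
    """
    target_tokens = tokens(target)
    return any(
        jaccard(target_tokens, tokens(sample)) >= threshold
        for sample in known_vulnerable
    )


# Hypothetical set of known vulnerable samples (an unbounded copy in C).
KNOWN_VULNERABLE = ["strcpy(buf, input);"]

if __name__ == "__main__":
    # Same call with a renamed buffer: similar enough to be flagged.
    print(is_flagged("strcpy(dst, input);", KNOWN_VULNERABLE))
    # Bounded copy: shares fewer tokens and stays below the threshold.
    print(is_flagged("strncpy(buf, input, sizeof(buf) - 1);", KNOWN_VULNERABLE))
```

A lower threshold catches more modified clones at the cost of more false positives, which is the trade-off that the clone-type classification above makes explicit.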
In this work, we introduce SearchSECO, a code-similarity technique capable of identifying modified vulnerable code clones while remaining generic to type-1 and type-2 clones and supporting multiple languages. Additionally, we built a database by mining vulnerable and patched source code from GitHub. In this paper, we present its two main processes: vulnerability collection and vulnerability detection.

2. Research Approach

SearchSECO is a large database of methods from the top-rated projects (by "stars") on GitHub. SearchSECO clones a git project, extracts a number of versions, and extracts the files and authors from those versions. For each method, the abstract syntax tree is extracted and a representation of this abstract syntax tree is hashed. SearchSECO currently parses Java, JavaScript, C/C++, and Python. SearchSECO is itself a project on GitHub and can be found at https://github.com/SecureSECO/SearchSECOController. Furthermore, the database can be accessed through a portal: https://secureseco.science.uu.nl/portal/. In this portal, visitors can enter their own project link and email address. After the project has been processed and matched, the visitor receives a report of the matches in the SearchSECO database and can determine whether their project contains any potentially vulnerable fragments. Currently (June 14th, 2022), the database contains 16 million unique methods from approximately 100 thousand projects on GitHub. Until recently the database only had matching capability, but because it is the metadata that makes the method database interesting, we have started matching it with vulnerability data taken from vulnerability databases and directly from open-source projects.

In this paper, our research objective is to benchmark SearchSECO's performance against other tools. It must be noted that software engineers using SearchSECO are not time constrained.
However, they do care about accurate feedback on their projects, and therefore we benchmark SearchSECO against other available state-of-the-art tools in terms of precision rather than speed. The main research question in this study is as follows:

(MRQ) How effective is the vulnerability detection feature of SearchSECO compared to other vulnerability detection approaches?

We formulated the following sub-research questions (SRQs) to address the MRQ:

SRQ1: Is detection reporting in vulnerability detection approaches sufficient for benchmarking?
SRQ2: How can SearchSECO be compared accurately to other vulnerability detection approaches?
SRQ3: How scalable is SearchSECO in detecting vulnerabilities compared to state-of-the-art approaches?
SRQ4: How effective is SearchSECO in detecting the latest vulnerabilities published by CVE and GitHub?

We employed a literature study using the snowballing method, combined with document analysis and a replication study, and performed an experiment to compare SearchSECO with other vulnerability detection tools. Table 1 shows the mapping between the research questions and the research methods.

Table 1: An overview of the four research methods used in this study with their corresponding research questions (RQ).

Research method      Research questions addressed
Literature study     MRQ, RQ1, RQ2, RQ3, RQ4
Document analysis    2 of the 5 questions
Replication study    4 of the 5 questions
Benchmark study      3 of the 5 questions

The preferred literature study method is snowballing. Wohlin [5] presents several guidelines for this method, which we will use during the literature study. Document analysis is a systematic procedure for reviewing or evaluating documents, including manuscripts and illustrations, that have been published without a researcher's intervention [6]. Document analysis is one of the analytical methods in qualitative research that requires data investigation and interpretation to elicit meaning, gain understanding, and develop empirical knowledge [7].
As in many other empirical disciplines, replication is seen as an essential means of assessing the reliability of, and confidence in, empirical findings. A key component of experimentation is replication. To consolidate a body of knowledge built upon experimental results, those results must be extensively verified. This verification is carried out by replicating an experiment to check whether its results can be reproduced [8]. We aim to follow the ACM SIGSOFT Empirical Standards for the benchmarking procedure (https://github.com/acmsigsoft/EmpiricalStandards/blob/master/docs/Benchmarking.md).

3. SearchSECO

In this section, we describe the design and implementation of our proposed approach, SearchSECO, for method-level vulnerability detection. We aim to accurately discover the code clones between a set of vulnerable code fragments and a target program using code clone detection. In SearchSECO, we focus on detecting vulnerable code fragments accurately, scaling to a large code base, and supporting multiple languages.

3.1. Vulnerabilities collection process

We collected vulnerability data for each project from two sources: NVD and public Git repositories on GitHub. NVD is a vulnerability database built upon and fully synchronized with the CVE list. In addition to a large amount of vulnerability data, it also provides enhanced information (e.g., vulnerability type, references to solutions) for each record.
GitHub provides a larger quantity and wider variety of code, which helps us supplement the vulnerability dataset. We built the SearchSECO vulnerability database in the following steps:

• We crawled all of the vulnerability entries in the CVE database and NVD, including the descriptive information for each vulnerability. Specifically, we parse the GitHub web pages to extract CVE details such as vulnerable lines and commit hashes.
• The NVD receives its vulnerability listings directly from the CVE list. Vulnerabilities that are not reported to CVE are therefore not published in the NVD. Hence, besides extracting vulnerabilities from NVD, we also have to extract vulnerabilities from GitHub (see Figure 1). First, SearchSECO clones the repository using the "git clone repository" command. Then, it searches each repository for commits that mention CVEs using the command git log --grep="CVE-20".

This process of collecting vulnerable code and extracting the required data is fully automated.

Figure 1: This figure shows the vulnerability collection to store in SearchSECO.

3.2. Vulnerabilities detection process

In this section, we describe our approach to vulnerability detection, which is a scalable approach to code clone detection. The types of code clones have to be clarified in order to explain the process. There are four types of code clones:

Type-1 (exact clones): Identical code fragments, which may have some variations in whitespace, layout, and comments.
Type-2 (renamed clones): Syntactically equivalent fragments with some variations in identifiers, literals, types, whitespace, layout, and comments.
Type-3 (restructured clones): Syntactically similar code with inserted, deleted, or updated statements.
Type-4 (semantic clones): Semantically equivalent but syntactically different code.
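The difference between these clone types can be made concrete with a small Python sketch that reduces a fragment to a fingerprint. It is a simplified illustration under our own assumptions and not SearchSECO's implementation: Python's tokenize module stands in for a real parser, the placeholder symbols ID and LIT and the function names normalize and fingerprint are hypothetical, and the sample fragments are invented for the example.

```python
import hashlib
import io
import keyword
import tokenize


def normalize(source, abstract=False):
    """Return a whitespace- and comment-insensitive token string.

    With abstract=True, identifiers and literals are also replaced by the
    placeholder symbols ID and LIT (hypothetical names for this sketch).
    """
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never affect the result
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep block structure, but not the exact amount of whitespace.
            parts.append(tokenize.tok_name[tok.type])
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            parts.append("ID")
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            parts.append("LIT")
        else:
            parts.append(tok.string)
    return " ".join(parts)


def fingerprint(source, abstract=False):
    """MD5 digest of the normalized fragment."""
    return hashlib.md5(normalize(source, abstract).encode("utf-8")).hexdigest()


ORIGINAL = """
def total(prices, rate):
    result = 0
    for p in prices:
        result += p * rate
    return result
"""

# Type-1: only comments and layout differ.
TYPE1 = """
def total(prices, rate):
    # sum the taxed prices
    result = 0
    for p in prices:
            result += p * rate   # different indentation
    return result
"""

# Type-2: identifiers and a literal differ.
TYPE2 = """
def compute(items, factor):
    acc = 1
    for x in items:
        acc += x * factor
    return acc
"""

# Type-3: a statement was inserted.
TYPE3 = """
def total(prices, rate):
    result = 0
    for p in prices:
        if p > 0:
            result += p * rate
    return result
"""

if __name__ == "__main__":
    print("Type-1 match:", fingerprint(ORIGINAL) == fingerprint(TYPE1))
    print("Type-2 match:", fingerprint(ORIGINAL, True) == fingerprint(TYPE2, True))
    print("Type-3 match:", fingerprint(ORIGINAL, True) == fingerprint(TYPE3, True))
```

In this sketch, exact hash comparison is enough for Type-1 and, after abstraction, for Type-2 clones, whereas the inserted statement in the Type-3 variant changes the fingerprint.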
We designed SearchSECO to detect Type-1 and Type-2 clones because our goal is to reduce false positives and false negatives and to increase scalability.

3.2.1. Code clone detection

Figure 2 shows all the steps of the SearchSECO process. SearchSECO preprocesses a target program and generates hash values with the MD5 algorithm. It then detects code clones by comparing hashes. By generating hash values for known vulnerable functions and comparing these stored hashes with the hashes generated from the target program, SearchSECO reports vulnerable code clones in the target program.

3.2.2. Preprocessing

The following preprocessing steps are performed when SearchSECO receives a code fragment or project in which to detect vulnerabilities.

1. Method extraction: The process starts by retrieving the functions from a given program using a robust parser.
2. Abstraction and normalization: All formal parameters, local variables, data types, and function calls that appear in the body of a function are replaced with symbols: FPARAM at level 1, LVAR at level 2, DTYPE at level 3, and FUNCCALL at level 4.
3. Generating the hash value: In this step, the hash value is generated with the MD5 algorithm.

4. Related work

Many approaches have been proposed to detect the vulnerabilities introduced by code clones. Kim et al. [9] proposed VUDDY, a highly efficient method for detecting vulnerable code clones, which leverages function-level granularity and a length-filtering technique that reduces the number of signature comparisons. However, it does not handle common code modifications such as word order modification and redundant code insertion, which limits it in practice. VFDETECT [10] is an approach based on a fingerprint model to detect vulnerable code. VulPecker [11] developed a technique that identifies a vulnerability-to-similarity-algorithm mapping.
This way, each algorithm can be applied to the vulnerabilities to which it is best suited. However, this approach is still limited by the underlying accuracy of the similarity algorithms and only achieves a recall of 60%, meaning many vulnerable clones are left undetected. VCIPR [12] is a scalable system for vulnerability detection in unpatched source code that uses a fast, token-based approach to detect vulnerabilities at function-level granularity. Akram and Luo [13] developed a quantitative vulnerability detection technique based on code clone detection at the source code level. They retrieved vulnerable source code files from various web source code repositories by tracking the patch files of vulnerabilities. The vulnerable source code files are then retrieved using Common Vulnerabilities and Exposures (CVE) numbers. ReDeBug [14] is a technique that uses the information in both the vulnerable code and the patched code. ReDeBug performs sequence-based matching utilizing the diff files associated with a particular vulnerability. A diff file contains the lines that were explicitly modified during the transition of the code from vulnerable to patched, as well as some context code within close textual proximity. This allows ReDeBug to detect some type-3 clones; however, if the code modification is near the location of the lines modified during the patch process, this technique will fail to detect the vulnerable clone.

Figure 2: This figure shows the vulnerability collection to store in SearchSECO.

5. Conclusion

In this study, we propose a vulnerability detection tool and plan to benchmark it against different approaches and methodologies from state-of-the-art research on vulnerability detection. We aim to design our approach for scalable and accurate detection of vulnerable code clones.
Moreover, we aim to provide an automated way to collect vulnerable functions and to implement SearchSECO so as to demonstrate its efficacy and effectiveness in detecting numerous vulnerable clones in a large code base with unprecedented scalability and accuracy.

References

[1] J. Guo, H. Li, Z. Wang, L. Zhang, C. Wang, A novel vulnerable code clone detector based on context enhancement and patch validation, Wireless Communications and Mobile Computing 2022 (2022).
[2] M. Mondal, C. K. Roy, K. A. Schneider, Identifying code clones having high possibilities of containing bugs, in: 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC), IEEE, 2017, pp. 99–109.
[3] C. K. Roy, J. R. Cordy, R. Koschke, Comparison and evaluation of code clone detection techniques and tools: A qualitative approach, Science of computer programming 74 (2009) 470–495.
[4] F. P. Viertel, W. Brunotte, D. Strüber, K. Schneider, Detecting security vulnerabilities using clone detection and community knowledge, in: SEKE, 2019, pp. 245–324.
[5] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th international conference on evaluation and assessment in software engineering, 2014, pp. 1–10.
[6] G. A. Bowen, Document analysis as a qualitative research method, Qualitative research journal (2009).
[7] J. Corbin, A. Strauss, Basics of qualitative research: Techniques and procedures for developing grounded theory, Sage publications, 2014.
[8] N. Juristo, O. S. Gómez, Replication of software engineering experiments, in: Empirical software engineering and verification, Springer, 2010, pp. 60–88.
[9] S. Kim, H. Lee, Software systems at risk: An empirical study of cloned vulnerabilities in practice, Computers & Security 77 (2018) 720–736.
[10] Z. Liu, Q. Wei, Y. Cao, Vfdetect: A vulnerable code clone detection system based on vulnerability fingerprint, in: 2017 IEEE 3rd Information Technology and Mechatronics Engineering Conference (ITOEC), IEEE, 2017, pp. 548–553.
[11] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, J. Hu, Vulpecker: an automated vulnerability detection system based on code similarity analysis, in: Proceedings of the 32nd Annual Conference on Computer Security Applications, 2016, pp. 201–213.
[12] J. Akram, L. Qi, P. Luo, Vcipr: vulnerable code is identifiable when a patch is released (hacker's perspective), in: 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), IEEE, 2019, pp. 402–413.
[13] J. Akram, P. Luo, Sqvdt: A scalable quantitative vulnerability detection technique for source code security assessment, Software: Practice and Experience 51 (2021) 294–318.
[14] J. Jang, A. Agrawal, D. Brumley, Redebug: finding unpatched code clones in entire os distributions, in: 2012 IEEE Symposium on Security and Privacy, IEEE, 2012, pp. 48–62.