=Paper=
{{Paper
|id=Vol-3095/paper4
|storemode=property
|title=An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection
|pdfUrl=https://ceur-ws.org/Vol-3095/paper4.pdf
|volume=Vol-3095
|authors=Daniel Grahn,Junjie Zhang
}}
==An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection==
Daniel Grahn∗ and Junjie Zhang, Wright State University

===Abstract===

As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore 7 C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files – a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at this level.

We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some do so to a greater degree. This includes divergence in file lengths and token usage frequency. Additionally, none of the datasets contain the entirety of the C/C++ vocabulary. These missing tokens account for up to 11% of all token usage. Second, we find all the datasets contain duplication, with some containing a significant amount. In the Juliet dataset, we describe augmentations of test cases making the dataset susceptible to data leakage. This augmentation occurs with such frequency that a random 80/20 split has roughly 58% overlap of the test data with the training data. Finally, we collect and process a large dataset of C code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.

===1 Introduction===

As long as software has been written, it has contained vulnerabilities. Many vulnerabilities are introduced and removed without exploitation. Some are far more consequential. To avoid exploitation and the accompanying costs and embarrassment of an exposed vulnerability, organizations have adopted a range of risk management measures.
You can’t fix a vulnerability you don’t know exists. Thus, a cornerstone of any cybersecurity risk management program is vulnerability detection. Vulnerability detection has traditionally been a hands-on and resource-intensive process. Manual code reviews divert programmers from fixing bugs, adding new features, and performing other tasks. Source code analysis is prone to false positives that still require manual review. More advanced solutions, such as fuzzing or dynamic analysis, can be difficult to set up and similarly resource-intensive. Frequently, third-party software is accepted as safe without detailed review.

As with many fields, machine learning offers the tantalizing hope of a future where vulnerability detection is performed, in large part, by intelligent artificial agents. While not yet realized, this promise has motivated large amounts of research in the area. Much of this research is based on a small collection of vulnerability detection datasets within the C/C++ language family. Despite the critical nature of data in machine learning, little attention has been given to the datasets themselves.

∗ dan.grahn@wright.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Proceedings of the Conference on Applied Machine Learning for Information Security, 2021.

{| class="wikitable"
|+ Table 1: Selected C/C++ Vulnerability Datasets
! Dataset !! License !! Granularity !! Compiles? !! Cases !! # of Vulns !! Ref
|-
| Big-Vul || MIT || Functions || ✗ || 348 Projects || 3,754 || [19]
|-
| Draper VDISC || CC-BY 4.0 || Functions || ✗ || 1.27M Functions || 87,804 || [54]
|-
| IntroClass || BSD || Scripts || ✓ || 6 Assignments || 998 || [33]
|-
| Juliet 1.3 || CC0 1.0 || Scripts || ✓ || 64,099 Cases || 64,099 || [9]
|-
| ManyBugs || BSD || Projects || ✓ || 5.9M Lines || 185 || [33]
|-
| SVCP4C || GPLv3 || Files || ✗ || 2378 Files || 9,983 ||
|-
| Taxonomy... || MIT || Scripts || ✓ || 1,164 Cases || 873 || [29]
|-
| Wild C/C++ || CC-BY 4.0 || Files || ✗ || 12.1M Files || Unknown ||
|}

✓ – Yes. ✗ – No.
Vulnerability detection datasets are quite different from other types of machine learning datasets because they require cybersecurity experts to provide labels. Thus, datasets cannot be easily crowd-sourced using tools such as Mechanical Turk and are far more expensive to produce. Many dataset producers have found ways to avoid this problem, but their methods run the risk of introducing biases into the data. These biases may result in a model that fails to generalize. If the datasets portray a limited view of how C/C++ code is written, models trained on them may not capture the full diversity of the language. For example, a natural-language model trained only on the collected works of Dr. Seuss would not be expected to perform well on Shakespeare, Twitter, or any number of other sources. It is these biases and any additional shortcomings that we seek to uncover.

In this paper, we explore 7 vulnerability datasets in the C/C++ language family. These datasets were selected based on their usage and to provide a variety of perspectives on machine learning-assisted vulnerability detection. The datasets can be categorized along two dimensions. The first is granularity, or the level at which the information is sampled: functions, files, scripts, and projects. Function-level datasets contain only the signatures and contents of functions. File-level datasets contain the contents of a single file. Unless a file happens to be independent, it is typically not compilable. Scripts are single- or multi-file programs with a single purpose, such as demonstrating a vulnerability. Projects contain the entirety of an application derived from a publicly accessible repository. The second dimension is whether the contents are compilable. Functions and files are typically not compilable, while scripts and projects are.

Our paper makes three contributions. First, we analyze the representativeness of each of the datasets.
We find that datasets drawn from existing code-bases are more representative than hand-crafted datasets. Second, we analyze the duplicativeness of the datasets. We find that all of them contain duplication, with some containing a significant amount. Finally, we collect and process a large dataset of C code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.

===2 Datasets===

====2.1 Big-Vul====

Big-Vul, published in Fan et al. [19], is available as a repository of scripts and CSV files [19]. The dataset was collected by crawling the Common Vulnerabilities and Exposures (CVE) database [46] and linking the CVEs with open-source GitHub projects. Using commit information, the authors extracted code changes related to the CVE. The resulting CSV files contain extracted functions before and after the commit that fixed the vulnerability. The scripts are included for reproducibility of this process, but we were unable to get them to execute properly. Thankfully, a 10GB CSV containing all of the processed data is available for download.

{| class="wikitable"
|+ Table 2: Work Referencing Each Dataset
! Dataset !! Used By
|-
| Big-Vul || [12, 40, 45]
|-
| Draper VDISC || [7, 65, 69, 6, 63]
|-
| IntroClass || [76, 74, 48, 31, 3, 28, 32, 27, 60, 24]
|-
| Juliet 1.3 || [37, 35, 70, 17, 58, 14, 16, 22, 34, 1, 78, 2, 26, 72, 30, 61, 62, 56, 13, 41, 25, 47, 42, 20, 75, 36, 66, 4, 38, 73, 63, 5, 57, 15, 53, 67]
|-
| ManyBugs || [43, 55, 68, 39, 74, 10, 44, 52, 49, 21, 71]
|-
| SVCP4C || [51]
|-
| Taxonomy... || [18, 11]
|}

{| class="wikitable"
|+ Table 3: Dataset Metrics. A comparison of the datasets to Wild C. Hist. Dist. contains the energy distance between the distributions of file length for each dataset and Wild C; lower values indicate closer distributions. Count % contains the percentage of tokens/bigrams which are missing, by count. Use % contains the percentage of tokens/bigrams which are missing, weighted by their usage in Wild C.
! rowspan="2" | Dataset !! rowspan="2" | Hist. Dist. !! colspan="3" | Missing Tokens !! colspan="3" | Missing Bigrams !! colspan="2" | Token % Diff !! colspan="2" | Bigram % Diff
|-
! Count !! % !! Use % !! Count !! % !! Use % !! Median !! Mean !! Median !! Mean
|-
| Big-Vul || 0.958 || 8 || 6.1 || 0.002 || 6,063 || 74.0 || 0.055 || 34.5 || 48.1 || 56.3 || 244.1
|-
| Draper VDISC || 0.878 || 2 || 1.5 || 0.001 || 4,788 || 58.4 || 0.054 || 41.5 || 49.0 || 68.2 || 226.3
|-
| IntroClass || 1.115 || 92 || 70.8 || 11.547 || 8,066 || 98.4 || 42.051 || 81.4 || 316.1 || 94.7 || 734.9
|-
| Juliet || 0.651 || 43 || 33.1 || 0.317 || 7,637 || 93.2 || 4.651 || 82.9 || 612.2 || 92.8 || 1,341.1
|-
| ManyBugs || 0.219 || 11 || 8.5 || 0.018 || 5,408 || 66.0 || 0.106 || 50.0 || 86.0 || 89.6 || 2,363.6
|-
| SVCP4C || 0.459 || 23 || 17.7 || 0.061 || 6,654 || 81.2 || 0.320 || 40.9 || 59.7 || 72.3 || 498.0
|-
| Taxonomy... || 1.198 || 74 || 56.9 || 4.954 || 7,989 || 97.5 || 19.326 || 93.9 || 432.8 || 92.4 || 635.6
|}

Wild C contains 130 total tokens and 8,274 total bigrams.

====2.2 SonarCloud Vulnerable Code Prospector for C (SVCP4C)====

Raducu et al. [51] take a different approach to collecting vulnerable code. Instead of relying on the existing datasets provided by NIST or the CVE database, SVCP4C draws from open-source projects whose code is processed using the SonarCloud vulnerability scanner [59]. This is performed directly through the SonarCloud API, which allows public access to scrape-friendly vulnerability data. SVCP4C is technically a tool for collecting data. However, the authors do provide a dataset in the paper, and this is the data that we review. All files in the dataset contain vulnerabilities and comments detailing the vulnerable lines.

====2.3 Juliet 1.3====

Juliet is the largest hand-created¹ C/C++ vulnerability dataset with entire programs [9]. The dataset is available in C/C++, C#, and Java variants. Each has a large number of test cases, but C/C++ is the largest with 64,099.
The test cases are divided by CWE, although some cases contain multiple CWEs. Each test case can be compiled into a separate program or combined into a monolithic binary. Compilation options allow the test cases to be compiled into safe or vulnerable versions with minimal code changes. Some test cases are only compilable on Windows machines, but the majority are cross-platform.

In a brief survey, we found at least 23 papers that used the Juliet dataset directly. Additionally, Juliet is a major component of the National Institute of Standards and Technology (NIST) Software Assurance Reference Dataset (SARD) [8]. When large datasets are drawn from the SARD, they are likely relying upon Juliet in some way. Because of this prevalence, Juliet deserves an extra level of scrutiny.

====2.4 ManyBugs & IntroClass====

ManyBugs and IntroClass are a pair of datasets presented by Le Goues et al. [33]. These datasets are designed to be a benchmark for automated repair methods. ManyBugs contains 185 defects across 9 open-source programs. These defects were collected from version control. In total, it has 5.9 million lines of code and 10,000+ test cases. IntroClass consists of 998 defects from student submissions to six programming assignments. It includes input/output test cases for each programming assignment.

====2.5 A Taxonomy of Buffer Overflows====

A Taxonomy of Buffer Overflows is unique because it attempts to create a structured taxonomy of buffer overflows based on 22 attributes. The result is 291 different buffer overflows. For each type, three flawed examples (overflow just outside, somewhat outside, and far outside) and a non-vulnerable version are included. This results in a total of 873 vulnerabilities. Due to the diversity of vulnerabilities in this dataset, it provides a distinctive opportunity for testing a vulnerability detection method against a full range of possibilities. Taxonomy is included as part of the NIST SARD.
====2.6 Draper Vulnerability Detection in Source Code (VDISC)====

The Draper VDISC dataset was produced as part of the Defense Advanced Research Projects Agency’s (DARPA) Mining and Understanding Software Enclaves (MUSE) project [54]. To build the dataset, the authors collected code from the Debian Linux distribution and public Git repositories from GitHub. They split the code into functions and, using a custom minimal lexer, removed duplicate functions. The strict process used by the authors for removing duplicates resulted in only 10.8% of the collected functions being included in the dataset. The authors labeled the remaining functions by using three open-source static source-code analyzers: Clang, Cppcheck, and Flawfinder. Because each of these tools has disparate outputs, the authors mapped the results into their corresponding CWEs. Despite including code from the Juliet dataset in their internal dataset, the authors do not include it in the publicly released version.

¹ Juliet is generated using custom software, but the test cases have been created by hand. The software is not publicly available.

===3 Wild C===

For this paper, we want to compare the datasets to realistic C/C++ source code. It is beyond the scope of this (or any) paper to collect all C/C++ source code. Instead, we created a dataset named Wild C from GitHub repositories [23]. To collect these repositories, we made use of GitHub’s public search API using a simple scraping algorithm.

At the time of writing, GitHub had limitations on their API that made collection challenging. First, the search endpoint we used is rate-limited to 5,000 requests per hour. This limits the queries to one every 0.72 seconds on average. Because cloning repositories takes some time, we did not encounter this problem in practice.
However, a simple solution would be to perform rate-limiting on the client side. Second, GitHub will only return 1,000 results per search query. This means that our search queries must be limited to under 1,000 results. We accomplished this by searching for repositories with less than or equal to a certain number of stars and sorting the results by the number of stars (descending). We then iterate over the search results until we encounter a page ending with a repository starred fewer times than our current search maximum. Instead of requesting another page, we change the search to lower the maximum number of stars.

Using this method, we were able to collect 36,568 repositories with at least 10 stars each. While there are many repositories with fewer than 10 stars, we found that they contained far less code and were likely to have a "spike" of commits followed by little-to-no activity. This indicates that most of these repositories are likely to be one-off projects, programming assignments, and similar. The collected repositories contain 9,068,351 C and 3,098,624 C++ files for a total of 12,166,975 source code files. In addition to using 10 stars as a cutoff to prevent diminishing returns, we also use it as a soft metric to assess approval by outside reviewers. The code collected is effectively a sample of C/C++ that is present in public repositories with some degree of community acceptance.

There are a few areas where Wild C may not be entirely representative. First, it may favor code that complies with community standards, which are strongly encouraged on GitHub. Second, it may favor less buggy code, as many of the projects may have active communities. Finally, code in private repositories may differ from public repositories due to the code’s functionality being necessarily private or due to the intrinsic privacy of the code. No one sees the hidden bad practices.
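Returning to the collection procedure: the star-bucketed pagination and client-side rate limiting described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' released scraper. The endpoint and query qualifiers follow GitHub's REST search API, but the function names (`crawl`, `next_star_ceiling`) and the `language:C` qualifier are our own assumptions.

```python
import time

SEARCH_URL = "https://api.github.com/search/repositories"

def star_query(max_stars, min_stars=10):
    """GitHub search qualifier for repositories within a star range."""
    return f"language:C stars:{min_stars}..{max_stars}"

def next_star_ceiling(page_star_counts, current_max):
    """Given the star counts of one result page (sorted descending), pick the
    next search ceiling: once a page ends below the current maximum, every
    repository with more stars has already been seen."""
    last = page_star_counts[-1]
    return last if last < current_max else current_max

def crawl(session, max_stars=100_000, min_stars=10, delay=0.72):
    """Collect repository names by repeatedly lowering the star ceiling,
    keeping each query under GitHub's 1,000-result cap. `session` is a
    requests.Session-like object; `delay` stays under 5,000 requests/hour."""
    seen, ceiling = set(), max_stars
    while ceiling >= min_stars:
        resp = session.get(SEARCH_URL, params={
            "q": star_query(ceiling, min_stars),
            "sort": "stars", "order": "desc", "per_page": 100,
        })
        items = resp.json().get("items", [])
        if not items:
            break
        seen.update(repo["full_name"] for repo in items)
        stars = [repo["stargazers_count"] for repo in items]
        new_ceiling = next_star_ceiling(stars, ceiling)
        # If the page never dropped below the ceiling, force progress anyway.
        ceiling = new_ceiling if new_ceiling < ceiling else ceiling - 1
        time.sleep(delay)  # client-side rate limiting
    return seen
```

The key design point is that lowering the ceiling instead of paginating keeps every individual query under the 1,000-result cap while still enumerating repositories in descending star order.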
Despite these potential areas of divergence, we believe that the collection methods and the size of the dataset indicate it is sufficiently close to a truly representative sample of C/C++ for our purposes.

We next extracted tokens from each file. For ease of use, the C/C++ source and tokens were packaged into a collection of parquet files. While the dataset is licensed as CC-BY 4.0, the individual source files are licensed under their original repositories. We have released this dataset for public consumption. To the best of our knowledge, it is the first public dataset of C/C++ code and paired tokens of this size.

===4 Preprocessing===

Comparing the datasets requires that they be in a consistent format. This is a difficult task since they are not available in a standard format. Some datasets contain whole software projects, others single files, others individual functions. Ideally, we would be able to compile all of the files. With compiled files, we could compare their source, assembly, and binary formats. However, only a few of the datasets are compilable. Thus, we limit our comparison to source code.

As the first step, we downloaded the datasets and extracted all code into C-files. This worked best when the datasets already contained whole projects or whole files. When the datasets contained functions, we extracted each function into a separate file. While this results in invalid C-files, it allows us to trace later steps directly to the function.
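The per-token extraction can be illustrated with a toy lexer. The sketch below stands in for the ANTLR C grammar used in the tokenization step: the regex token classes are a deliberately tiny approximation, and the emitted rows carry the positional columns later listed in Table 4 (the ANTLR-specific `channel` column is omitted). All names here are ours, not the authors' pipeline.

```python
import csv, io, re, uuid
from bisect import bisect_right

# Toy stand-in for the ANTLR C lexer: a few token classes are enough to
# illustrate the CSV layout. (A real run uses the full C grammar.)
TOKEN_SPEC = [
    ("Keyword",    r"\b(?:int|char|return|if|else|while|for)\b"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Constant",   r"\d+"),
    ("Punctuator", r"[{}()\[\];,+\-*/=<>!&|%]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize_to_rows(source, dataset, file_name):
    """Yield one row per token with positional metadata (cf. Table 4)."""
    line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
    for num, m in enumerate(LEXER.finditer(source)):
        start = m.start()
        line_idx = bisect_right(line_starts, start) - 1
        yield {
            "uuid": str(uuid.uuid4()),
            "dataset": dataset,
            "file_name": file_name,
            "token_num": num,
            "char_start": start,
            "char_end": m.end(),
            "token_text": m.group(),
            "token_type": m.lastgroup,
            "line": line_idx + 1,
            "line_char": start - line_starts[line_idx],
        }

rows = list(tokenize_to_rows("int main() {\n  return 0;\n}\n", "demo", "main.c"))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
```

Even for this three-line file, the rows preserve enough position information (`char_start`, `line`, `line_char`) to trace any later analysis back to the exact token in the source.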
{| class="wikitable"
|+ Table 4: Token CSV Columns
! Column !! Description
|-
| uuid || Generated UUID for referencing purposes
|-
| dataset || Source dataset
|-
| file_name || Source filename
|-
| token_num || Index of the token in the file
|-
| char_start || Character at which the token starts, relative to file start
|-
| char_end || Character at which the token ends, relative to file start
|-
| token_text || Raw text of the token
|-
| token_type || Type of token as specified by the grammar
|-
| channel || ANTLR internal for handling different categories of input
|-
| line || Line on which the token starts
|-
| line_char || Character on which the token starts, relative to line start
|}

Figure 1: Histogram of tokens per file broken down by dataset using a kernel density estimate. Tokens are presented on a log scale.

Figure 2: Kernel density estimates for each dataset’s bigram usage frequency. Missing bigrams are excluded.

With all the code in C-files, we tokenize the source using ANother Tool for Language Recognition, also known as ANTLR [50]. ANTLR is a generic parser generator that has an existing context-free grammar for C. Each of the C-files was converted to tokens in a CSV format. The CSV files contain the columns listed in Table 4. These CSV files are the basis for all of the comparisons.

===5 Results===

====5.1 Number of Tokens Per File====

For a machine learning model to generalize, the distribution of data should remain consistent from training to inference. The further the distance between these distributions, the less likely the model is to generalize. Our first comparison is the number of tokens per file, aggregated for each dataset. In other words, this allows us to compare the file lengths across different datasets. Figure 1 plots the kernel density estimate for each dataset.
The x-axis is the number of tokens in a given file and the y-axis is the estimated density of files that contain that number of tokens. As is evident, the vulnerability datasets are quite different from Wild C. We quantified this using the energy distance [64] between each histogram and the histogram for Wild C and present the results in Table 3 (Hist. Dist. column). While one could hope for better agreement, the results are expected.

{| class="wikitable"
|+ Table 5: Top Token Usage Outliers Compared to Wild C
! rowspan="2" | Rank !! colspan="2" | Big-Vul !! colspan="2" | Draper VDISC !! colspan="2" | IntroClass !! colspan="2" | Juliet
|-
! Type !! % Diff !! Type !! % Diff !! Type !! % Diff !! Type !! % Diff
|-
| 0 || explicit || 425.1 || register || 255.9 || % || 3200.8 || wchar_t || 34435.5
|-
| 1 || char16_t || 219.5 || this || 196.0 || AndAnd || 2505.7 || namespace || 3233.5
|-
| 2 || register || 212.1 || delete || 140.6 || / || 1203.3 || delete || 2744.1
|-
| 3 || static_cast || 179.4 || double || 111.1 || else || 645.2 || using || 2041.2
|-
| 4 || :: || 152.9 || inline || −100.0 || <= || 606.5 || Directive || 766.6
|-
| 5 || const_cast || 115.7 || constexpr || −99.4 || < || 528.3 || short || 736.5
|-
| 6 || new || 100.1 || static_assert || −99.2 || == || 450.9 || CharLiteral || 709.6
|-
| 7 || throw || −97.9 || decltype || −97.5 || += || 810.3 || do || 704.6
|-
| 8 || namespace || −97.5 || char16_t || −97.0 || >= || 327.2 || char || 635.7
|-
| 9 || operator || −97.3 || protected || −96.6 || while || 288.4 || new || 571.5
|}

{| class="wikitable"
|+ Table 5 (continued)
! rowspan="2" | Rank !! colspan="2" | ManyBugs !! colspan="2" | SVCP4C !! colspan="2" | Taxonomy... !! colspan="2" | Merged
|-
! Type !! % Diff !! Type !! % Diff !! Type !! % Diff !! Type !! % Diff
|-
| 0 || extern || 2010.3 || CharLiteral || 406.0 || do || 5711.3 || extern || 1434.1
|-
| 1 || typedef || 574.2 || register || 279.5 || char || 2373.4 || wchar_t || 926.7
|-
| 2 || wchar_t || 332.0 || char || 190.7 || <= || 2293.0 || typedef || 397.2
|-
| 3 || CharLiteral || 322.0 || AndAnd || 185.5 || CharLiteral || 2185.5 || CharLiteral || 284.5
|-
| 4 || <= || 298.5 || alignof || 162.3 || while || 1843.8 || register || 253.9
|-
| 5 || signed || 295.3 || != || 160.5 || int || 1176.4 || <= || 220.6
|-
| 6 || register || 262.2 || export || 158.1 || typedef || 1054.6 || signed || 199.8
|-
| 7 || double || 234.3 || == || 143.4 || += || 810.3 || double || 188.1
|-
| 8 || ... || 191.2 || wchar_t || 142.1 || ] || 695.0 || char || 155.4
|-
| 9 || char || 165.1 || OrOr || 135.4 || [ || 695.0 || ... || 145.6
|}
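The distance measure used above can be reproduced directly from its definition. The sketch below is a plain NumPy implementation of the energy distance between two 1-D samples, sqrt(2·E|X−Y| − E|X−X′| − E|Y−Y′|); `scipy.stats.energy_distance` computes the same quantity. The lognormal samples are only illustrative stand-ins for tokens-per-file distributions, not the paper's data.

```python
import numpy as np

def energy_distance(u, v):
    """Energy distance between two 1-D samples:
    sqrt(2 E|X-Y| - E|X-X'| - E|Y-Y'|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    a = np.abs(u[:, None] - v[None, :]).mean()  # E|X - Y|
    b = np.abs(u[:, None] - u[None, :]).mean()  # E|X - X'|
    c = np.abs(v[:, None] - v[None, :]).mean()  # E|Y - Y'|
    return float(np.sqrt(2.0 * a - b - c))

rng = np.random.default_rng(0)
wild = rng.lognormal(mean=6.0, sigma=1.2, size=1000)     # stand-in: Wild C file lengths
crafted = rng.lognormal(mean=4.5, sigma=0.3, size=1000)  # stand-in: short hand-crafted cases
near = rng.lognormal(mean=6.0, sigma=1.2, size=1000)     # drawn from the same distribution

# A sample drawn from the same distribution sits far closer to "wild" code.
d_far, d_near = energy_distance(wild, crafted), energy_distance(wild, near)
```

Because the statistic is zero only when the two distributions coincide, it gives a single scalar for "how unlike Wild C" each dataset's file-length distribution is, which is what the Hist. Dist. column reports.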
ManyBugs is collected from several large open-source projects, similar to Wild C, and has the closest agreement. Conversely, Taxonomy of Buffer Overflows contains minimalist examples of buffer overflows, which causes a spike in the KDE around 100 tokens per file and the maximum energy distance. While Juliet has a better than average distance, it lacks some of the longer files found in Wild C.

====5.2 Token Usage by Dataset====

Next, we compare the total usage of each token by dataset to its usage in Wild C. This measures how frequently each dataset uses the tokens. One of the most important observations is that each dataset is missing some of the token types. While Draper VDISC misses only two uncommon tokens (alignas, noexcept), Taxonomy of Buffer Overflows misses tokens such as Not, /, and this. Juliet misses tokens such as +=, continue, and enum. As shown in Table 3, datasets such as Juliet and SVCP4C have a significant number of missing tokens. But these tokens are not frequently used in Wild C and thus do not cause a large increase in the Use %. To account for the disparate lengths of the files, the token frequencies were normalized by the most-frequent token. This was Identifier for all of the datasets.

Usage of tokens is subject to extreme outliers, as shown in Table 5. IntroClass uses the % token 3,200% more than Wild C, ManyBugs uses extern 2,010% more, and Taxonomy of Buffer Overflows uses do 5,711% more. However, the furthest outlier belongs to Juliet, which uses wchar_t an astounding 34,435% more than Wild C. The wchar_t data type is found in 29,264 test cases. A review of Juliet indicates that the dramatic increase is likely due to how Juliet creates test cases. Juliet has many test cases that are near-identical with slight tweaks to their relevant data types.
This is explored further in Section 5.4.

====5.3 Bigram Usage by Dataset====

Extending the analysis of token types, we next compare the frequency of usage for bigrams of tokens. Bigrams are commonly used in natural language processing to provide context that individual tokens lack. We continue to normalize by the most frequent bigram per dataset. An upper bound on the total number of bigrams, derived from the 130 tokens, is 16,900. Because many of those bigrams would be invalid in C/C++, we do not have an exact total for the number of possible bigrams. In our datasets, we observe 8,195 unique bigrams. These results are shown in Table 3.

The number of bigrams present in Wild C that are missing in the datasets is far larger than the number of tokens. Draper VDISC, the dataset with the most bigrams, contains only 42.6% of the bigrams in Wild C. However, the missing bigrams still represent only 0.054% of bigram usage. Two datasets stand out when compared to each other. SVCP4C is missing 81.2% of the bigrams, but these are only used 0.320% of the time. Juliet slightly increases the number of missing bigrams to 93.2% but drastically increases the missing usage percentage to 4.651%.

Figure 2 shows the kernel density estimate for the bigram usage of each dataset. From this perspective, we can see a strong dividing line between collected and generated datasets. Juliet, IntroClass, and Taxonomy of Buffer Overflows are all created specifically for vulnerability detection or bug fixing. They have no less than 4.6% of bigram usage missing and are separated into a cluster of three furthest away from the Wild C distribution. ManyBugs, SVCP4C, Draper VDISC, and Big-Vul all represent source code drawn from open-source codebases. Each is missing less than 0.5% of bigram usage and is closer to the overall distribution of Wild C. Notably, the proximity to Wild C appears to be correlated with the size of the dataset. Draper VDISC contains 1,274,466 files and is most similar to Wild C.
It is followed by ManyBugs with 223,052 files, Big-Vul with 142,937 files, and SVCP4C with 11,376 files, respectively.

====5.4 Juliet Data Leakage Analysis====

The augmentation of test cases in Juliet has implications for using it, or the SARD, of which Juliet is a significant subset, as a source for training and test data. Randomly splitting the Juliet dataset, regardless of whether it is stratified by CWE, will introduce data leakage between training and test sets. Consider the task of detecting faces in an image. If the dataset was augmented by changing the hair or eye color of faces, splitting the dataset randomly would cause near-duplicate images to be placed in the test and train datasets. Data leakage of this type could lead to significantly inflated test performance, a failure to generalize, and more.

Figure 3: Distribution of augmentations by number of test cases and number of files. Augmentations made pre-train/test split are likely to produce data leakage.

Figure 4: Test and train files with matches in the opposite set. This data leakage is likely to cause overfitting if a model is trained on the Juliet dataset without mitigation.

To evaluate the extent of this potential data leakage, we first identified the test groups. Figure 3 shows the augmentation which was used to build Juliet. While there are 100,883 files in Juliet, there are approximately 61,000 unique test groups. On average, each test group has 1.64 augmentations (ranging between 1 and 5). The majority of files exist in test groups that contain two files. With these test groups identified, we performed 500 random splits of the Juliet files using a standard 80/20 ratio. For each of these splits, we determined how many files from the test set had
augmentations that existed in the training set, and vice versa. Figure 4 shows the distribution of these numbers. Without accounting for the augmentation, splitting the Juliet files results in a mean of 58.3% overlap of the test split with the train split and 22.1% of the train split with the test split.

====5.5 Near-Duplicate Files====

As a final analysis, we measured the number of near-duplicate files in each dataset. While the most precise method for finding near-duplicates would be based on the raw contents of the files, that may miss semantically similar files. To account for this, we again based our near-duplicate detection on the token types. We used MinHash with Locality-Sensitive Hashing from the datasketch library to find near-duplicates with a Jaccard similarity threshold of 0.99 [77].

Near-duplicates come in groups. Most of the time, it is not just two files that are identical to each other. Rather, there is a group of files that share similar attributes. Analyzing these groups is somewhat complicated. A group of 10 duplicates is far more consequential for a dataset with 100 files than for a dataset of 1 million files. However, normalizing by the total number of files in the dataset makes it difficult to determine how many files are in any given group. We attempt to balance these tensions in Figure 5. This figure shows the cumulative density functions (CDF) for the percent of files and percent of groups as the group size increases.

Figure 5: Histogram of near-duplicates based on the number of files in each size group of near-duplicates, with vertical bars indicating groups of 2 and 100. Draper VDISC and Big-Vul exhibit deduplication. Strong duplication is seen in Juliet, ManyBugs, SVCP4C, and Taxonomy of Buffer Overflows.
The x-axes contain the group size, presented on a log scale and normalized by the total number of files. Each dataset has the same axis limits and two vertical bars. The solid vertical bar indicates where a group with 2 files would be placed on the x-axis. Similarly, the dotted vertical bar indicates where a group of 100 files would be placed.

{| class="wikitable"
|+ Table 6: Near-Duplicate File Information. All datasets exhibit duplication and suffer from data leakage between random test/train splits.
! Dataset !! Unique Groups !! % of Dataset !! Test Split !! % Test w/Match !! % Train w/Match
|-
| Big-Vul || 91,300 || 63.87% || 0.10 || 45.84% || 23.01%
|-
| Draper VDISC || 931,804 || 73.12% || 0.01 || 36.10% || 5.29%
|-
| IntroClass || 28 || 45.16% || 0.20 || 70.14% || 43.27%
|-
| Juliet || 7,933 || 7.84% || 0.10 || 98.00% || 82.60%
|-
| ManyBugs || 8,197 || 3.67% || 0.10 || 99.70% || 91.19%
|-
| SVCP4C || 1,104 || 9.71% || 0.20 || 99.77% || 86.05%
|-
| Taxonomy... || 61 || 5.24% || 0.20 || 99.91% || 93.66%
|-
| Wild C || 2,343,364 || 21.97% || 0.01 || 85.16% || 36.25%
|}

Starting with Wild C, we can see that there is some amount of duplication in wild source code. This is logical for at least two reasons: (1) programmers frequently share source code that gets copied and remixed; (2) discrete sections of source code are likely to repeat tasks. The largest group of near-duplicate files in Wild C has 132,389 files, which are constant-definition files. The next largest group has only 25,751, and group sizes reduce quickly from there.

The plot of the number of files for Draper VDISC is similar to the plot of the number of groups for Wild C. This is a clear indication of the efficacy of their duplicate removal process. The difference in raw numbers can be explained by their slightly stricter approach to detecting duplicates. Of note, the plot for Big-Vul exhibits similar evidence of deduplication. This deduplication is not mentioned in the paper.
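The two measurements behind Table 6 can be sketched together: a MinHash signature per file, and the mean percentage of test files whose duplicate group also appears in the training split. The paper uses `datasketch`'s MinHash with an LSH index at a 0.99 Jaccard threshold; below is a minimal self-contained MinHash without the LSH index, so all names (`minhash`, `est_jaccard`, `mean_test_overlap`) are our own illustrative choices.

```python
import hashlib
import random

NUM_PERM = 128

def _hash(token, seed):
    # One deterministic 64-bit hash family member per permutation index.
    h = hashlib.blake2b(token.encode(), digest_size=8, salt=seed.to_bytes(2, "big"))
    return int.from_bytes(h.digest(), "big")

def minhash(tokens, num_perm=NUM_PERM):
    """MinHash signature of a file's set of token types."""
    uniq = set(tokens)
    return tuple(min(_hash(t, i) for t in uniq) for i in range(num_perm))

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def mean_test_overlap(file_groups, test_frac=0.2, splits=50, seed=0):
    """Mean % of test files whose near-duplicate group also appears in the
    train split, over repeated random splits (cf. the '% Test w/Match'
    column). `file_groups` maps file name -> group id."""
    rng = random.Random(seed)
    files = list(file_groups)
    total = 0.0
    for _ in range(splits):
        rng.shuffle(files)
        cut = int(len(files) * (1 - test_frac))
        train_groups = {file_groups[f] for f in files[:cut]}
        test = files[cut:]
        total += 100.0 * sum(file_groups[f] in train_groups for f in test) / len(test)
    return total / splits

# Synthetic illustration: 1,000 files forming 500 augmentation pairs.
# A random 80/20 split leaks most test files' twins into training.
pairs = {f"file{i}": i // 2 for i in range(1000)}
leak_pct = mean_test_overlap(pairs, splits=10)
```

This makes concrete why the `% Test w/Match` values in Table 6 climb toward 100% as group sizes grow: whenever a file's near-duplicates outnumber the split boundary, at least one of them almost always lands on the other side.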
Figure 5 also shows that nearly all of the files from SVCP4C, Juliet, ManyBugs, Taxonomy, and IntroClass have at least one duplicate. Our analysis revealed the root causes of this duplication:

* IntroClass draws from a limited number of assignments.
* Taxonomy is hand generated with intentional duplication to demonstrate good/bad code.
* ManyBugs includes multiple copies of the same applications.
* Juliet 1.3 includes augmentations for many vulnerability examples.

Because SVCP4C uses SonarCloud vulnerability detection to label its dataset, some amount of duplication is expected. The algorithms used by SonarCloud are likely to pick up common patterns within source code, whether they are true or false positives. Draper VDISC likely suffered a similar problem with duplication before their duplicate removal process.

Table 6 reverses the analysis and provides the number of "near-unique" files and that number as a percentage of the original dataset. This is analogous to the number of groups of near-duplicates for each dataset. It also provides the mean percentage of test samples with a near-duplicate in the training data and the mean percentage of training samples with a near-duplicate in the test data over 500 random splits of the indicated split size. It is important to note that while the method of calculating the metric is the same as in Section 5.4, the means of identifying duplicates are different. This leads to a different metric for Juliet than the one previously provided.

===6 Conclusion===

====6.1 Contributions====

Our work makes three significant contributions: (1) analysis of representativeness, (2) analysis of duplicativeness, and (3) availability of Wild C. Our work shows that there are significant differences between the selected datasets and wild C/C++ code.
As a result, some of the datasets may have limited usefulness for machine learning-assisted software vulnerability detection. The IntroClass and Taxonomy of Buffer Overflows datasets are not well suited for this task. They are missing over 97% of bigrams, with a significant portion of those being in common usage. Because of this, it would be possible to achieve high performance on these datasets while learning less than 68% of the C/C++ language. They also exhibit significant differences in length, token usage, and bigram usage. Based on our analysis, we assess that they contain neither enough diversity to be a thorough test set nor enough size to be a training set. Conversely, Big-Vul, SVCP4C, and ManyBugs proved to be reasonably close to Wild C. They had among the fewest missing tokens and bigrams and the lowest token and bigram usage difference. However, all three had a high degree of duplication and differing drawbacks. Big-Vul contains only 3,754 vulnerabilities and is not compilable because it only contains functions. While it appears to have been deduplicated, a significant number of near-duplicates remain. It may be a suitable test dataset if the method uses file- or function-level information, pending further analysis of deduplication. SVCP4C has only 1,104 unique groups after deduplication, a reduction of 90.29%. The collection method also means that any model trained on SVCP4C will be learning to emulate SonarCloud rather than learning the ground truth of vulnerabilities. For these two reasons, we recommend future work using SVCP4C address duplication and collection biases before usage. ManyBugs does have a slight edge over SVCP4C and Big-Vul because it contains entire projects that are compilable. Despite this, it had the most duplication, with unique groups making up only 3.67% of the original dataset. A model trained on the dataset as provided may learn to “spot the difference” between projects rather than identifying vulnerabilities.
However, ManyBugs has potential as a test dataset because it contains large, real-world projects. We recommend that the duplication be addressed before ManyBugs is used. Based on our evaluation of the metrics, Draper VDISC appears to be a promising dataset for training and testing machine learning models. It has a permissive license, contains 87,804 vulnerabilities, and has 1.27 million functions. Unfortunately, it is not compilable and cannot be used with methods that require intermediate, assembly, or binary representations. We have two outstanding concerns regarding the use of this dataset. First, the collection method is similar to that of SVCP4C. In this case, the authors used multiple static analysis tools to identify vulnerabilities and combined the results. Analysis of SVCP4C showed that the code identified by SonarCloud was very similar. Because the authors of Draper VDISC deduplicated their dataset before releasing it, we were unable to analyze the similarity of the dataset before deduplication. It is possible that using the intersection of static analysis tools led to a higher level of duplication. Additionally, any model trained on Draper VDISC is ultimately learning the tools rather than the underlying ground truth. Second, our near-duplicate detection identified 26.88% of the dataset as near-duplicates despite the authors performing deduplication. As the authors detail, their deduplication strategy was strict. A near-duplicate detection strategy may lead to a more useful dataset. While we assess that Draper VDISC has strong potential, we recommend future work address the above-mentioned concerns. A significant contribution of this paper is the discussion of test case augmentation within Juliet. While others have stated their concerns [65], we believe this is the first empirical analysis of the drawbacks of using Juliet as a training and/or test set.
Many of the papers making use of Juliet or the NIST SARD (its parent dataset) do not address this augmentation or describe steps to remove it [63, 57, 5, 15, 53, 41, 25, 47, 42, 20, 75, 36, 66, 4, 38, 73, 67]. Because of the high potential for data leakage if augmentations are not removed, we believe the evidence supports caution when reviewing metrics based on Juliet, as they may not accurately reflect performance on real-world code. For future work using Juliet as a training and/or test dataset, we recommend that appropriate measures be taken to mitigate the potential for data leakage and that those measures be clearly stated to avoid ambiguity. Finally, we are pleased to provide the Wild C dataset to the public. There are a wide variety of potential uses for this dataset. Due to its size and composition, it is suitable as a representative sample of the overall distribution of C/C++ source code. This is a critical factor for our analysis and enables the dataset to be used as a precursor for additional tasks. With some processing, it is possible to extract any file- or function-level information and build a task-specific dataset. Potential tasks include, but are not limited to: comment prediction, function name recommendation, code completion, and variable name recommendation. There is also potential for automatic bug insertion to provide an expanded vulnerability detection dataset. Wild C is available at https://github.com/mla-vd/wild-c.

6.2 Future Work

There are many areas where this work could be expanded. First, we only considered C/C++ datasets. This is the most commonly used language family for machine learning-assisted software vulnerability detection, but it is not the only one. Datasets from other languages exist and deserve similar analysis.
Second, we only compared the datasets in their entirety. Further analysis may compare the safe and vulnerable subsets of the datasets with each other and with wild C/C++ code. While this has the potential to elucidate useful differences between safe and vulnerable code, it is more likely to further highlight the problems with the existing datasets. Additionally, further work is needed to determine how much deduplication of Big-Vul, SVCP4C, and ManyBugs would reduce the number of vulnerabilities in each. Perhaps the most pressing need for future research is the creation of vulnerability-detection benchmarks. Juliet has been used for this purpose in previous papers, but our analysis brings that usage into question. Given the diversity of dataset types among those selected (e.g., files, functions, programs), it is unlikely that a single dataset could serve as a universal training dataset similar to those available for computer vision tasks. This does not mean that a benchmark is infeasible. Such a benchmark should meet at least five requirements. (1) It must be drawn from real-world code. As illustrated in Section 5.3, there is a distinct and quantifiable difference between synthetic and natural code. The barriers to labeling real-world code are likely far lower than those to bringing synthetic code into the real-world distribution of usage. (2) It must be compilable. This will enable it to support methods that work on assembly, binaries, or otherwise require compilable code. (3) It should exercise a sufficient diversity of C/C++. This will allow the dataset to avoid issues with missing tokens/bigrams and ensure that the model understands the language. Further testing is needed to determine how much diversity is necessary. (4) It should be difficult enough to act as a viable benchmark. A benchmark that is too easy will quickly outlive its usefulness.
This difficulty should not only include the depth and likelihood of a vulnerability but also the amount of code “noise” surrounding the vulnerabilities. (5) It should be deduplicated. As shown in Section 5.5, even Wild C is subject to a large degree of duplication. This duplication should be removed to ensure that the model is not biased towards the features present in duplicates. Given the limited datasets that exist today, much work is needed in the field of machine learning-assisted vulnerability detection. While the methods being applied to the datasets are promising, we assess that the limiting factor may be the datasets themselves. However, many machine learning tasks seemed out of reach just a few years ago. Machine learning researchers have performed astounding tasks in many areas, and we expect to count this as one in the future.

References

[1] E. Alikhashashneh, R. Raje, and J. Hill. Using software engineering metrics to evaluate the quality of static code analysis tools. In 2018 1st International Conference on Data Intelligence and Security (ICDIS), pages 65–72. IEEE, 2018.
[2] R. Amankwah, J. Chen, A. A. Amponsah, P. K. Kudjo, V. Ocran, and C. O. Anang. Fast bug detection algorithm for identifying potential vulnerabilities in Juliet test cases. In 2020 IEEE 8th International Conference on Smart City and Informatization (iSCI), pages 89–94, 2020. doi: 10.1109/iSCI50694.2020.00021.
[3] L. A. Amorim, M. F. Freitas, A. Dantas, E. F. de Souza, C. G. Camilo-Junior, and W. S. Martins. A new word embedding approach to evaluate potential fixes for automated program repair. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
[4] S. Arakelyan, C. Hauser, E. Kline, and A. Galstyan. Towards learning representations of binary executable files for security tasks.
arXiv:2002.03388 [cs, stat], 2020.
[5] S. Arakelyan, S. Arasteh, C. Hauser, E. Kline, and A. Galstyan. Bin2vec: learning representations of binary executable programs for security tasks. Cybersecurity, 4(1):1–14, 2021.
[6] Z. Bilgin. Code2Image: Intelligent code analysis by computer vision techniques and application to vulnerability prediction. arXiv preprint arXiv:2105.03131, 2021.
[7] Z. Bilgin, M. A. Ersoy, E. U. Soykan, E. Tomur, P. Çomak, and L. Karaçay. Vulnerability prediction from source code using machine learning. IEEE Access, 8:150672–150684, 2020.
[8] P. Black. A software assurance reference dataset: Thousands of programs with known bugs, 2018.
[9] P. E. Black. Juliet 1.3 Test Suite: Changes From 1.2. US Department of Commerce, National Institute of Standards and Technology, 2018.
[10] S. Chakraborty, Y. Li, M. Irvine, R. Saha, and B. Ray. Entropy guided spectrum based bug localization using statistical language model. arXiv preprint arXiv:1802.06947, 2018.
[11] J. Chen and X. Mao. Bodhi: Detecting buffer overflows with a game. In 2012 IEEE Sixth International Conference on Software Security and Reliability Companion, pages 168–173, 2012. doi: 10.1109/SERE-C.2012.35.
[12] Z. Chen, S. Kommrusch, and M. Monperrus. Neural transfer learning for repairing security vulnerabilities in C code. arXiv preprint arXiv:2104.08308, 2021.
[13] X. Cheng, H. Wang, J. Hua, G. Xu, and Y. Sui. DeepWukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Trans. Softw. Eng. Methodol., 30(3), Apr. 2021. ISSN 1049-331X. doi: 10.1145/3436877. URL https://doi.org/10.1145/3436877.
[14] M.-j. Choi, S. Jeong, H. Oh, and J. Choo. End-to-end prediction of buffer overruns from raw source code via neural memory networks. arXiv preprint arXiv:1703.02458, 2017.
[15] R. Croft, D. Newlands, Z. Chen, and M. A. Babar.
An empirical study of rule-based and learning-based approaches for static application security testing. arXiv preprint arXiv:2107.01921, 2021.
[16] B. H. Dang. A practical approach for ranking software warnings from multiple static code analysis reports. In 2020 SoutheastCon, volume 2, pages 1–7. IEEE, 2020.
[17] R. Demidov and A. Pechenkin. Application of siamese neural networks for fast vulnerability detection in MIPS executable code. In Proceedings of the Future Technologies Conference, pages 454–466. Springer, 2019.
[18] S. Dhumbumroong and K. Piromsopa. BoundWarden: Thread-enforced spatial memory safety through compile-time transformations. Science of Computer Programming, 198, 2020.
[19] J. Fan, Y. Li, S. Wang, and T. N. Nguyen. A C/C++ code vulnerability dataset with code changes and CVE summaries. In Proceedings of the 17th International Conference on Mining Software Repositories, pages 508–512, 2020.
[20] H. Feng, X. Fu, H. Sun, H. Wang, and Y. Zhang. Efficient vulnerability detection based on abstract syntax tree and deep learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 722–727, 2020. doi: 10.1109/INFOCOMWKSHPS50562.2020.9163061.
[21] X. Gao, B. Wang, G. J. Duck, R. Ji, Y. Xiong, and A. Roychoudhury. Beyond tests: Program vulnerability repair via crash constraint extraction. ACM Transactions on Software Engineering and Methodology (TOSEM), 30(2):1–27, 2021.
[22] Y. Gao, L. Chen, G. Shi, and F. Zhang. A comprehensive detection of memory corruption vulnerabilities for C/C++ programs.
In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 354–360. IEEE, 2018.
[23] GitHub. GitHub, 2020. URL https://github.com/.
[24] T. Helmuth. General program synthesis from examples using genetic programming with parent selection based on random lexicographic orderings of test cases. PhD thesis, University of Massachusetts Amherst, 2015.
[25] S. Jeon and H. K. Kim. AutoVAS: An automated vulnerability analysis system with a deep learning approach. Computers & Security, 106:102308, 2021.
[26] J. Kang and J. H. Park. A secure-coding and vulnerability check system based on smart-fuzzing and exploit. Neurocomputing, 256:23–34, 2017. ISSN 0925-2312. doi: 10.1016/j.neucom.2015.11.139. URL https://www.sciencedirect.com/science/article/pii/S0925231217304113.
[27] Y. Ke, K. T. Stolee, C. Le Goues, and Y. Brun. Repairing programs with semantic code search (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 295–306. IEEE, 2015.
[28] A. Koyuncu. Boosting Automated Program Repair for Adoption By Practitioners. PhD thesis, University of Luxembourg, Luxembourg, 2020.
[29] K. Kratkiewicz and R. Lippmann. A taxonomy of buffer overflows for evaluating static and dynamic software testing tools. In Proceedings of Workshop on Software Security Assurance Tools, Techniques, and Metrics, volume 500, pages 44–51, 2006.
[30] J. A. Kupsch, E. Heymann, B. Miller, and V. Basupalli. Bad and good news about using software assurance tools. Software: Practice and Experience, 47(1):143–156, 2017.
doi: 10.1002/spe.2401. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2401.
[31] X.-B. D. Le, D.-H. Chu, D. Lo, C. Le Goues, and W. Visser. S3: syntax- and semantic-guided repair synthesis via programming by examples. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 593–604, 2017.
[32] X. B. D. Le, F. Thung, D. Lo, and C. Le Goues. Overfitting in semantics-based automated program repair. Empirical Software Engineering, 23(5):3007–3033, 2018.
[33] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Transactions on Software Engineering, 41(12):1236–1256, 2015.
[34] J. Lee, S. Hong, and H. Oh. MemFix: static analysis-based repair of memory deallocation errors for C. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 95–106, 2018.
[35] Y. Lee, H. Kwon, S.-H. Choi, S.-H. Lim, S. H. Baek, and K.-W. Park. Instruction2vec: efficient preprocessor of assembly code to detect software weakness with CNN. Applied Sciences, 9(19):4086, 2019.
[36] Y. J. Lee, S.-H. Choi, C. Kim, S.-H. Lim, and K.-W. Park. Learning binary code with deep learning to detect software weakness. In KSII The 9th International Conference on Internet (ICONI) 2017 Symposium, 2017.
[37] H. Li, H. Kwon, J. Kwon, and H. Lee. A scalable approach for vulnerability discovery based on security patches. In International Conference on Applications and Techniques in Information Security, pages 109–122. Springer, 2014.
[38] Y. Li, S. Ji, C. Lv, Y. Chen, J. Chen, Q. Gu, and C. Wu. V-Fuzz: Vulnerability-oriented evolutionary fuzzing. arXiv:1901.01142 [cs], 2019.
[39] Y. Li, S. Wang, and T. N. Nguyen. Fault localization with code coverage representation learning.
In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 661–673. IEEE, 2021.
[40] Y. Li, S. Wang, and T. N. Nguyen. Vulnerability detection with fine-grained interpretations. arXiv preprint arXiv:2106.10478, 2021.
[41] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong. VulDeePecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681, 2018.
[42] G. Lin, W. Xiao, J. Zhang, and Y. Xiang. Deep learning-based vulnerable function detection: A benchmark. In International Conference on Information and Communications Security, pages 219–232. Springer, 2019.
[43] T. Lutellier. Machine Learning for Software Dependability. PhD thesis, University of Waterloo, 2020.
[44] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, and L. Tan. CoCoNuT: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 101–114, 2020.
[45] M. J. Michl. Analyse sicherheitsrelevanter Designfehler in Software hinsichtlich einer Detektion mittels Künstlicher Intelligenz. PhD thesis, Technische Hochschule, 2021.
[46] MITRE. Common vulnerabilities and exposures, 2005.
[47] H. N. Nguyen, S. Teerakanok, A. Inomata, and T. Uehara. The comparison of word embedding techniques in RNNs for vulnerability detection. ICISSP 2021, 2021.
[48] V. P. L. Oliveira, E. F. Souza, C. Le Goues, and C. G. Camilo-Junior. Improved crossover operators for genetic programming for program repair. In International Symposium on Search Based Software Engineering, pages 112–127. Springer, 2016.
[49] V. P. L. Oliveira, E. F. de Souza, C. Le Goues, and C. G. Camilo-Junior.
Improved representation and genetic operators for linear genetic programming for automated program repair. Empirical Software Engineering, 23(5):2980–3006, 2018.
[50] T. Parr. ANTLR, 2021. URL https://www.antlr.org/.
[51] R. Raducu, G. Esteban, F. J. Rodríguez Lera, and C. Fernández. Collecting vulnerable source code from open-source repositories for dataset generation. Applied Sciences, 10(4):1270, 2020.
[52] J. Renzullo, W. Weimer, and S. Forrest. Multiplicative weights algorithms for parallel automated software repair. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 984–993. IEEE, 2021.
[53] A. Ribeiro, P. Meirelles, N. Lago, and F. Kon. Ranking warnings from multiple source code static analyzers via ensemble learning. In Proceedings of the 15th International Symposium on Open Collaboration, pages 1–10, 2019.
[54] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. McConley. Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 757–762. IEEE, 2018.
[55] R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad. Elixir: Effective object-oriented program repair. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 648–659. IEEE, 2017.
[56] M. Saletta and C. Ferretti. A neural embedding for source code: Security analysis and CWE lists. In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pages 523–530, 2020. doi: 10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00095.
[57] A. Savchenko, O. Fokin, A. Chernousov, O. Sinelnikova, and S. Osadchyi. DeeDP: vulnerability detection and patching based on deep learning.
Theoretical and Applied Cybersecurity, 2, 2020.
[58] C. D. Sestili, W. S. Snavely, and N. M. VanHoudnos. Towards security defect prediction with AI. arXiv preprint arXiv:1808.09897, 2018.
[59] SonarSource. SonarCloud, 2008. URL https://sonarcloud.io/.
[60] E. Soremekun, L. Kirschner, M. Böhme, and A. Zeller. Locating faults with program slicing: an empirical analysis. Empirical Software Engineering, 26(3):1–45, 2021.
[61] G. Stergiopoulos, P. Petsanas, P. Katsaros, and D. Gritzalis. Automated exploit detection using path profiling: The disposition should matter, not the position. In 2015 12th International Joint Conference on e-Business and Telecommunications (ICETE), volume 04, pages 100–111, 2015.
[62] G. Stergiopoulos, P. Katsaros, and D. Gritzalis. Execution path classification for vulnerability analysis and detection. E-Business and Telecommunications. ICETE 2015. Communications in Computer and Information Science, 585, 2016.
[63] S. Suneja, Y. Zheng, Y. Zhuang, J. Laredo, and A. Morari. Learning to map source code to software vulnerability using code-as-a-graph. arXiv preprint arXiv:2006.08614, 2020.
[64] G. Szekely. E-statistics: The energy of statistical samples. Preprint, 01 2003.
[65] A. Tanwar, K. Sundaresan, P. Ashwath, P. Ganesan, S. K. Chandrasekaran, and S. Ravi. Predicting vulnerability in large codebases with deep code representation. arXiv preprint arXiv:2004.12783, 2020.
[66] A. Tanwar, H. Manikandan, K. Sundaresan, P. Ganesan, Sathish, and S. Ravi. Multi-context attention fusion neural network for software vulnerability identification. arXiv preprint arXiv:2104.09225, 2021. URL https://arxiv.org/abs/2104.09225.
[67] J. Tian, J. Zhang, and F. Liu. BBRegLocator: A vulnerability detection system based on bounding box regression.
In 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 93–100. IEEE, 2021.
[68] L. Trujillo, O. M. Villanueva, and D. E. Hernandez. A novel approach for search-based program repair. IEEE Software, 38(4):36–42, 2021.
[69] J.-A. Văduva, I. Culi, A. Radovici, R. Rughinis, and S. Dascalu. Vulnerability analysis pipeline using compiler based source to source translation and deep learning. eLearning & Software for Education, 1, 2020.
[70] N. Visalli, L. Deng, A. Al-Suwaida, Z. Brown, M. Joshi, and B. Wei. Towards automated security vulnerability and software defect localization. In 2019 IEEE 17th International Conference on Software Engineering Research, Management and Applications (SERA), pages 90–93. IEEE, 2019.
[71] W. Weimer, J. Davidson, S. Forrest, C. Le Goues, P. Pal, and E. Smith. Trusted and resilient mission operations. Technical report, University of Michigan, Ann Arbor, United States, 2020.
[72] E. C. Wikman. Static analysis tools for detecting stack-based buffer overflows. Master's thesis, Naval Postgraduate School, 2020.
[73] Y. Wu, J. Lu, Y. Zhang, and S. Jin. Vulnerability detection in C/C++ source code with graph representation learning. In 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), pages 1519–1524, 2021. doi: 10.1109/CCWC51732.2021.9376145.
[74] Y. Xu, B. Huang, X. Zou, and L. Kong. Predicting effectiveness of generate-and-validate patch generation systems using random forest. Wuhan University Journal of Natural Sciences, 23(6):525–534, 2018.
[75] H. Yan, S. Luo, L. Pan, and Y. Zhang. HAN-BSVD: a hierarchical attention network for binary software vulnerability detection. Computers & Security, page 102286, 2021. ISSN 0167-4048. doi: 10.1016/j.cose.2021.102286.
[76] G. Yang, Y. Jeong, K. Min, J.-w. Lee, and B. Lee. Applying genetic programming with similar bug fix information to automatic fault repair. Symmetry, 10(4):92, 2018.
[77] E. Zhu, V. Markovtsev, aastafiev, W. Łukasiewicz, ae foster, J. Martin, Ekevoo, K. Mann, K. Joshi, S. Thakur, S. Ortolani, Titusz, V. Letal, Z. Bentley, and fpug. ekzhu/datasketch: Improved performance for MinHash and MinHashLSH, Dec. 2020. URL https://doi.org/10.5281/zenodo.4323502.
[78] K. Zhu, Y. Lu, and H. Huang. Scalable static detection of use-after-free vulnerabilities in binary code. IEEE Access, 8:78713–78725, 2020.