LLM agents for vulnerability identification and verification of CVEs

Tadesse ZeMicheal1,*, Hsin Chen1, Shawn Davis1, Rachel Allen1, Michael Demoret1 and Ashley Song1
1 NVIDIA
* Corresponding author: tzemicheal@nvidia.com (T. ZeMicheal)

Abstract
Vulnerability management in containerized systems is a labor-intensive and time-consuming process, particularly when dealing with many containers. This process involves the collection, comprehension, and synthesis of various pieces of information to ascertain whether immediate remediation is necessary upon the identification of a new common vulnerability and exposure (CVE). If analysts conclude that remediation is not required, they assign an exemption justification status category from the standardized Vulnerability Exploitability eXchange (VEX) reasoning. This is a manual and time-consuming task. To address this issue, we propose a multi-component system using Large Language Models (LLMs) that automates vulnerability management, verification, and VEX justification. The system uses a Plan-and-Execute-style LLM system for vulnerability impact analysis. The process begins with an LLM planner that generates a context-sensitive task checklist from up-to-date CVE intelligence. This checklist is then executed by an LLM agent equipped with Retrieval-Augmented Generation (RAG) capabilities and tool usage. The gathered information and the agent's findings are subsequently summarized and categorized by additional LLMs to provide a final verdict. The system eliminates the need for manual verification of CVEs in target containers by leveraging the container Software Bill of Materials (SBOM), source code, and documentation as input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed system achieves high accuracy in capturing falsely triggered CVEs and produces final justification summaries on par with human-labeled justifications, indicating the effectiveness of the approach in streamlining vulnerability analysis tasks. We release our code and blueprint reference at github.com/NVIDIA-AI-Blueprints/vulnerability-analysis

Keywords
Vulnerability assessment, LLM, LLM agent

1. Introduction

Modern enterprise applications have complex software dependencies, forming an interconnected web that provides unprecedented functionality, but at the cost of exponentially increasing complexity. Patching software security issues is becoming progressively more challenging: the number of reported security flaws in the common vulnerabilities and exposures (CVE) database hit a record high in 2022 [1]. The National Vulnerability Database (NVD) reported a 17% year-over-year increase in vulnerabilities, with over two hundred thousand cumulative vulnerabilities reported as of the end of 2023 [2]. It is clear that the traditional approach of scanning and patching has become unmanageable. Large Language Models can improve vulnerability remediation while decreasing the load on security teams. While some organizations have begun to explore generative AI to help automate this process, doing so at enterprise scale requires the collection, comprehension, and synthesis of many pieces of information. In recent years, LLM agents have gained attention due to their capability to perform complex tasks autonomously. For example, tool-assisted LLM agents are now capable of performing complex software engineering tasks, such as user interface design, code generation, and test execution [3, 4], and even assist in scientific investigations [5, 6].
A crucial factor enabling these advanced capabilities is the ability to utilize tools. LLM agents exhibit a wide range of capabilities in terms of tool usage and response to feedback. In the domain of cybersecurity, applying LLMs to various security applications has become a new trend [7].

The NVD defines vulnerabilities as a "weakness in the computational logic (e.g., code) found in software and hardware components that, when exploited, results in a negative impact to confidentiality, integrity, or availability" [8]. A vulnerability, characterized by a weakness in a security system, implies that a hypothetical attacker could potentially leverage a flaw or misconfiguration, for example to escalate privileges. In contrast, exploitability denotes the existence of a specific attack vector that can be utilized to gain unauthorized access to sensitive information. Although a vulnerability may be theoretically exploitable, that does not necessarily imply the presence of a feasible exploitation path, which is a critical distinction in vulnerability assessment and risk management. It is therefore crucial for analysts to verify the exploitability of vulnerabilities flagged by CVEs, for example by checking whether the vulnerable packages have been updated to the latest patch or upgraded to recommended versions.

This work introduces techniques and tools for automating software vulnerability verification and identification across large numbers of containers via an LLM agent system. Our focus lies in leveraging all available input artifacts used for creating Docker containers to address CVEs reported by vulnerability scanners. These input artifacts include the Software Bill of Materials (SBOM), source code, and developer documentation. In this setting, a container scanner, such as the Anchore scanner [9], performs a security scan against the container using the latest reported CVEs. Based on the output of the scanner report and a container information dataset, we demonstrate that an LLM agent-powered system can gradually reduce the effort required for cybersecurity analysis and verification triggered by CVEs. To achieve this, we propose a multi-component LLM system comprising an agentic-style LLM guided by sets of checklists generated from scanner reports and CVE intelligence. By harnessing the proposed system, organizations can significantly reduce the amount of human effort needed to analyze multiple containers and CVEs simultaneously, thereby enhancing the efficiency and effectiveness of vulnerability verification processes. In the remainder of this work, we provide background on CVE analysis and AI agents; we then describe the proposed framework implementation in detail and finally present benchmark results on both synthetic data and human-labeled data.

2. Related Works

2.1. LLMs for Vulnerability Detection

The application of LLMs to vulnerability detection has been extensively explored, with a focus on fine-tuning LLMs to identify vulnerable code fragments. Recent studies have shown fine-tuned LLMs performing binary classification of source code fragments to detect vulnerabilities [10, 11, 12]. To evaluate the effectiveness of LLMs in vulnerability detection, Gao et al.
developed comprehensive benchmarks such as VulBench [13]. More recently, prompt-based approaches utilizing GPT-3.5 and GPT-4 have been employed to improve vulnerability detection accuracy [14, 15, 16]. For example, Purba et al. demonstrated the application of several GPT models for vulnerability detection, relying primarily on source code representations and prompt engineering [14]. While these methods have shown promising results, they are not without limitations, including high false positive rates.

2.2. LLM Agents and Security

Several recent studies have shown LLM agent capability at various tasks [3, 4]. LLM agents in cybersecurity have been shown to enhance security capabilities as knowledge assistants [17, 18, 19]. These use cases often rely on the use of tools [20, 21] to guide LLM agents [22, 23]. In contrast, our work applies LLM agents to vulnerability exploitability verification. Building on previous studies, we extend the scope of LLM agent-based vulnerability remediation by incorporating environmental data, in addition to source code, to assess the exploitability of CVEs for target containers. Our approach offers a new perspective on the intersection of LLM agents and vulnerability research, highlighting the potential for LLMs to improve the efficiency and effectiveness of vulnerability exploitability verification.

3. Proposed Model

3.1. Problem Statement

Container vulnerability scanning typically involves generating a scan report using a vulnerability scanner, which produces a list of potential CVEs for the target container [6]. However, this raises critical questions:

• Given a CVE description, is my container vulnerable to the specified vulnerability?
• If I have a specific vulnerable package, under what conditions is the container exploitable?
• Are all detected CVEs actually present in my container?

To address these questions, we break the problem down into three sub-task objectives:

Sub-Task 1: Optimal Checklist Generation. For a given CVE, what is the optimal list of vulnerability and exploitability checks required to determine if the container is vulnerable?
Sub-Task 2: Vulnerability Determination. Based on the generated checklist, do any of the checks satisfy the conditions needed to classify the container as "vulnerable" or "exploitable"?
Sub-Task 3: Vulnerability Summarization and Justification. Based on the workflow output, can we categorize the model's output into a justifiable standard advisory format, such as VEX?

To tackle Sub-Task 1, we propose a novel approach using an LLM as a checklist generator. By crafting well-designed prompts, we task the LLM to generate validation tasks for a given CVE, ensuring a comprehensive evaluation of the container's vulnerability. To address Sub-Task 2, we build upon the success of LLM agents and propose an LLM agent with plan-and-execute capabilities [22, 23]. Leveraging access to various container information, including the SBOM, source code, and documentation, our proposed agent incorporates chain-of-thought reasoning capabilities to determine the container's exploitability and vulnerability. Finally, for Sub-Task 3, we prompt a pretrained Llama3-70B model [24] to categorize the agent's response into a subclass category of the VEX justification format.

3.2. System Workflow

The system has three major components: a checklist generator, an LLM agent planner, and the justification and summary components. We show the overall architectural diagram in Figure 1.
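Read sequentially, Figure 1 corresponds to a simple pipeline over these three components. The following is a minimal sketch of that data flow in Python; the component interfaces and every name in it (CveReport, generate_checklist, run_agent, summarize, classify_vex) are hypothetical illustrations rather than the released blueprint's API.

```python
# Hypothetical sketch of the end-to-end data flow in Figure 1.
# All names are illustrative; the real blueprint wiring differs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CveReport:
    cve_id: str
    description: str
    threat_intel: str

def analyze_cve(
    cve: CveReport,
    generate_checklist: Callable[[CveReport], list[str]],
    run_agent: Callable[[str], str],
    summarize: Callable[[CveReport, list[str]], str],
    classify_vex: Callable[[str], str],
) -> tuple[str, str]:
    """Chain the workflow components for a single CVE."""
    checklist = generate_checklist(cve)                  # checklist generator
    findings = [run_agent(item) for item in checklist]   # LLM agent planner
    summary = summarize(cve, findings)                   # summarization
    return summary, classify_vex(summary)                # VEX justification
```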
The checklist generator is initiated when a vulnerability scan event triggers the workflow by passing in a list of CVEs detected in the container. These results are combined with up-to-date vulnerability and threat intelligence to provide the workflow with real-time information on the specific CVEs and their known exploitation status. The checklist generator creates a list of plans and forwards the information to the LLM agent component. The LLM agent uses the checklist to perform verification and exploitability checks, drawing on container data such as the SBOM, source code, and documentation. In addition to these data sources, the LLM agent has access to tools that help it overcome some of the current limitations of LLMs. For example, a common weakness of LLMs is their difficulty with performing mathematical calculations, which can be overcome by giving them access to calculator tools. For our workflow, we found that the model struggled to compare semantic package version numbers, such as recognizing that version 1.9.1 comes before 1.10. We therefore built a version comparison tool that the agent uses to determine the relationship between package versions. Finally, the check information is passed to the summarization and justification component. The summarization process retains information from the previous steps and concludes with additional reasoning regarding the exploitability of the CVE. The justification process assigns a Vulnerability Exploitability eXchange (VEX) justification to the identified response. These VEX responses follow the format of [25, 26]. We describe the components in detail below.

Figure 1: End-to-end workflow of the proposed system

3.2.1. Checklist Generator Model

The checklist model uses a generative LLM to serve as a surrogate for a cybersecurity analyst. Specifically, it constructs a sequence of steps to check whether a container is vulnerable to a given CVE. The model generates a customized checklist based on the CVE description and threat intelligence information related to the container and the vulnerability. Once created, the checklist items are passed to the LLM agent for further investigation. Consider CVE-2021-41496 as an example: "Buffer overflow in the array_from_pyobj function of fortranobject.c in NumPy < 1.19, which allows attackers to conduct a Denial of Service attack by carefully constructing an array with negative values". A generated checklist for this CVE would be:

1. Check NumPy Version: The vulnerability affects NumPy versions before 1.19.0. What version of NumPy is installed in the Docker container? Is the container running a vulnerable version?
2. Identify Buffer Copy Operations: Review the application code within the Docker container to check for buffer copy operations without size checking. Are there any instances of buffer copying without proper input validation?
3. Assess Input Data Handling: Does the application process potentially untrusted input data that could trigger a buffer overflow? Evaluate how the application handles large or malformed input data.
4. Buffer Overflow Mitigations: Review the application's buffer overflow mitigation measures. Are there any protections in place to prevent buffer overflows, such as input validation, data sanitization, or address space layout randomization (ASLR)?

We designed five criteria for creating checklist items: actionability, simplicity, completeness, context relevance, and achievability using the provided agent tools. See Table 1 for the specific breakdown of each of these criteria.

Table 1: Checklist generation metric criteria
Actionability: Does the checklist provide active and specific steps to be taken?
Simplicity: How simple is the operation to perform?
Completeness: Is the item well-formed and does it contain all needed information?
Context Relevance: Is the item relevant to the given context or scenario?
Achievable: Is the checklist item achievable using the provided source code, SBOM, or documentation of the container?

Additionally, model consistency – that repeated calls to the model generate consistent outputs – is considered alongside these criteria.
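To illustrate how such a checklist generator can be invoked, the sketch below prompts a hosted Llama3-70B model through an OpenAI-compatible endpoint. The prompt wording is an assumption that merely encodes the criteria above, and the endpoint and model identifier reflect the NVIDIA-hosted models listed in Section 4.2; the blueprint's actual prompts and few-shot examples are not reproduced here.

```python
# Hedged sketch of the checklist generator call. The system prompt is
# illustrative; the endpoint and model identifier are assumptions based on
# the hosted models mentioned in Section 4.2, not the blueprint's prompt.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

SYSTEM = (
    "You are a container security analyst. Given a CVE description and "
    "threat intelligence, produce a short numbered checklist of actionable, "
    "simple, complete, context-relevant checks that can be answered using "
    "only the container's SBOM, source code, or documentation."
)

def generate_checklist(cve_description: str, threat_intel: str) -> list[str]:
    resp = client.chat.completions.create(
        model="meta/llama3-70b-instruct",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"CVE: {cve_description}\nIntel: {threat_intel}"},
        ],
        temperature=0.2,
    )
    text = resp.choices[0].message.content
    # Keep only the numbered lines ("1. ...", "2. ...") as checklist items.
    return [ln.strip() for ln in text.splitlines() if ln.strip()[:1].isdigit()]
```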
3.2.2. LLM Agent Planner

The agent planner takes a generated checklist as input and performs the corresponding checks on the target container. We adopt a ReAct agent model [22], which integrates the reasoning capabilities of an LLM with the ability to take actionable steps. The process interleaves "reasoning" and "acting". A typical ReAct agent consists of an LLM as its core, tools for interacting with external information, and chain-of-thought (CoT) prompting, allowing reasoning traces to create and maintain an adjustable action plan. The LLM agent is provided with the following tools to interact with the environment:

1. Web search results
2. SBOM package checker
3. Source code similarity checker tool
4. Documentation retrieval tool
5. Software version comparison tool

During the agent's analysis, the LLM uses in-context learning to select the tools appropriate for a specific task. For example, a run of the LLM agent for CVE-2021-41496 above could look as follows:

> Entering new AgentExecutor chain...
> I need to check if NumPy is present in the Docker container and if so, determine its version to see if it falls within the vulnerable range.
> Action: SBOM Package Checker
> Action Input: NumPy
> Observation: 1.22.2
> Thought: NumPy is present in the Docker container and the version is 1.22.2, which is not vulnerable. I need to compare this version with the vulnerable range to confirm.
> Action: Docker Container software version comparator
> Action Input: 1.22.2,1.19,1.22
> Observation: False
> Thought: The NumPy version 1.22.2 in the Docker container is not vulnerable as it falls outside the vulnerable range of versions prior to 1.19.
> Final Answer: The NumPy version in the Docker container is 1.22.2, which is not vulnerable to the reported CVE.
> Finished chain

As the trace shows, the SBOM Package Checker tool is used to confirm the presence of the NumPy package, and the software version comparator tool is used to compare against the vulnerable NumPy versions before the agent reaches its final answer.

3.2.3. Summarization and Justification Process

Once the agent has compiled findings for each checklist item, the summarization model condenses these results into a concise, human-readable paragraph. This model uses an LLM to perform standard summarization tasks. The summarization process retains information from the previous steps and concludes with additional reasoning regarding the exploitability of the CVE. Following the established standards of the Vulnerability Exploitability eXchange (VEX), the justification model assigns a VEX status to each identified vulnerability based on the summarized findings [26]. Using an LLM for text-to-multiclass classification, this model categorizes CVEs by their exploitability in the given environment, detailing how they might be exploited or why they cannot be exploited. This classification step concludes the pipeline with a single label, which aids informed decision-making and allows for automation in downstream security systems.
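The justification step can be sketched as a constrained single-label classification call. In the sketch below, the label set follows the CycloneDX VEX justification values plus a "vulnerable" label; the pipeline's exact category set (10 non-exploitable reasons, per Section 5.3) and prompt wording may differ, so treat this as an illustrative assumption rather than the blueprint's implementation.

```python
# Hedged sketch of the VEX justification step. The label list follows the
# CycloneDX justification values; the paper's category set may differ, and
# the prompt wording is illustrative only.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

VEX_LABELS = [
    "vulnerable",
    "code_not_present",
    "code_not_reachable",
    "requires_configuration",
    "requires_dependency",
    "requires_environment",
    "protected_by_compiler",
    "protected_at_runtime",
    "protected_at_perimeter",
    "protected_by_mitigating_control",
]

def classify_vex(summary: str) -> str:
    """Map a CVE analysis summary to a single VEX status label."""
    resp = client.chat.completions.create(
        model="meta/llama3-70b-instruct",
        messages=[
            {
                "role": "system",
                "content": "Classify the CVE analysis summary into exactly one of: "
                + ", ".join(VEX_LABELS)
                + ". Respond with the label only.",
            },
            {"role": "user", "content": summary},
        ],
        temperature=0.0,
    )
    label = resp.choices[0].message.content.strip().lower()
    # Fall back conservatively so unrecognized outputs are escalated, not exempted.
    return label if label in VEX_LABELS else "vulnerable"
```

Returning a conservative "vulnerable" label for unrecognized model output keeps unknown cases in front of an analyst rather than silently exempting them; other fallback policies are equally possible.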
4. Experimental Setup

To validate the proposed models, we conduct experiments at various stages of the workflow. First, we measure the efficacy of the checklist model in guiding the agent planner. Next, we evaluate the ability of the LLM agent to accurately identify false positives and how different LLM models compare on the test dataset. Finally, we examine how often the LLM agent's final justification matches human evaluation.

4.1. Dataset

For the experiments, we employed the Anchore vulnerability scanner tool to generate SBOM files and scanner reports, which were used to collect the datasets for our experiments. The tool was used to discover CVEs in a target Docker container and identify all available packages within it. We created two datasets for the test cases.

Synthetic dataset: A synthetic dataset was generated by inserting CVE triggers into the scanner report and manually modifying the versions of vulnerable packages in the container. In total, 45 CVEs were collected using the Tensorflow:23.08 and Morpheus:23.07 release Docker containers.

Human-generated dataset: Based on the output of the Anchore scanner report, a checklist was generated for each triggered CVE. A team of container owners was asked to provide labeled responses for each checklist, informed by container information (SBOM, documentation) and inspection of the target container's base source code. For this setup, a total of 35 CVEs with 96 checklist query pairs were collected from the morpheus:23.11-runtime Docker container.

4.2. Implementation and LLM Models

We build the workflow pipeline using the NVIDIA Morpheus framework [27] on top of generative LLMs. For the experiments we employed open-source LLM models served at [28]. We ran the experiments using the Llama3-8B, Llama3-70B [24], and Mixtral-8x22B [29] models.

5. Evaluation and Results

5.1. Checklist Generation Model

To measure the efficacy of checklist generation at a greater scale than hand-labeling would allow, we leverage LLM-as-a-judge [30]. The first step was to hand-score a selection of checklist items based on the criteria from Table 1. Once this sample was scored, we used LLM-as-a-judge to verify the applicability of scoring the items. Figure 2 shows that both GPT-4 and Llama 3 lined up well with the human labels, with GPT-4 performing better. We measure the squared error gap between the LLM judge and the human judge, observing on average more than 75% agreement on our test dataset. This aligns with the reported human and LLM-judge agreement and the expected human-to-human error gap [30].

Figure 2: Error gap in agreement between human judge and LLM judge

Satisfied with the ability of an LLM to provide insight into these metrics, we ran experiments comparing zero-shot prompts against few-shot prompts (see Figure 3 (a) and (b)). These experiments consisted of generating checklists for six vulnerabilities and averaging each individual metric over the vulnerabilities to give a single score for each prompt type. Using examples in the prompt led to overall better scores compared to the zero-shot approach.

Figure 3: LLM checklist generation scores: (a) zero-shot prompting, (b) multi-shot prompting

In terms of consistency, the first measure was to see whether repeated runs produce identical checklists. Although not identical, the different runs produced largely similar checklists. To explore this further, 18 sets of checklists were generated for 6 different vulnerabilities. Table 2 shows the number of unique checklists generated for each vulnerability. To give a metric for the overall similarity of the generated checklists, the checklist items were shingled into character 4-grams so that a weighted Jaccard distance could be taken between each set of checklists; a minimal sketch of this computation follows.
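The sketch assumes that "weighted" Jaccard means a count-weighted (multiset) Jaccard over the character 4-grams; the paper does not spell out the exact weighting, so this is one plausible reading rather than the reference implementation.

```python
# Consistency metric sketch: character 4-gram shingling with a count-weighted
# (multiset) Jaccard distance. The exact weighting used in the paper is not
# specified, so this is an assumption for illustration.
from collections import Counter

def shingles(text: str, k: int = 4) -> Counter:
    """Count character k-grams of a checklist (items joined into one string)."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def weighted_jaccard_distance(a: str, b: str, k: int = 4) -> float:
    ca, cb = shingles(a, k), shingles(b, k)
    keys = set(ca) | set(cb)
    inter = sum(min(ca[g], cb[g]) for g in keys)
    union = sum(max(ca[g], cb[g]) for g in keys)
    return 1.0 - (inter / union if union else 1.0)

def diameter(checklists: list[str]) -> float:
    """Largest pairwise distance observed within a group of checklists."""
    return max(
        (weighted_jaccard_distance(x, y)
         for i, x in enumerate(checklists)
         for y in checklists[i + 1:]),
        default=0.0,
    )
```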
Using this distance, the diameter (i.e., the largest distance observed between pairs of checklists) of each group was estimated (see Table 3). The observed diameters were within the accepted bound of error, except for CVE-2023-24540 and GHSA-5wvp-7f3h-6wmm, where an extra checklist item was generated in some runs. Adjusting the examples used in the few-shot prompt reduced the diameter to 0.171 for CVE-2023-24540 and to 0.384 for GHSA-5wvp-7f3h-6wmm.

Table 2: Checklist generation count agreement
Vulnerability          Checklists Created    Unique
CVE-2023-24538         18                    7
CVE-2023-24540         18                    11
CVE-2023-29402         18                    8
CVE-2023-29404         18                    6
CVE-2023-29405         18                    12
GHSA-5wvp-7f3h-6wmm    18                    14

Table 3: Jaccard distance of generated checklists
Vulnerability          Group Diameter
CVE-2023-24538         0.195911
CVE-2023-24540         0.460748
CVE-2023-29402         0.140586
CVE-2023-29404         0.248237
CVE-2023-29405         0.233309
GHSA-5wvp-7f3h-6wmm    0.436236

5.2. LLM Agent

In this section, we describe experiments that validate the LLM agent's handling of false positives from container scans and its overall responses to checklist investigations. A false positive trigger occurs when a container scanner (e.g., the Anchore scanner) flags a CVE that does not actually exist in the container, mostly due to a signature mismatch between the SBOM and the reported package. Validating these findings can be a time-consuming task for analysts when the containers are not in fact vulnerable. To address this question, we create a dataset of synthetic SBOM files and scanner reports for a target container. The scanner report consists of both false positive and valid CVEs for the container. The test cases cover packages whose installed version is lower than or equal to the vulnerable version, packages missing from the SBOM, and vulnerable packages present in the SBOM. We measure the system's ability to identify both false positive CVEs and truly vulnerable packages, computing accuracy against ground truth labels that indicate whether the CVE is "vulnerable" or "non-vulnerable" for the target container. We tested three agent models based on Llama3-8B, Mixtral-8x22B, and Llama3-70B against the ground truth dataset. The results in Figure 4 show that the larger models are significantly better than the smaller model at identifying both false positives and vulnerable packages. The small model, Llama3-8B, tends to struggle with using the appropriate tools, e.g., failing to invoke the version comparison tool when comparing versions or failing to parse versions. This is not surprising, as smaller models tend to be worse at function calling than larger models [31].
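Delegating version reasoning to a deterministic tool avoids this failure mode entirely. Below is a minimal sketch of such a comparator using Python's packaging library; the three-argument "installed,start,end" input format is one plausible reading of the agent trace in Section 3.2.2, and the function names are illustrative rather than the blueprint's tool interface.

```python
# Hedged sketch of the version comparison tool. The argument semantics are
# one plausible reading of the "installed,start,end" input shown in the agent
# trace (Section 3.2.2); the actual tool's interface may differ.
from packaging.version import Version, InvalidVersion

def version_in_range(installed: str, start: str, end: str) -> bool:
    """Return True if the installed version lies in the half-open range
    [start, end)."""
    v = Version(installed)
    return Version(start) <= v < Version(end)

def compare_tool(action_input: str) -> str:
    """Agent-facing wrapper: 'installed,start,end' -> 'True'/'False', with an
    explicit error string for unparseable inputs instead of a wrong answer."""
    try:
        installed, start, end = (s.strip() for s in action_input.split(","))
        return str(version_in_range(installed, start, end))
    except (ValueError, InvalidVersion) as exc:
        return f"Error: could not compare versions ({exc})"
```

With this reading, compare_tool("1.22.2,1.19,1.22") returns "False", matching the observation the agent received in the trace above.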
Next, we evaluate the LLM models' agent responses on the human-labeled dataset. The goal is to evaluate the LLM agent's final answer to the input checklist query. We measure performance with a combination of LLM-as-a-judge [30] and ground truth accuracy metrics. For LLM-as-a-judge we use metrics such as context relevance, answer relevance, and groundedness; for ground truth agreement, we compare the agent's response tokens against the provided token labels. The context relevance metric evaluates the alignment of the retrieved context with the query; answer relevance assesses the accuracy of the final answer and measures the extent to which the answer addresses the query; groundedness quantifies the degree to which the answer is supported by evidence from the retrieved documents. In this case, the retrieved documents are source code and documentation. On both answer and context relevance, all LLM agents achieve greater than 80% accuracy, with Llama3-70B the highest. In comparison, the groundedness metric is lower than the other metrics, with the best model averaging around 71%. This is expected because many of the retrieved documents do not necessarily end up being used for the final agent answer. For example, if the answer is not found in the retrieved source code or documentation, the agent might fall back to internet search for answers. Additionally, we measure agent response agreement with the human-provided feedback for the checklist, computing token similarity between the agent response and the human feedback. Overall, the best model, Llama3-70B, achieves an average agreement of 72% with the response feedback.

Figure 4: Evaluation of open-source LLM models for the LLM agent on a human-generated dataset

5.3. Justification and Summary Models

Finally, to measure the efficacy of the overall system, we compare the workflow justification results against ground truth justifications labeled by human annotators. The aim is to assess how often the workflow's justification suggestion agrees with the human justification. Using the best performing agent model, Llama3-70B, we summarize the agent's response and format it as a VEX justification label for end users. We compare the VEX-formatted output against the human-provided justifications in Figures 5 and 6. The labeled dataset consists of 35 CVEs that security analysts investigated against a given software container. Human analysts provided their final verdict on the exploitability (Boolean) of each CVE along with the reasoning categories, chosen from a set of predefined categories (VEX justifications). If the CVE is deemed exploitable, the reasoning category is "vulnerable". If it is not exploitable, there are 10 different reasoning categories to explain why the vulnerability is not exploitable in the given environment. Overall, the accuracy of the pipeline's exploitability prediction is 75.7% (Figure 5). Considering the detailed reasoning for the non-exploitable classes, the pipeline's justification status accuracy is 54.0% (Figure 6). The pipeline has a high precision (92.9%) in predicting the vulnerable CVEs, which can help analysts prioritize and focus on the true positives that require patching before having to investigate the entire batch of CVEs. The decent correlation between the pipeline output and the human labels shown in Figures 5 and 6 demonstrates that the pipeline retrieves relevant information and performs meaningful evaluations of the CVEs' exploitability.

Figure 5: Confusion matrix comparing human exploitability labels with pipeline output.
Figure 6: Confusion matrix comparing human justification status labels with pipeline output.

Evaluating the exploitability of a vulnerability involves many nuances.
The verdict, including both the exploitable-or-not decision and the justification status, often depends on each organization's risk tolerance and each security analyst's perspective on risk evaluation. Even among human analysts, we observed frequent disagreements in the final exploitability and justification status assignments. The results show that using LLMs to assist with CVE analysis and investigation is a promising application. While there is room for improvement in accuracy, the tool already significantly aids and expedites the decision-making process for security analysts by prioritizing alerts and providing context for the investigation.

6. Conclusion and Limitations

In this work, we demonstrate the potential of LLM agents in facilitating vulnerability and exploitability checks of Docker containers. To address these checks, we propose a multi-component LLM agent system. Our results indicate that the proposed workflow can effectively perform Docker container vulnerability checks, leveraging only the container's input configuration and source code. In future work, we plan to extend the agent's capability for code path understanding, such as investigating execution paths through the LLM agent, to further improve exploitability checks. Additionally, with an enriched labeled dataset, we foresee further improvement in the multilabel justification and summary models; to this end, we plan to explore fine-tuning LLMs targeted toward the justification categories. Finally, it is essential to note that this work does not encompass vulnerability verification at the host level or command execution level, such as scenarios requiring the execution of specific commands or access to the host operating system. This work's scope is limited to verification at the resource level, specifically focusing on inputs such as the SBOM, source code, and documentation.

References

[1] CVE Website, CVE common vulnerability exposure, 2024. URL: https://www.cve.org/.
[2] M. Rosen, Skybox security report reveals over 30,000 new vulnerabilities published in past year, https://www.skyboxsecurity.com/company/press-releases/skybox-security-report-reveals-over-30000-new-vulnerabilities-published-in-past-year/, 2023. Accessed: 2024-7-2.
[3] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, O. Press, SWE-agent: Agent-computer interfaces enable automated software engineering, arXiv [cs.SE] (2024).
[4] D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, H. Cui, AgentCoder: Multi-agent-based code generation with iterative testing and optimisation, arXiv [cs.CL] (2023).
[5] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, P. Schwaller, Augmenting large language models with chemistry tools, Nat. Mach. Intell. 6 (2024) 525–535.
[6] D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous scientific research capabilities of large language models, arXiv [physics.chem-ph] (2023).
[7] J. Zhang, H. Bu, H. Wen, Y. Chen, L. Li, H. Zhu, When LLMs meet cybersecurity: A systematic literature review, arXiv [cs.CR] (2024).
[8] National Vulnerability Database, NVD - vulnerabilities, https://nvd.nist.gov/vuln, 2023. Accessed: 2024-7-2.
[9] Anchore, Container vulnerability scanning & management, https://anchore.com/container-vulnerability-scanning/, 2021. Accessed: 2024-6-12.
[10] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, M.
McConley, Automated vulnerability detection in source code using deep representation learning, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018.
[11] N. S. Harzevili, A. B. Belle, J. Wang, S. Wang, Z. M. Jiang, N. Nagappan, A survey on automated software vulnerability detection using machine learning and deep learning, arXiv [cs.SE] (2023).
[12] A. Shestov, R. Levichev, R. Mussabayev, E. Maslov, A. Cheshkov, P. Zadorozhny, Finetuning large language models for vulnerability detection, arXiv [cs.CR] (2024).
[13] Z. Gao, H. Wang, Y. Zhou, W. Zhu, C. Zhang, How far have we gone in vulnerability detection using large language models, arXiv [cs.AI] (2023).
[14] M. D. Purba, A. Ghosh, B. J. Radford, B. Chu, Software vulnerability detection using large language models, in: 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), IEEE, 2023.
[15] X. Zhou, T. Zhang, D. Lo, Large language model for vulnerability detection: Emerging results and future directions, arXiv [cs.SE] (2024).
[16] A. Cheshkov, P. Zadorozhny, R. Levichev, Evaluation of ChatGPT model for vulnerability detection, arXiv [cs.CR] (2023).
[17] V. Jakkal, Introducing Microsoft Security Copilot: Empowering defenders at the speed of AI, https://blogs.microsoft.com/blog/2023/03/28/introducing-microsoft-security-copilot-empowering-defenders-at-the-speed-of-ai/, 2023. Accessed: 2024-6-23.
[18] A. Arora, A. Arora, J. McIntyre, Developing chatbots for cyber security: Assessing threats through sentiment analysis on social media, Sustainability 15 (2023) 13178.
[19] A. Happe, J. Cito, Getting pwn'd by AI: Penetration testing with large language models, arXiv [cs.CL] (2023).
[20] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, M. Sun, ToolLLM: Facilitating large language models to master 16000+ real-world APIs, arXiv [cs.AI] (2023).
[21] Z. Wang, Z. Cheng, H. Zhu, D. Fried, G. Neubig, What are tools anyway? A survey from the language model perspective, 2024.
[22] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing reasoning and acting in language models, arXiv [cs.CL] (2022).
[23] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, S. Yao, Reflexion: Language agents with verbal reinforcement learning, Adv. Neural Inf. Process. Syst. (2023).
[24] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).
[25] Vulnerability Exploitability eXchange (VEX) status justification document (June 2022), https://www.cisa.gov/resources-tools/resources/vulnerability-exploitability-exchange-vex-status-justification-document-june-2022, 2022. Accessed: 2024-6-11.
[26] CycloneDX - Vulnerability Exploitability eXchange (VEX), https://cyclonedx.org/capabilities/vex/, 2024. Accessed: 2024-6-23.
[27] NVIDIA, Morpheus: Morpheus SDK, https://github.com/nv-morpheus/Morpheus, 2024.
[28] Try NVIDIA NIM APIs, https://build.nvidia.com/explore/discover, 2024. Accessed: 2024-6-24.
[29] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024).
[30] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E.
Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-bench and chatbot arena, arXiv [cs.CL] (2023).
[31] F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, J. E. Gonzalez, Berkeley function calling leaderboard, https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html#citation, 2024. Accessed: 2024-6-11.