KFC-QR: A Knowledge Fusion Constraint Method
                                Based on Query Retrieval for CPE prediction
                                Jinrui Zhang1 , Linyi Han1 and Xiaowang Zhang1,∗
                                1
                                    College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China


                                               Abstract
                                               Textual vulnerability description (TVD) is the record of software vulnerability by software engineers.
                                               Software engineers use the common platform enumerations (CPE) in TVDs to find software related to the
                                               vulnerability record. However, CPE is always incomplete, which makes it difficult to comprehensively
                                               identify the software affected by the vulnerability. Existing work focuses on completing CPE through CPE
                                               prediction based on the Knowledge Graph Embedding (KGE) model. However, the KGE model cannot
                                               capture the potential connections between softwares in TVD. In this poster, we propose a knowledge
                                               fusion constraint method based on query retrieval. Firstly, we extract the subgraph related to TVDs
                                               from the graph, which contains known CPEs and vulnerability types. Then, we use a large language
                                               model (LLM) combined with the subgraph to rerank the candidate CPEs predicted by the KGE model.
                                               Experiments show that our framework improves the accuracy of TVD-CPE link prediction, providing
                                               valuable support for software developers in product security assessment.

                                               Keywords
                                               CPE Prediction, Vulnerability Knowledge Graph, Large Language Model


                                1. Introduction
                                Textual vulnerability description (TVD) is the record of software vulnerability by software
                                engineers. Common platform enumeration[1] (CPE) is a standardized naming convention
                                in TVD for uniquely identifying software, hardware, and applications. Software engineers
                                can find out the affected products through CPE in TVD. However, a TVD corresponds to
                                multiple CPEs, and it takes several weeks to find all CPEs in the TVD. Therefore, it is essential
                                to predict undiscovered CPEs in the TVD. Shi et al.[2] construct a vulnerability knowledge
                                graph and use the TransE[3] model to predict the link between TVD and CPE. ULTRA[4] is
                                an approach for learning universal and transferable graph representations. It builds relational
                                representations as a function conditioned on their interactions, which allows a pre-trained
                                ULTRA model to inductively generalize to unseen KGs. Alfasi et al.[5] propose the VulnScopper
                                framework, which uses a large language model(LLM) to represent TVD and then incorporate the
                                vulnerability embeddings into the ULTRA model for training. However, they do not consider
                                the relationship between softwares affected in TVD. We observe that CPEs in the same TVD
                                usually have similar software architectures or vulnerability types. In this poster, we construct a
                                knowledge fusion constrained framework based on query retrieve (KFC-QR) to predict CPEs.
                                Given a TVD, firstly, we generate candidate CPEs through the KGE model. Secondly, we retrieve
                                Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
                                ∗
                                    Corresponding author.
                                Envelope-Open jinrui_zhang@tju.edu.cn (J. Zhang); hanly2@tju.edu.cn (L. Han); xiaowangzhang@tju.edu.cn (X. Zhang)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
Figure 1: Overview of the KFC-QR framework.


subgraphs of similar CVEs from the vulnerability knowledge graph, which contains known
CPEs and vulnerability types. Thirdly, we use LLM for re-ranking CPEs and average the scores
with those output by the KGE model to obtain the final prediction results. Experiments show
that our framework improves the accuracy of TVD-CPE link prediction.


2. Approach
2.1. Problem Definition
Our task is to link prediction between TVD and CPE. Formally, the vulnerability knowledge
graph can be represented as G = {E, R}, where E and R represent the set of entities and relations
in G, respectively. Given a TVD 𝑥𝑖 , the link prediction task ranks all entities as 𝐸𝑐 = (𝑒1 , 𝑒2 , ..., 𝑒𝑛 )
by calculating their scores as 𝑆𝑐 = (𝑠1 , 𝑠2 , ..., 𝑠𝑛 ).

2.2. Approach Overview
Our approach consists of the KGE model, the query-related subgraph retrieval module and the
knowledge fusion constrained inference module. We use ULTRA as the KGE model, which
allows KFC-QR to link predictions for unseen entities. We use the KGE model to predict
candidate CPEs for the given TVD. The Query-related Subgraph Retrieval module is used to
retrieve CVE subgraphs that are similar to the given TVD, providing query-related knowledge.
The Knowledge Fusion Constrained Inference module is used to fuse candidate CPEs, query
related knowledge into a prompt and instruct the LLM to rerank the candidate CPEs.

2.3. Query-related Subgraph Retrieval
This module is used to retrieve the subgraph for a given TVD, providing known CPEs and vul-
nerability types. In the transductive setting, we directly use the TVD to query the vulnerability
Table 1
Link-prediction of TVD to CPE. ChatGPT 4 MRR is marked with ”X” since it returns only the top 10
results.
               Setting             Model           MRR     Hits@1     Hits@3    Hits@10
                                  TransE           0.223    0.116      0.230    0.387
                                ChatGPT 4            X      0.120      0.145    0.183
                            VulnScopper(SOTA)      0.573    0.535      0.594    0.680
              inductive
                                   Ours            0.623    0.584      0.653    0.706
                            Ours KGE(ULTRA)        0.531    0.492      0.556    0.637
                            Ours w/o Subgraph      0.598    0.557      0.633    0.688
                                  TransE           0.310    0.254      0.380    0.455
                                ChatGPT 4            X      0.127      0.157    0.180
                            VulnScopper(SOTA)      0.651    0.533      0.706    0.768
             transductive
                                   Ours            0.693    0.580      0.744    0.780
                            Ours KGE(ULTRA)        0.553    0.483      0.577    0.683
                            Ours w/o Subgraph      0.673    0.550      0.714    0.771


knowledge graph to get the subgraph. In the inductive setting, we retrieve the TVDs in the
vulnerability knowledge graph that are similar to the given TVD and use the subgraphs of
similar TVDs to provide knowledge for inference. We calculate the cosine similarity between
TVD embeddings to obtain candidate similar TVDs. Then we extract the affected products of
the candidate TVDs and filter the final similar TVDs by rules.

2.4. Knowledge Fusion Constrained Inference
Due to the natural language understanding capability of the LLM, we construct the vulnerability
knowledge as prompts and instruct the LLM to rerank the candidate CPEs. We convert the
triplet information of the TVD subgraph into natural language format using predefined rules,
which allows LLMs to understand the knowledge more accurately. Additionally, we replace the
original abstract symbols with the actual meanings of the common weakness enumeration[6].
We replace complex CPEs with simple letters to reduce the omission and fabrication problems
in LLM generation.


3. Experiments
We retrieve available vulnerability knowledge from NVD[7] API and construct it into a vulnera-
bility knowledge graph following the method of Shi et al. Our dataset has a total of 653,319
triples, of which 309,139 are TVD-CPE triples. In the transductive setup, we use 4,971 TVD-CPE
triples as the test set and ensure that the entities in the test set have appeared in the training set.
In the inductive setup, we divide the graph temporally, the training set contains triples up to
October 1st, 2023. The triples from October 1st, 2023 to April 17th, 2024 is used as the test set.
We use MRR and Hits@K as evaluation metrics. Table 1 presents the inductive and transductive
link prediction results on the test set. The evaluation results indicate that KFC-QR performs
Figure 2: Hits@k results of ULTRA model.


better than all the other methods in predicting the link between CPE and TVD. We also conduct
ablation experiments to validate the effectiveness of the query-related subgraph module.


4. Limitation
Our model relies on the performance of the KGE model because we rerank the top K candidate
CPEs predicted by the KGE model. As shown in Figure 2, considering the accuracy of the model
and the size of the candidate CPEs, we set k=20. We rerank the top 20 candidate CPEs predicted
by the KGE model. In the application of other datasets, it is necessary to dynamically adjust
the K value according to the performance of the model to make the candidate CPEs cover the
correct CPEs as much as possible.


5. Conclusion
In this poster, we propose KFC-QR for link predictions between TVD and CPE, which considers
the similarity between CPEs affected by the same TVD. Experiments demonstrate that the
framework improves the link prediction accuracy between TVDs and CPEs. In the future, we
plan to extend this framework to link prediction for TVD and common weakness enumeration
to verify generalization.


Acknowledgement
This work was supported by the Project of Science and Technology Research and Development
Plan of China Railway Corporation (N2023J044).
References
[1] MITRE, Official common platform enumeration (cpe) dictionary, 2024. URL: https://nvd.nist.
    gov/products/cpe.
[2] Z. Shi, N. Matyunin, K. Graffi, D. Starobinski, Uncovering cwe-cve-cpe relations with threat
    knowledge graphs, ACM Transactions on Privacy and Security 27 (2024) 1–26.
[3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings
    for modeling multi-relational data, Advances in neural information processing systems 26
    (2013).
[4] M. Galkin, X. Yuan, H. Mostafa, J. Tang, Z. Zhu, Towards foundation models for knowledge
    graph reasoning, in: The Twelfth International Conference on Learning Representations,
    2024.
[5] D. Alfasi, T. Shapira, A. B. Barr, Unveiling hidden links between unseen security entities,
    arXiv preprint arXiv:2403.02014 (2024).
[6] MITRE, Common weakness enumeration (cwe), 2024. URL: https://cwe.mitre.org.
[7] NVD, National vulnerability database (nvd), 2024. URL: https://nvd.nist.gov/general.