<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompt engineering for tail prediction in domain-specific knowledge graph completion tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Mario Ricardo Bara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Babes-Bolyai University, Business Informatics Research Center</institution>
          ,
          <addr-line>Cluj-Napoca</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we propose a system architecture for tackling the LM-KBC Challenge 2024 [1] task as a tail prediction problem in a knowledge graph completion setup. We experiment with several prompting techniques as suggested in the TELER taxonomy [2]. We find that, under the few-shot paradigm, using In-Context Learning to supply the LLM with valuable public information about the query subject and expert knowledge about the relation under study improves tail prediction performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>LLMs are capable of answering questions related to their training data - thus general knowledge questions - but they are limited and known to hallucinate when responding to questions on specific data outside their training.</p>
      <p>
        In the proposed challenge we face a tail prediction problem on a KB containing specific data.
Thus, the designed system should respond to the following question: given a specific knowledge
base supplied as a training dataset, how can we enhance the tail entity prediction accuracy,
leveraging an LLM of limited capacity? This research question is challenging in itself, as tail
identification could be solved as a reasoning problem and LLMs’ ability for logical reasoning is
unclear [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, LLMs are known to hallucinate [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] when responding to questions outside the coverage of their training data.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the authors have shown that prompt engineering has a massive influence on the
capacity of various LLMs to generate triples from free text. Therefore, in our system we
employ multiple prompting techniques, as suggested in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], transmit to the LLM relevant
information from the training data and the real world, and combine the LLM response with
publicly available knowledge in order to enhance its performance on the tail prediction task.
      </p>
      <p>The structure of the paper is as follows: Section 2 reviews related work in this field. Section 3
describes our research methodology, including the dataset and validation strategy, the processing
steps followed by our system, and the prompts used to interrogate the LLM. Section 4 presents
the results obtained by running the system with various prompts and the expected system
performance on unseen data. The last section concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>In this section, we aim to provide an overview of existing research efforts related to knowledge
base construction tasks and the application of LLMs in this domain.</p>
      <p>
        Language models have significantly transformed research in natural language
processing (NLP) in recent years. The field has evolved from optimizing predictors and architectures to
the paradigm of pre-training and model optimization, and recently to the paradigm where a
pre-trained model is queried and directly provides a prediction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Due to the large datasets used for training, pre-trained language models contain various types of
knowledge stored in their parameters, thus becoming possible alternatives to traditional knowledge
bases. The latter do not offer the same flexibility in terms of information eloquence and
expansion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. At the same time, creating and maintaining knowledge bases is a complex process
that involves a series of challenges [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An important aspect is the effort required to create a
knowledge base, which involves extracting relational data from large unstructured texts through
complex NLP pipelines. Among the advantages of language models are the fact that they do not
require the creation of a validation schema and their ability to handle queries from multiple
domains [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        The use of language models as knowledge bases relies on a series of strategies:
integrating knowledge at pre-training time, connecting an external knowledge base to a pre-trained
model, using an attention mechanism, and extracting the necessary nodes
from a knowledge graph [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        A report on the applicability of language models in specific knowledge base tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] identifies
as a major impediment the interpretation and updating of knowledge stored in the model’s
parameters. Unlike traditional knowledge bases, the information in a pre-trained language
model is stored in a loosely structured manner, making it difficult to control.
      </p>
      <p>The authors of the same report believe that future research should focus on three important
aspects: improving the interpretability of language models, increasing their ability to store and
extract specific information, and ensuring consistency of responses regardless of the questions
asked or the language used.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section we present the methodology of our study. The source code of the system and the
results over the test dataset are available on GitHub at https://github.com/davidebara/lm-kbc.
Experiments were performed on a pay-as-you-go Google Colab subscription, using an L4 GPU
with Python 3.10.12.</p>
      <p>Our final architecture uses Meta’s Llama 3 8B Instruct model, the best-ranked LLM with
fewer than 10B parameters in LMSYS’ Chatbot Arena.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets and validation strategy</title>
        <p>
          Challenge organizers supplied the labeled data (triples with known object entities) in two
files, making available in total 755 unique subject entities linked via 5 relations to a larger
number of object entities [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The train dataset contains 377 subject entities, while the validation
dataset contains 378 subject entities. To design a validation strategy that lets us assess
the confidence in our results, we merged these 2 datasets and created 50 train/validation
splits using stratified random sampling, keeping the same number of subject entities in
each split and preserving the given distribution of subject entities per relation. All results
reported below are average values computed on the validation datasets over the 50 experiments.
The final submitted system makes use of the full training data available.
        </p>
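<p>The stratified resampling scheme above can be sketched as follows. This is a minimal illustration; the data layout and the helper names are our own, not the competition framework’s:</p>

```python
import random


def stratified_split(entities, n_train, seed):
    """Split subject entities into train/validation sets, preserving the
    per-relation distribution of subjects.
    `entities` maps each relation to the list of its subject entities."""
    rng = random.Random(seed)
    train, valid = [], []
    total = sum(len(v) for v in entities.values())
    for relation, subjects in entities.items():
        subjects = subjects[:]
        rng.shuffle(subjects)
        # keep the same per-relation proportion in the train split
        k = round(n_train * len(subjects) / total)
        train += [(s, relation) for s in subjects[:k]]
        valid += [(s, relation) for s in subjects[k:]]
    return train, valid


# 50 resampled experiments, as in the validation strategy described above
# (the tiny toy dataset here is purely illustrative)
splits = [stratified_split({"listedOn": ["A", "B", "C", "D"],
                            "hasCapital": ["E", "F"]}, n_train=3, seed=i)
          for i in range(50)]
```

<p>Each of the 50 splits is then used for one experiment, and the reported scores are averaged over all splits.</p>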
        <p>As the dataset contains a very limited number of subject entities and the available computing
capabilities were very limited, preventing us from fine-tuning an LLM of sufficient capacity, we
decided to use prompting techniques together with publicly available information to enhance
the performance of the tail prediction task. To this end, we turned to Wikidata, a free and
open knowledge base which can be seen as a huge knowledge graph and can be interrogated
programmatically via SPARQL queries. We performed a manual, implicit relation alignment
task, identifying the Wikidata relations that are semantically relevant to the relations supplied
in the training data.</p>
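<p>As an illustration of this interrogation step, the sketch below builds a query for the stock-exchange relation using Wikidata’s P414 (stock exchange) property and parses the JSON shape returned by the public endpoint. The template and the helper names are our assumptions for illustration, not the exact queries shipped in the relations folder of the repository:</p>

```python
# Sketch of a SPARQL query against Wikidata for the stock-exchange relation;
# P414 is Wikidata's "stock exchange" property.
QUERY_TEMPLATE = """
SELECT ?exchangeLabel WHERE {{
  ?company rdfs:label "{subject}"@en ;
           wdt:P414 ?exchange .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def build_query(subject: str) -> str:
    """Fill the relation-specific SPARQL template with the subject entity."""
    return QUERY_TEMPLATE.format(subject=subject)


def parse_bindings(response: dict) -> list[str]:
    """Extract labels from the JSON structure returned by the Wikidata
    endpoint (https://query.wikidata.org/sparql with format=json)."""
    return [b["exchangeLabel"]["value"]
            for b in response["results"]["bindings"]]
```

<p>At runtime the query would be sent with an HTTP request to the public endpoint; here we only show template construction and response parsing.</p>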
      </sec>
      <sec id="sec-3-2">
        <title>3.2. System architecture</title>
        <p>In this subsection we present the architecture of our system, detailing the processing steps
performed to produce the requested output.</p>
        <p>
          The algorithm in fig. 1 presents the steps pursued to process a test dataset:
function PROCESS-SUBJECT-RELATION-PAIRS(dataset) : returns results
  results ← ∅
  for (subject, relation) in dataset do
    context ← FETCH-KG-CONTEXT(subject, relation)
    prompt ← GENERATE-PROMPT(subject, context, relation)
    prediction ← CALL-LLM(prompt)
    answers ← DISAMBIGUATE(prediction)
    entities ← LINK-TO-REAL-WORLD(answers)
    results ← results ∪ entities
  end for
  return results
end function
(Model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct and https://ollama.com/library/llama3:instruct; Wikidata: https://www.wikidata.org)
The algorithm iterates over all (subject, relation) pairs of the dataset and for each pair it
formats a prompt according to one of the prompt levels of the TELER taxonomy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and sends a query
to the selected LLM with the formatted prompt. Next, predictions supplied by the LLM are
disambiguated and aligned to real-world knowledge extracted from Wikidata. In the last step,
obtained entities are returned as a response.
        </p>
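<p>The pipeline above can be sketched as a Python loop. The function names mirror the pseudocode; the individual steps are injected as callables, standing in for the components described in the rest of this section:</p>

```python
def process_subject_relation_pairs(dataset, fetch_kg_context, generate_prompt,
                                   call_llm, disambiguate, link_to_real_world):
    """Mirror of the PROCESS-SUBJECT-RELATION-PAIRS pseudocode: each step is
    passed in as a callable so the pipeline stays testable with stubs."""
    results = set()
    for subject, relation in dataset:
        context = fetch_kg_context(subject, relation)
        prompt = generate_prompt(subject, context, relation)
        prediction = call_llm(prompt)
        answers = disambiguate(prediction)
        entities = link_to_real_world(answers)
        # accumulate the linked entities for this (subject, relation) pair
        results |= {(subject, relation, e) for e in entities}
    return results
```

<p>In the real system, call_llm queries the selected LLM and link_to_real_world resolves names against Wikidata; here they would be supplied as those concrete implementations.</p>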
        <p>When generating the prompt in the GENERATE-PROMPT function we applied specific prompt
templates for each of the 5 relations supplied in the training dataset. Prompt templates are
found in the _ folder. The LLM receives clear instructions about the task it
needs to solve and some relevant context that could be publicly identified about the subject of
the query. Prompts are described in subsection 3.3.</p>
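<p>The GENERATE-PROMPT step reduces to filling a relation-specific template with the subject entity and appending any retrieved context. The dictionary entry below is a hypothetical sketch reusing the level 1 template quoted in subsection 3.3; the relation key and function signature are our assumptions, and the actual templates live in the repository:</p>

```python
# Sketch of GENERATE-PROMPT: one template per relation, filled with the
# subject entity (the relation key below is illustrative).
PROMPT_TEMPLATES = {
    "companyTradesAtStockExchange":
        "Provide the names of the stock exchanges that list "
        "{subject_entity} shares.",
}


def generate_prompt(subject_entity: str, context: str, relation: str) -> str:
    """Instantiate the relation's template and append retrieved context."""
    prompt = PROMPT_TEMPLATES[relation].format(subject_entity=subject_entity)
    if context:  # context is appended only when retrieval returned something
        prompt += " " + context
    return prompt
```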
        <p>The FETCH-KG-CONTEXT function interrogates Wikidata to discover publicly available
information about the subject. SPARQL query templates are found in the relations folder.
For each relation in the training dataset, we developed a specific query, aiming to incorporate as
much expert knowledge as possible. Responses supplied by the SPARQL query are transmitted
as context for the LLM query. In this function, we perform a logical entity alignment task for a
knowledge graph, with Wikidata as the target.</p>
        <p>We ask the LLM to supply textual information in a given format. For prompt templates that
require output examples, we provide the LLM with two question-response pairs, to indicate
how the textual output should be formatted for both existing and non-existing object entities.
This is essential for prompt levels 5 and 6, which prove to work better than all
the previous ones.</p>
        <p>The DISAMBIGUATE function extracts the entities provided by the LLM from its response, by
parsing the JSON output and removing all unnecessary details. The LINK-TO-REAL-WORLD function
checks the Wikidata knowledge base to determine whether the LLM responded with information
available there. If a match is found, it returns the corresponding entity identifier.</p>
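<p>These two post-processing functions can be sketched as follows. The JSON list output format and the in-memory Wikidata lookup table are simplifying assumptions; the real LINK-TO-REAL-WORLD queries Wikidata live:</p>

```python
import json


def disambiguate(llm_response: str) -> list[str]:
    """Extract entity names from the LLM's JSON output, dropping duplicates,
    empty strings, and the literal "None" marker."""
    names = json.loads(llm_response)
    return sorted({n.strip() for n in names if n and n != "None"})


def link_to_real_world(names, wikidata_index):
    """Keep only names that resolve to a Wikidata identifier.
    `wikidata_index` stands in for the live Wikidata lookup."""
    return [wikidata_index[n] for n in names if n in wikidata_index]
```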
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt engineering</title>
        <p>
          In this section we present the prompt templates formatted according to the TELER taxonomy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
Since the prompts were integrated with the research framework provided by the competition,
the LLM might receive additional text based on the system’s setup.
        </p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Level 0 prompts</title>
          <p>Level 0 prompts only provide the LLM with raw data. In our case, this means the model only
receives the name of the subject entity. As shown in Section 4, the performance
for this prompt is worse than that of the baseline model. No context information is supplied,
either from the training dataset or from the real world.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Level 1 prompts</title>
          <p>Level 1 prompts provide a basic one-sentence directive that states a high-level goal for the LLM.
Our results show that this approach performs slightly better than Level 0 prompting, achieving
results comparable to the baseline model provided in the competition.</p>
          <p>Provide the names of the stock exchanges that list {subject_entity} shares.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Level 2 prompts</title>
          <p>Level 2 prompts include sub-tasks that need to be executed by the LLM in order to achieve
the high-level goal. The results indicate that incorporating a chain-of-thought approach can
improve LLM performance by guiding the model through a clear, logical sequence of steps and
reducing the ambiguity of the initial goal.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state {None}.
Perform the task in distinct steps: extract relevant financial data about
{subject_entity}, verify the stock exchanges where its shares are traded from
reliable sources, and cross-check the information for accuracy. Ensure your response
is clear, accurate, and concise.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Level 3 prompts</title>
          <p>
            Level 3 prompts extend the level 2 prompts by providing the LLM with a structured list of
sub-tasks rather than a paragraph of instructions. Up to this level, the prompts fulfill the
Zero-shot prompting paradigm, as presented in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Level 4 prompts</title>
          <p>Level 4 prompts build upon level 3 prompts, adding guidelines on how the answer will be
evaluated or expected answer examples.</p>
          <p>Here, we move towards a few-shot prompting paradigm, more precisely to In-Context
Learning, by adding relevant examples of a solution to the given task. For each relation type
we provided two examples of how the LLM should respond when prompted about a specific
subject entity, one where a right answer exists and one where it does not.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded
from reliable sources. 3. Cross-check the information for accuracy. 4. Ensure
your response is clear, accurate, and concise. Follow the format of the provided
examples: Example 1: “Provide a comma-separated list of all stock exchanges where
shares of Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss
Exchange.” Example 2: “Provide a comma-separated list of all stock exchanges where
shares of a private company are traded. Response: None.”</p>
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.6. Level 5 prompts</title>
          <p>Level 5 prompts include information supplied in level 4 prompting, with the addition of context
gathered via retrieval-based techniques from Wikidata. As shown in the results section, the
predictive performance of the LLM improves significantly.</p>
          <p>To retrieve relevant context from Wikidata, we created five SPARQL queries, each tailored to
a specific relation type. Regardless of whether entities are found, the results are appended to the
end of the prompt template within the script that runs the model. The template sentence is
structured as follows: “Query results for subject “{subject_entity}” and property
“{relation_type}” on the Wikidata Knowledge Graph: {query_result}”. If no
information is returned, query_result will be the string ”None”.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise. Follow the format of the provided examples:
Example 1: “Provide a comma-separated list of all stock exchanges where shares of
Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares of
a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list.</p>
        </sec>
        <sec id="sec-3-3-7">
          <title>3.3.7. Level 6 prompts</title>
          <p>Level 6 includes all elements of the level 5 prompts, with an explicit statement asking the LLM
to explain its own output. From the obtained results, we notice that the performance of the
LLM does not improve.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise. Follow the format of the provided examples:
Example 1: “Provide a comma-separated list of all stock exchanges where shares of
Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares of
a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list. Justify your response with a detailed explanation of
the sources and reasoning process. If stating ’None’, explain why the information
is unavailable or not applicable.</p>
        </sec>
        <sec id="sec-3-3-8">
          <title>3.3.8. Final prompt template</title>
          <p>As level 5 prompts prove to be the best, we selected them for the final solution. Figure 8
illustrates the selected prompt template. Additionally, the template assigns a specific role to the LLM for
each task and gives clear instructions on how the response should be formatted, to avoid extraneous
tokens in the answers. We also removed extraneous tokens such as numbers or list bullets,
as the text is sent to the LLM as a single paragraph either way.</p>
          <p>You are a financial expert. Provide a comma-separated list of all stock exchanges
where shares of {subject_entity} are traded. Include only the names, and if none,
state “None”. Perform the task in distinct steps: extract relevant financial data
about {subject_entity}, verify the stock exchanges where its shares are traded from
reliable sources and cross-check the information for accuracy. Ensure your response
is clear, accurate, and concise. Follow the format of the provided examples: Example
1: “Provide a comma-separated list of all stock exchanges where shares of Apple
Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares
of a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The lower prompt levels yield modest values for the macro-F1 score, similar to the results of
the baseline model supplied in the competition. This is expected, as the LLM does not get any
additional information from the real world or about the specific topic under discussion. When
we add context to the prompt, we notice that the object retrieval performance rises above 0.9
for the macro-F1 score. Here we notice that only in half of the cases does the LLM produce
responses taken exactly from the context; thus we anticipate that the LLM brings added value
in providing the response.</p>
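<p>For reference, the macro-F1 score used throughout treats each (subject, relation) pair as a set prediction. The sketch below is our reading of the metric, not the official challenge scoring script:</p>

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1 for one (subject, relation) pair; correctly predicting
    the absence of any object entity counts as a perfect match."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # true positives: correctly predicted objects
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0


def macro_f1(pairs) -> float:
    """Average the per-pair F1 over all (predicted, gold) test pairs."""
    scores = [f1(p, g) for p, g in pairs]
    return sum(scores) / len(scores)
```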
      <p>
        With the randomization scheme presented in the methodology, we are able to compute the
confidence interval for the macro-F1 score. Fig. 4 presents the Macro-F1 scores derived from the
50 experiments. As fig. 4 shows, most experiments achieved a Macro-F1 score between 0.91 and
0.94. There were also two experiments whose Macro-F1 scores were below 0.91, highlighting
the issue of inconsistency in the responses provided by language models, which is in line
with existing literature [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The 95% confidence interval ranges from 0.927 to 0.932, with an average Macro-F1 score of
0.929. This indicates that the model performs well and consistently, despite variations in the
training datasets.</p>
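<p>The reported interval follows from the usual normal approximation over the 50 per-experiment macro-F1 scores; a sketch, with synthetic scores for illustration:</p>

```python
import math


def confidence_interval_95(scores):
    """Mean and 95% confidence interval (normal approximation) of the
    per-experiment macro-F1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)  # half-width of the 95% interval
    return mean, mean - half, mean + half
```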
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>As observed in our experiments, the ability of a pre-trained language model to accurately
predict information about a specific entity is limited by the dataset it was trained on. Gradually
optimizing the prompts proved to be a crucial step in improving an LLM’s predictive performance.
Although there were a few exceptions, our model achieved a narrow confidence interval, suggesting
low variation between the responses provided.</p>
      <p>Future research should investigate the entity alignment task between two KGs, which could
drastically improve prediction performance; how an LLM chooses between multiple conflicting
sources of information; the reasoning process behind its answers; and ways to control the
generated answers.
The source files of our system can be accessed through GitHub.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Zhang,</surname>
          </string-name>
          <article-title>Knowledge base construction from pre-trained language models 2022, in: Semantic Web Challenge on Knowledge Base Construction from Pre-trained Language Models, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>K. K. Santu</surname>
          </string-name>
          , D. Feng,
          <article-title>TELeR: A general taxonomy of LLM prompts for benchmarking complex tasks</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
          </string-name>
          et al. (Ed.),
          <source>Findings of the ACL: EMNLP</source>
          <year>2023</year>
          , Singapore,
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>14197</fpage>
          -
          <lpage>14203</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.946.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying Large Language Models and knowledge graphs: A roadmap</article-title>
          ,
          <source>CoRR abs/2306</source>
          .08302 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2306.08302.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. Zhang,</surname>
          </string-name>
          <article-title>LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities</article-title>
          ,
          <source>CoRR abs/2305</source>
          .13168 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2305.13168.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Towards reasoning in large language models: A survey</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          , Toronto, Canada, July 9-
          <issue>14</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>1049</fpage>
          -
          <lpage>1065</lpage>
          . doi:10.18653/v1/2023.findings-acl.67.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Siren's song in the AI ocean: A survey on hallucination in large language models</article-title>
          ,
          <source>CoRR abs/2309</source>
          .01219 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2309.01219.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Iga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Silaghi</surname>
          </string-name>
          ,
          <article-title>Assessing LLMs suitability for knowledge graph completion</article-title>
          ,
          <source>CoRR abs/2405.17249</source>
          (
          <year>2024</year>
          ). doi:10.48550/arXiv.2405.17249.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>AlKhamissi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <article-title>A review on language models as knowledge bases</article-title>
          ,
          <source>CoRR abs/2204.06031</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2204.06031.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Machine knowledge: Creation and curation of comprehensive knowledge bases</article-title>
          ,
          <source>Foundations and Trends in Databases</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>108</fpage>
          -
          <lpage>490</lpage>
          . doi:10.1561/1900000064.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>CoRR abs/1909.01066</source>
          (
          <year>2019</year>
          ). doi:10.48550/arXiv.1909.01066.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <source>CoRR abs/2303.18223</source>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Elazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravfogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravichander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Measuring and improving consistency in pretrained language models</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>1012</fpage>
          -
          <lpage>1031</lpage>
          . doi:10.1162/tacl_a_00410.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>