=Paper= {{Paper |id=Vol-3853/paper10 |storemode=property |title=Prompt Engineering for Tail Prediction in Domain-Specific Knowledge Graph Completion Tasks |pdfUrl=https://ceur-ws.org/Vol-3853/paper10.pdf |volume=Vol-3853 |authors=Davide-Mario-Ricardo Bara |dblpUrl=https://dblp.org/rec/conf/kbclm/Bara24 }} ==Prompt Engineering for Tail Prediction in Domain-Specific Knowledge Graph Completion Tasks== https://ceur-ws.org/Vol-3853/paper10.pdf
                                Prompt engineering for tail prediction in
                                domain-specific knowledge graph completion tasks
                                Davide Mario Ricardo Bara1
                                1
                                    Babes-Bolyai University, Business Informatics Research Center, Cluj-Napoca, Romania


                                                                         Abstract
In this paper we propose a system architecture for tackling the LM-KBC Challenge 2024 [1] task as a
tail prediction problem in a knowledge graph completion setup. We experiment with several prompting
techniques, as suggested by the TELeR taxonomy [2]. We notice that, under the few-shot paradigm, by using
in-context learning to supply the LLM with valuable public information about the query subject
and expert knowledge about the relation under study, we can improve the tail prediction performance.




                                1. Introduction
Knowledge bases (KBs) are structured repositories that store data in a way that is easily
retrievable and useful for various applications. These systems emulate human understanding by
organizing information into a structured format such that machines can process it efficiently.
                                Knowledge graphs (KG) store the facts of a knowledge base as triples (𝑠, 𝑝, 𝑜), where 𝑠 is the
                                subject of the relationship, 𝑝 represents the predicate or the relation and 𝑜 represents the object.
                                Thus, as the subject or the object of each triple could be any entity in the target domain, the
                                KG is in fact a huge graph allowing one to navigate through it. We can distinguish between
                                (i) general and publicly available KGs, like DBPedia, Yago or Wikidata encompassing general
                                knowledge available in the world or (ii) KGs storing specific data related to some particular
                                domain of interest, either from the scientific or the business environment.
KGs are seen as a more expressive way of representing data than traditional databases, as
they allow one to run semantic queries, enabling advanced capabilities of information discovery,
possibly through reasoning. However, in many situations, such KBs and their associated KGs
store incomplete information, or particular KGs are improperly aligned with one another,
although storing the same logical information. Thus, various tasks like triple classification,
head prediction, tail prediction, relation prediction or entity classification are of interest;
they are generally known as knowledge graph completion tasks.
                                    Recent research [3, 4] showed that Large Language Models can be employed to solve such
tasks with increasingly better performance. LLMs are deep learning (DL) models comprising
billions of parameters learned on large amounts of publicly available data, modeling the
generative likelihood of word sequences in order to correctly predict the next token in
a sentence. Generally, they show good performance in responding to questions related

                                KBC-LM ’24: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2024
dmrbara@gmail.com (D. M. R. Bara)
                                                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org




to their training data, i.e., general knowledge questions, but they are limited and known to
hallucinate when answering questions about specific data outside their training.
   In the proposed challenge we face a tail prediction problem on a KB containing specific data.
Thus, the designed system should answer the following question: given a specific knowledge
base supplied as a training dataset, how can we enhance the tail entity prediction accuracy
while leveraging an LLM of limited capacity? This research question is challenging in itself, as
tail identification could be solved as a reasoning problem and LLMs’ ability for logical reasoning is
unclear [5]. Moreover, LLMs are known to hallucinate [6] when answering questions outside
the coverage of their training data.
   In [7], the authors showed that prompt engineering has a major influence on the
capacity of various LLMs to generate triples from free text. Therefore, in our system we
employ multiple prompting techniques, as suggested in [2], supply the LLM with the relevant
information from the training data and the real world, and combine the LLM response with
publicly available knowledge in order to enhance its performance on the tail prediction task.
   The structure of the paper is as follows: Section 2 reviews related work in this field. Section 3
describes our research methodology, including the dataset and validation strategy, the processing
steps followed by our system and the prompts used to interrogate the LLM. Section 4 presents
the results obtained by running the system with various prompts and the expected system
performance over unseen data. The last section concludes the paper.


2. Related work
In this section, we aim to provide an overview of existing research efforts related to knowledge
base construction tasks and the application of LLMs in this domain.
   Language models have significantly transformed research in natural language processing
(NLP) in recent years. The field has evolved from optimizing predictors and architectures to
the paradigm of pre-training followed by model optimization, and recently to the paradigm where a
pre-trained model is queried and directly provides a prediction [8].
   Due to the large datasets used in training, pre-trained language models encode various types of knowl-
edge in their parameters, thus becoming possible alternatives to traditional knowledge
bases, which do not offer the same flexibility in expressing and extending information [8].
At the same time, creating and maintaining knowledge bases is a complex process
that involves a series of challenges [9]. An important aspect is the effort required to create a
knowledge base, which involves extracting relational data from large unstructured texts through
complex NLP pipelines. Among the advantages of language models are the fact that they do not
require the creation of a validation schema and their ability to handle queries from multiple
domains [10].
   Language models have been used as knowledge bases following several strategies: integrating
knowledge at pre-training time, connecting an external knowledge base to a pre-trained
model, using an attention mechanism, and extracting the necessary nodes from a knowledge
graph [8].
   A report on the applicability of language models to specific knowledge base tasks [8] identifies
the interpretation and updating of knowledge stored in the model’s parameters as a major
impediment. Unlike traditional knowledge bases, the information in a pre-trained language
model is stored in a loosely structured manner, making it difficult to control.
  The authors of the same report believe that future research should focus on three important
aspects: improving the interpretability of language models, increasing their ability to store and
extract specific information, and ensuring consistency of responses regardless of the questions
asked or the language used.


3. Methodology
In this section we present the methodology of our study. The source code of the system and the
results over the test dataset are available on GitHub at https://github.com/davidebara/lm-kbc.
Experiments were performed on a Pay-as-you-go Google Colab subscription using a L4 GPU
with Python 3.10.12.
   Our final architecture uses Meta’s Llama 3 8B Instruct1 model, the best-ranked LLM with
fewer than 10B parameters in LMSYS’ Chatbot Arena.

3.1. Datasets and validation strategy
Challenge organizers supplied the labeled data (triples with known object entities) in two
files, making available in total 755 unique subject entities linked via 5 relations to a larger
number of object entities [1]. The train dataset contains 377 subject entities, while the validation
dataset contains 378 subject entities. To design a validation strategy able to assess the
confidence in our results, we merged these two datasets and created 50 train/validation
splits using stratified random sampling, keeping the same number of subject entities in
each split and preserving the given distribution of subject entities per relation. All results
reported below are average values computed on the validation splits over the 50 experiments.
The final submitted system makes use of the full training data available.
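The stratified resampling described above can be sketched as follows; the function name and the toy data are illustrative, not the system's actual code, and the per-relation rounding is a simplification:

```python
import random
from collections import defaultdict

def stratified_split(pairs, train_size, seed):
    """Split (subject, relation) pairs into train/validation sets while
    preserving the per-relation distribution of subject entities.
    Rounding per relation may make the split deviate by a pair or two."""
    random.seed(seed)
    by_relation = defaultdict(list)
    for subject, relation in pairs:
        by_relation[relation].append(subject)
    train, valid = [], []
    total = len(pairs)
    for relation, subjects in by_relation.items():
        random.shuffle(subjects)
        # keep this relation's share of subjects identical in the train split
        k = round(train_size * len(subjects) / total)
        train += [(s, relation) for s in subjects[:k]]
        valid += [(s, relation) for s in subjects[k:]]
    return train, valid

# toy data standing in for the merged 755 (subject, relation) pairs
all_pairs = [(f"subj{i}_{rel}", rel)
             for rel in ("awardWonBy", "personHasCityOfDeath")
             for i in range(10)]
train, valid = stratified_split(all_pairs, 10, seed=0)
```

Repeating the call with 50 different seeds yields the 50 resampled train/validation splits used in the experiments.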
   As the dataset contains a very limited number of subject entities and the available computing
capabilities were also limited, preventing us from fine-tuning an LLM of sufficient
capacity, we decided to use prompting techniques together with publicly
available information to enhance the performance of the tail prediction task. To this end,
we turned to Wikidata2 , a free and open knowledge base which can be seen as a huge knowledge
graph and can be interrogated programmatically via SPARQL queries. We performed a
manual, implicit relation alignment task, identifying the Wikidata relations that are
semantically relevant to the relations supplied in the training data.
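As a sketch of such an alignment, the query below builds a SPARQL request for one relation; the property P414 ("stock exchange") is our illustrative mapping, not necessarily the one used by the actual system, and the query matches subjects by their English label:

```python
# Illustrative alignment from a challenge relation to a Wikidata property.
RELATION_TO_PROPERTY = {"companyTradesAtStockExchange": "P414"}

def build_sparql(subject_label, relation):
    """Build a SPARQL query retrieving the objects linked to a subject
    (matched by its English label) via the aligned Wikidata property."""
    prop = RELATION_TO_PROPERTY[relation]
    return f'''
SELECT ?objectLabel WHERE {{
  ?subject rdfs:label "{subject_label}"@en ;
           wdt:{prop} ?object .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}'''
```

The resulting string can then be sent to the public Wikidata SPARQL endpoint.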

3.2. System architecture
In this subsection we present the architecture of our system, detailing the processing steps
performed to produce the requested output.
   The algorithm in fig. 1 presents the steps pursued to process a test dataset. The algo-
rithm iterates over all (subject, relation) pairs of the dataset and for each pair it formats a
    1
        https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct and https://ollama.com/library/llama3:instruct
    2
        https://www.wikidata.org
  function PROCESS-SUBJECT-RELATION-PAIRS(dataset) : returns results
     results ← ∅
     for (subject, relation) in dataset do
         context ← FETCH-KG-CONTEXT(subject, relation)
         prompt ← GENERATE-PROMPT(subject, context, relation)
         prediction ← CALL-LLM(prompt)
         answers ← DISAMBIGUATE(prediction)
         entities ← LINK-TO-REAL-WORLD(answers)
         results ← results ∪ entities
     end for
     return results
  end function

Figure 1: General algorithm for processing all items of a test dataset


prompt according to one of the prompt levels of the TELeR taxonomy [2] and sends a query
to the selected LLM with the formatted prompt. Next, predictions supplied by the LLM are
disambiguated and aligned to real-world knowledge extracted from Wikidata. In the last step,
obtained entities are returned as a response.
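The loop of fig. 1 translates almost directly into Python; each step is injected as a callable in this sketch (our design choice for illustration, not the system's actual code), so the pipeline stays independent of the LLM and KG back ends:

```python
def process_subject_relation_pairs(dataset, fetch_kg_context, generate_prompt,
                                   call_llm, disambiguate, link_to_real_world):
    """Mirror of the algorithm in fig. 1, with each step passed in
    as a callable."""
    results = set()
    for subject, relation in dataset:
        context = fetch_kg_context(subject, relation)         # Wikidata lookup
        prompt = generate_prompt(subject, context, relation)  # fill template
        prediction = call_llm(prompt)                         # query the LLM
        answers = disambiguate(prediction)                    # parse output
        entities = link_to_real_world(answers)                # map to KG ids
        results |= set(entities)
    return results
```

Any concrete LLM client and Wikidata wrapper can be plugged in through the corresponding parameters.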
   When generating the prompt in the GENERATE-PROMPT function we applied specific prompt
templates for each of the 5 relations supplied in the training dataset. Prompt templates are
found in the prompt_templates folder. The LLM receives clear instructions about the task it
needs to solve and some relevant context that could be publicly identified about the subject of
the query. Prompts are described in subsection 3.3.
   Function FETCH-KG-CONTEXT interrogates Wikidata to discover, where possible, publicly available
information about the subject. SPARQL query templates are found in the relations folder.
For each relation in the training dataset, we developed a specific query, aiming to incorporate as
much expert knowledge as possible. Responses supplied by the SPARQL query are transmitted
as context for the LLM query. In this function, we perform a logical entity alignment task for a
knowledge graph, with Wikidata as the target.
   We ask the LLM to supply textual information in a given format. For prompt templates that
require output examples, we provide the LLM with two question-response pairs, to indicate
how the textual output should be formatted for both existing and non-existing object entities.
This is essential for levels 5 and 6 of prompting, which prove to work better than all
the previous ones.
   The DISAMBIGUATE function extracts the entities provided by the LLM from its response, by
parsing the JSON output and removing all unnecessary details. The LINK-TO-REAL-WORLD function
checks the Wikidata knowledge base to determine whether the LLM responded with information
available there. If a match is found, it returns the corresponding entity identifier.
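A minimal sketch of these two steps, assuming the LLM returns a JSON object with an "answers" list and using an in-memory label-to-identifier dictionary in place of the actual Wikidata lookup (both assumptions of ours):

```python
import json

def disambiguate(llm_response):
    """Pull candidate entity names out of the model's JSON output
    (assumed shape: {"answers": [...]}), dropping blanks and 'None'."""
    try:
        answers = json.loads(llm_response).get("answers", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # unparsable or unexpected output yields no entities
    return [a.strip() for a in answers
            if a.strip() and a.strip().lower() != "none"]

def link_to_real_world(answers, label_to_qid):
    """Map each surface form to an entity identifier when a match exists;
    the dictionary stands in for the actual Wikidata label lookup."""
    return [label_to_qid[a] for a in answers if a in label_to_qid]
```

Surface forms with no Wikidata match are simply dropped, which is one reasonable policy; the real system may handle them differently.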

3.3. Prompt engineering
In this section we present the prompt templates, formatted according to the TELeR taxonomy [2].
Since the prompts were integrated with the research framework provided by the competition,
the LLM might receive additional text based on the system’s setup.
3.3.1. Level 0 prompts
Level 0 prompts only provide the LLM with raw data. In our case, this means the model
only receives the name of the subject entity. As shown in Section 4, the performance
of this prompt is worse than that of the baseline model. No context information is supplied,
either from the training dataset or from the real world.

3.3.2. Level 1 prompts
Level 1 prompts provide a basic one-sentence directive that states a high-level goal for the LLM.
Our results show that this approach performs slightly better than Level 0 prompting, achieving
results comparable to the baseline model provided in the competition.
  Provide the names of the stock exchanges that list {subject_entity} shares.

Figure 2: The level 1 prompt template for the “companyTradesAtStockExchange” relation.



3.3.3. Level 2 prompts
Level 2 prompts include sub-tasks that need to be executed by the LLM in order to achieve
the high-level goal. The results indicate that incorporating a chain-of-thought approach can
improve LLM performance by guiding the model through a clear, logical sequence of steps and
reducing the ambiguity of the initial goal.
  Provide a comma-separated list of all stock exchanges where shares of
  {subject_entity} are traded. Include only the names, and if none, state {None}.
  Perform the task in distinct steps:        extract relevant financial data about
  {subject_entity}, verify the stock exchanges where its shares are traded from
  reliable sources, and cross-check the information for accuracy. Ensure your response
  is clear, accurate, and concise.

Figure 3: The level 2 prompt template for the “companyTradesAtStockExchange” relation.



3.3.4. Level 3 prompts
Level 3 prompts extend the level 2 prompts by providing the LLM with a structured list of
sub-tasks rather than a paragraph of instructions. Up to this level, the prompts follow the
zero-shot prompting paradigm, as presented in [11].
  Provide a comma-separated list of all stock exchanges where shares of
  {subject_entity} are traded. Include only the names, and if none, state “None”.
  Perform the task in distinct steps: 1. Extract relevant financial data about
  {subject_entity}. 2. Verify the stock exchanges where its shares are traded from
  reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
  response is clear, accurate, and concise.

Figure 4: The level 3 prompt template for the “companyTradesAtStockExchange” relation.
3.3.5. Level 4 prompts
Level 4 prompts build upon level 3 prompts, adding guidelines on how the answer will be
evaluated or expected answer examples.
  Here, we move towards the few-shot prompting paradigm, more precisely to in-context
learning, by adding relevant examples of a solution to the given task. For each relation type
we provided two examples of how the LLM should respond when prompted about a specific
subject entity: one where a right answer exists and one where it does not.

  Provide a comma-separated list of all stock exchanges where shares of
  {subject_entity} are traded. Include only the names, and if none, state “None”.
  Perform the task in distinct steps: 1. Extract relevant financial data about
  {subject_entity}.   2.  Verify the stock exchanges where its shares are traded
  from reliable sources. 3. Cross-check the information for accuracy. 4. Ensure
  your response is clear, accurate, and concise. Follow the format of the provided
  examples: Example 1: “Provide a comma-separated list of all stock exchanges where
  shares of Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss
  Exchange.” Example 2: “Provide a comma-separated list of all stock exchanges where
  shares of a private company are traded. Response: None.”

Figure 5: The level 4 prompt template for the “companyTradesAtStockExchange” relation.



3.3.6. Level 5 prompts
Level 5 prompts include information supplied in level 4 prompting, with the addition of context
gathered via retrieval-based techniques from Wikidata. As shown in the results section, the
predictive performance of the LLM improves significantly.
   To retrieve relevant context from Wikidata, we created five SPARQL queries, each tailored to
a specific relation type. Regardless of whether entities are found, the results are appended to the
end of the prompt template within the script that runs the model. The template sentence is struc-
tured as follows: “Query results for subject “{subject_entity}” and property
“{relation_type}” on the Wikidata Knowledge Graph: {query_result}”. If no
information is returned, query_result will be the string ”None”.

  Provide a comma-separated list of all stock exchanges where shares of
  {subject_entity} are traded. Include only the names, and if none, state “None”.
  Perform the task in distinct steps: 1. Extract relevant financial data about
  {subject_entity}. 2. Verify the stock exchanges where its shares are traded from
  reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
  response is clear, accurate, and concise. Follow the format of the provided examples:
  Example 1: “Provide a comma-separated list of all stock exchanges where shares of
  Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
  Example 2: “Provide a comma-separated list of all stock exchanges where shares of
  a private company are traded. Response: None.” Utilize additional information
  fetched through reliable information retrieval techniques to confirm the accuracy
  of the stock exchanges list.

Figure 6: The level 5 prompt template for the “companyTradesAtStockExchange” relation.
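The context-appending step described above can be sketched as a small helper; the function name and the comma-joining of multiple query results are our assumptions, while the template sentence follows the format given in the text:

```python
def append_kg_context(prompt, subject_entity, relation_type, query_result):
    """Append the Wikidata retrieval results to a level-5 prompt using the
    template sentence from the text; an empty result becomes "None"."""
    result_text = ", ".join(query_result) if query_result else "None"
    return (f'{prompt} Query results for subject "{subject_entity}" and '
            f'property "{relation_type}" on the Wikidata Knowledge Graph: '
            f'{result_text}')
```

The context is thus appended regardless of whether any entities were retrieved, matching the behavior described above.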
3.3.7. Level 6 prompts
Level 6 prompts include all elements of the level 5 prompts, plus an explicit statement asking
the LLM to explain its own output. From the obtained results, we notice that this does not
improve the performance of the LLM.

  Provide a comma-separated list of all stock exchanges where shares of
  {subject_entity} are traded. Include only the names, and if none, state “None”.
  Perform the task in distinct steps: 1. Extract relevant financial data about
  {subject_entity}. 2. Verify the stock exchanges where its shares are traded from
  reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
  response is clear, accurate, and concise. Follow the format of the provided examples:
  Example 1: “Provide a comma-separated list of all stock exchanges where shares of
  Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
  Example 2: “Provide a comma-separated list of all stock exchanges where shares of
  a private company are traded. Response: None.” Utilize additional information
  fetched through reliable information retrieval techniques to confirm the accuracy
  of the stock exchanges list. Justify your response with a detailed explanation of
  the sources and reasoning process. If stating ’None’, explain why the information
  is unavailable or not applicable.

Figure 7: The level 6 prompt template for the “companyTradesAtStockExchange” relation.



3.3.8. Final prompt template
As level 5 prompts proved to be the best, we selected them for the final solution. Figure 8
illustrates the selected prompt template. Additionally, it assigns a specific role to the LLM for
each task and gives clear instructions on how the response should be formatted to avoid
extraneous tokens in the answers. We also removed list markers such as numbers or bullets,
as the text is sent to the LLM as a single paragraph either way.

  You are a financial expert. Provide a comma-separated list of all stock exchanges
  where shares of {subject_entity} are traded. Include only the names, and if none,
  state “None”. Perform the task in distinct steps: extract relevant financial data
  about {subject_entity}, verify the stock exchanges where its shares are traded from
  reliable sources and cross-check the information for accuracy. Ensure your response
  is clear, accurate, and concise. Follow the format of the provided examples: Example
  1: “Provide a comma-separated list of all stock exchanges where shares of Apple
  Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
  Example 2: “Provide a comma-separated list of all stock exchanges where shares
  of a private company are traded. Response: None.” Utilize additional information
  fetched through reliable information retrieval techniques to confirm the accuracy
  of the stock exchanges list.

Figure 8: The final prompt template for the “companyTradesAtStockExchange” relation.
Table 1
Results from running the system with various prompting techniques.
        Prompt level   Relation                       Macro-F1   Micro-F1   Avg. number   Empty
                                                                            of preds.     preds.
          Level 0      awardWonBy                        0.029      0.030   10.000      1
                       companyTradesAtStockExchange      0.475      0.488   0.890       18
                       countryLandBordersCountry         0.666      0.719   3.015       4
                       personHasCityOfDeath              0.320      0.202   0.540       49
                       seriesHasNumberOfEpisodes         0.040      0.054   0.480       57
                       All relations                     0.341      0.183   1.312       129
          Level 1      awardWonBy                        0.068      0.041   9.800       0
                       companyTradesAtStockExchange      0.481      0.466   1.140       22
                       countryLandBordersCountry         0.875      0.888   2.500       14
                       personHasCityOfDeath              0.480      0.316   0.400       61
                       seriesHasNumberOfEpisodes         0.150      0.182   0.650       36
                       All relations                     0.453      0.220   1.288       133
          Level 2      awardWonBy                        0.056      0.048   10.700      2
                       companyTradesAtStockExchange      0.559      0.484   1.110       42
                       countryLandBordersCountry         0.950      0.928   2.441       17
                       personHasCityOfDeath              0.450      0.320   0.450       55
                       seriesHasNumberOfEpisodes         0.130      0.173   0.500       51
                       All relations                     0.474      0.230   1.267       167
          Level 3      awardWonBy                        0.068      0.054   11.400      0
                       companyTradesAtStockExchange      0.532      0.522   1.010       18
                       countryLandBordersCountry         0.908      0.906   2.368       17
                       personHasCityOfDeath              0.420      0.321   0.570       43
                       seriesHasNumberOfEpisodes         0.117      0.144   0.670       34
                       All relations                     0.448      0.229   1.323       112
          Level 4      awardWonBy                        0.053      0.049   10.700      2
                       companyTradesAtStockExchange      0.544      0.447   1.090       42
                       countryLandBordersCountry         0.927      0.917   2.500       18
                       personHasCityOfDeath              0.360      0.293   0.610       39
                       seriesHasNumberOfEpisodes         0.100      0.119   0.680       34
                       All relations                     0.434      0.223   1.362       135
          Level 5      awardWonBy                        0.276      0.135   13.300      0
                       companyTradesAtStockExchange      0.997      0.994   0.800       35
                       countryLandBordersCountry         0.914      0.945   2.721       19
                       personHasCityOfDeath              0.937      0.939   0.600       41
                       seriesHasNumberOfEpisodes         0.950      0.960   0.980       2
                       All relations                     0.934      0.416   1.471       97
          Level 6      awardWonBy                        0.273      0.133   13.200      0
                       companyTradesAtStockExchange      0.981      0.969   0.820       36
                       countryLandBordersCountry         0.916      0.947   2.676       19
                       personHasCityOfDeath              0.910      0.900   0.650       38
                       seriesHasNumberOfEpisodes         0.910      0.875   1.080       1
                       All relations                     0.913      0.407   1.505       94



4. Results
Table 1 presents the results of running the architecture presented in subsection 3.2 with various
prompt styles supplied to the Llama 3 8B Instruct model.
  We notice that prompts formatted at levels 1 to 4 give results in the range of 0.43–0.47
for the Macro-F1 score, similar to the baseline results supplied by the LLM. This is
expected, as the LLM does not get any additional information from the real world or about the
specific topic under discussion. When we add context to the prompt, we notice that the object
retrieval performance gets higher than 0.9 for the Macro-F1 score. We also notice that only in
half of the cases does the LLM produce responses taken verbatim from the context; thus we
conclude that the LLM brings added value in providing the response.
   With the randomization scheme presented in the methodology, we are able to compute the
confidence interval for the Macro-F1 score. Figure 9 presents the Macro-F1 scores derived from the
50 experiments. As the figure shows, most experiments achieved a Macro-F1 score between 0.91 and
0.94. There were also two experiments whose Macro-F1 scores were below 0.91, highlighting
the issue of inconsistency in the responses provided by language models, which is in line
with the existing literature [12].
   The 95% confidence interval ranges from 0.927 to 0.932, with an average Macro-F1 score of
0.929. This indicates that the model performs well and consistently, despite variations in the
training datasets.
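The reported interval can be reproduced from the 50 per-experiment Macro-F1 scores with a normal-approximation formula; a minimal sketch (assuming the normal approximation rather than a t-distribution):

```python
import statistics

def confidence_interval_95(scores):
    """Normal-approximation 95% confidence interval for the mean score."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error
    return mean - 1.96 * sem, mean + 1.96 * sem
```

With 50 samples, the normal approximation is close to the exact t-based interval (critical value 2.01 instead of 1.96).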




Figure 9: Histogram of the obtained Macro-F1 scores




5. Conclusion
As observed in our experiments, the ability of a pre-trained language model to accurately
predict information about a specific entity is limited by the dataset it was trained on. Gradually
optimizing the prompts proved to be a crucial step in improving an LLM’s predictive performance.
Although there were a few exceptions, our system achieved a narrow confidence interval, suggesting
low variation between the responses provided.
  Future research should investigate the entity alignment task between two KGs, which could
drastically improve prediction performance; how an LLM chooses between multiple conflicting
sources of information; the reasoning process behind its answers; and ways to control the
generated answers.
References
 [1] J.-C. Kalo, S. Razniewski, T.-P. Nguyen, B. Zhang, Knowledge base construction from
     pre-trained language models 2022, in: Semantic Web Challenge on Knowledge Base
     Construction from Pre-trained Language Models, CEUR-WS, 2024.
 [2] S. K. K. Santu, D. Feng, TELeR: A general taxonomy of LLM prompts for benchmarking
     complex tasks, in: Findings of the Association for Computational Linguistics: EMNLP 2023,
     Singapore, ACL, 2023, pp. 14197–14203. doi:10.18653/V1/2023.FINDINGS-EMNLP.946.
 [3] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying Large Language Models
     and knowledge graphs: A roadmap, CoRR abs/2306.08302 (2023). doi:10.48550/ARXIV.
     2306.08302.
 [4] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, LLMs for
     knowledge graph construction and reasoning: Recent capabilities and future opportunities,
     CoRR abs/2305.13168 (2023). doi:10.48550/ARXIV.2305.13168.
 [5] J. Huang, K. C. Chang, Towards reasoning in large language models: A survey, in: A. Rogers,
     J. L. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Lin-
     guistics: ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational
     Linguistics, 2023, pp. 1049–1065. doi:10.18653/V1/2023.FINDINGS-ACL.67.
 [6] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang,
     A. T. Luu, W. Bi, F. Shi, S. Shi, Siren’s song in the AI ocean: A survey on hallucination in
     large language models, CoRR abs/2309.01219 (2023). doi:10.48550/ARXIV.2309.01219.
 [7] V. I. Iga, G. C. Silaghi, Assessing llms suitability for knowledge graph completion, CoRR
     abs/2405.17249 (2024). doi:10.48550/ARXIV.2405.17249.
 [8] B. AlKhamissi, M. Li, A. Celikyilmaz, M. Diab, M. Ghazvininejad, A review on language
     models as knowledge bases, CoRR abs/2204.06031 (2022). doi:10.48550/ARXIV.2204.
     06031.
 [9] G. Weikum, X. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and
     curation of comprehensive knowledge bases, Foundations and Trends in Databases 10
     (2021) 108–490. doi:10.1561/1900000064.
[10] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, S. Riedel, Language
     models as knowledge bases?, CoRR abs/1909.01066 (2019). doi:10.48550/arXiv.1909.
     01066.
[11] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong,
     Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen,
     A survey of large language models, CoRR abs/2303.18223 (2023). doi:10.48550/ARXIV.
     2303.18223.
[12] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. H. Hovy, H. Schütze, Y. Goldberg,
     Measuring and improving consistency in pretrained language models, Trans. Assoc.
     Comput. Linguistics 9 (2021) 1012–1031. doi:10.1162/TACL_A_00410.



A. Online Resources
The source files of our system can be accessed through GitHub.