<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompt engineering for tail prediction in domain-specific knowledge graph completion tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Mario Ricardo Bara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Babes-Bolyai University, Business Informatics Research Center</institution>
          ,
          <addr-line>Cluj-Napoca</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we propose a system architecture for tackling the LM-KBC Challenge 2024 [1] task as a tail prediction problem in a knowledge graph completion setup. We experiment with several prompting techniques as suggested in the TELER taxonomy [2]. We find that, under the few-shot paradigm, using In-Context Learning to supply the LLM with valuable public information about the query subject and expert knowledge about the relation under study improves tail prediction performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>LLMs are capable of answering questions related to their training data - thus general knowledge questions - but they are limited and known to hallucinate when responding to questions on specific data outside their training.</p>
      <p>
        In the proposed challenge we face a tail prediction problem on a KB containing specific data.
Thus, the designed system should respond to the following question: given a specific knowledge
base supplied as a training dataset, how can we enhance the tail entity prediction accuracy,
leveraging an LLM of limited capacity? This research question is challenging in itself, as tail
identification could be solved as a reasoning problem and LLMs’ ability for logical reasoning is
unclear [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, LLMs are known to hallucinate [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] when responding to questions outside the coverage of their training data.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the authors have shown that prompt engineering has a massive influence on the
capacity of various LLMs to generate triples from free text. Therefore, in our system we
employ multiple prompting techniques, as suggested in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], transmit to the LLM relevant
information from the training data and the real world, and combine the LLM response with
publicly available knowledge in order to enhance its performance on the tail prediction task.
      </p>
      <p>The structure of the paper is as follows: Section 2 reviews related work in this field. Section 3
describes our research methodology, including the dataset and validation strategy, the processing
steps followed by our system, and the prompts used to interrogate the LLM. Section 4 presents
the results obtained by running the system with various prompts and the expected system
performance on unseen data. The last section concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>In this section, we aim to provide an overview of existing research efforts related to knowledge
base construction tasks and the application of LLMs in this domain.</p>
      <p>
        Language models have significantly transformed research in natural language
processing (NLP) in recent years. The field has evolved from optimizing predictors and architectures to
the paradigm of pre-training and model optimization, and recently to the paradigm where a
pre-trained model is queried and directly provides a prediction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Due to the large datasets used for training, pre-trained language models contain various types of
knowledge stored in their parameters, thus becoming possible alternatives to traditional knowledge
bases. The latter do not offer the same flexibility in terms of information eloquence and
expansion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. At the same time, creating and maintaining knowledge bases is a complex process
that involves a series of challenges [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An important aspect is the effort required to create a
knowledge base, which involves extracting relational data from large unstructured texts through
complex NLP pipelines. Among the advantages of language models are the fact that they do not
require the creation of a validation schema and their ability to handle queries from multiple
domains [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        The use of language models as knowledge bases relies on a series of strategies:
integrating knowledge at pre-training time, connecting an external knowledge base to a pre-trained
model, using an attention mechanism, and extracting the necessary nodes
from a knowledge graph [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        A report on the applicability of language models in specific knowledge base tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] identifies
as a major impediment the interpretation and updating of knowledge stored in the model’s
parameters. Unlike traditional knowledge bases, the information in a pre-trained language
model is stored in a loosely structured manner, making it difficult to control.
      </p>
      <p>The authors of the same report believe that future research should focus on three important
aspects: improving the interpretability of language models, increasing their ability to store and
extract specific information, and ensuring consistency of responses regardless of the questions
asked or the language used.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section we present the methodology of our study. The source code of the system and the
results over the test dataset are available on GitHub at https://github.com/davidebara/lm-kbc.
Experiments were performed on a pay-as-you-go Google Colab subscription, using an L4 GPU
with Python 3.10.12.</p>
      <p>Our final architecture uses Meta’s Llama 3 8B Instruct model, the best-ranked LLM with
fewer than 10B parameters in LMSYS’ Chatbot Arena.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets and validation strategy</title>
        <p>
          Challenge organizers supplied the labeled data (triples with known object entities) in two
files, making available in total 755 unique subject entities linked via 5 relations to a larger
number of object entities [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The train dataset contains 377 subject entities, while the validation
dataset contains 378 subject entities. To design a validation strategy that lets us assess
the confidence in our results, we merged these 2 datasets and created 50 train/validation
splits using stratified random sampling, keeping the same number of subject entities in
each split and preserving the given distribution of subject entities per relation. All results
reported below are average values computed on the validation datasets over the 50 experiments.
The final submitted system makes use of the full training data available.
        </p>
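<p>The stratified resampling scheme above can be sketched as follows. This is a minimal illustration; the data layout and the helper names are our own, not the competition framework’s:</p>

```python
import random


def stratified_split(entities, n_train, seed):
    """Split subject entities into train/validation sets, preserving the
    per-relation distribution of subjects.
    `entities` maps each relation to the list of its subject entities."""
    rng = random.Random(seed)
    train, valid = [], []
    total = sum(len(v) for v in entities.values())
    for relation, subjects in entities.items():
        subjects = subjects[:]
        rng.shuffle(subjects)
        # keep the same per-relation proportion in the train split
        k = round(n_train * len(subjects) / total)
        train += [(s, relation) for s in subjects[:k]]
        valid += [(s, relation) for s in subjects[k:]]
    return train, valid


# 50 resampled experiments, as in the validation strategy described above
# (the tiny toy dataset here is purely illustrative)
splits = [stratified_split({"listedOn": ["A", "B", "C", "D"],
                            "hasCapital": ["E", "F"]}, n_train=3, seed=i)
          for i in range(50)]
```

<p>Each of the 50 splits is then used for one experiment, and the reported scores are averaged over all splits.</p>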
        <p>As the dataset contains a very limited number of subject entities and the available computing
capabilities were very limited, preventing us from fine-tuning an LLM of sufficient capacity, we
decided to use prompting techniques together with publicly available information to enhance
the performance of the tail prediction task. To this end, we turned to Wikidata, a free and
open knowledge base which can be seen as a huge knowledge graph and can be interrogated
programmatically via SPARQL queries. We performed a manual, implicit relation alignment
task, identifying the Wikidata relations that are semantically relevant to the relations supplied
in the training data.</p>
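<p>As an illustration of this interrogation step, the sketch below builds a query for the stock-exchange relation using Wikidata’s P414 (stock exchange) property and parses the JSON shape returned by the public endpoint. The template and the helper names are our assumptions for illustration, not the exact queries shipped in the relations folder of the repository:</p>

```python
# Sketch of a SPARQL query against Wikidata for the stock-exchange relation;
# P414 is Wikidata's "stock exchange" property.
QUERY_TEMPLATE = """
SELECT ?exchangeLabel WHERE {{
  ?company rdfs:label "{subject}"@en ;
           wdt:P414 ?exchange .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def build_query(subject: str) -> str:
    """Fill the relation-specific SPARQL template with the subject entity."""
    return QUERY_TEMPLATE.format(subject=subject)


def parse_bindings(response: dict) -> list[str]:
    """Extract labels from the JSON structure returned by the Wikidata
    endpoint (https://query.wikidata.org/sparql with format=json)."""
    return [b["exchangeLabel"]["value"]
            for b in response["results"]["bindings"]]
```

<p>At runtime the query would be sent with an HTTP request to the public endpoint; here we only show template construction and response parsing.</p>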
      </sec>
      <sec id="sec-3-2">
        <title>3.2. System architecture</title>
        <p>In this subsection we present the architecture of our system, detailing the processing steps
performed to produce the requested output.</p>
        <p>
          The algorithm in fig. 1 presents the steps pursued to process a test dataset:
function PROCESS-SUBJECT-RELATION-PAIRS(dataset) : returns results
  results ← ∅
  for (subject, relation) in dataset do
    context ← FETCH-KG-CONTEXT(subject, relation)
    prompt ← GENERATE-PROMPT(subject, context, relation)
    prediction ← CALL-LLM(prompt)
    answers ← DISAMBIGUATE(prediction)
    entities ← LINK-TO-REAL-WORLD(answers)
    results ← results ∪ entities
  end for
  return results
end function
(Model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct and https://ollama.com/library/llama3:instruct; Wikidata: https://www.wikidata.org)
The algorithm iterates over all (subject, relation) pairs of the dataset and for each pair it
formats a prompt according to one of the prompt levels of the TELER taxonomy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and sends a query
to the selected LLM with the formatted prompt. Next, predictions supplied by the LLM are
disambiguated and aligned to real-world knowledge extracted from Wikidata. In the last step,
obtained entities are returned as a response.
        </p>
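<p>The pipeline above can be sketched as a Python loop. The function names mirror the pseudocode; the individual steps are injected as callables, standing in for the components described in the rest of this section:</p>

```python
def process_subject_relation_pairs(dataset, fetch_kg_context, generate_prompt,
                                   call_llm, disambiguate, link_to_real_world):
    """Mirror of the PROCESS-SUBJECT-RELATION-PAIRS pseudocode: each step is
    passed in as a callable so the pipeline stays testable with stubs."""
    results = set()
    for subject, relation in dataset:
        context = fetch_kg_context(subject, relation)
        prompt = generate_prompt(subject, context, relation)
        prediction = call_llm(prompt)
        answers = disambiguate(prediction)
        entities = link_to_real_world(answers)
        # accumulate the linked entities for this (subject, relation) pair
        results |= {(subject, relation, e) for e in entities}
    return results
```

<p>In the real system, call_llm queries the selected LLM and link_to_real_world resolves names against Wikidata; here they would be supplied as those concrete implementations.</p>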
        <p>When generating the prompt in the GENERATE-PROMPT function we applied specific prompt
templates for each of the 5 relations supplied in the training dataset. Prompt templates are
found in the _ folder. The LLM receives clear instructions about the task it
needs to solve and some relevant context that could be publicly identified about the subject of
the query. Prompts are described in subsection 3.3.</p>
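<p>The GENERATE-PROMPT step reduces to filling a relation-specific template with the subject entity and appending any retrieved context. The dictionary entry below is a hypothetical sketch reusing the level 1 template quoted in subsection 3.3; the relation key and function signature are our assumptions, and the actual templates live in the repository:</p>

```python
# Sketch of GENERATE-PROMPT: one template per relation, filled with the
# subject entity (the relation key below is illustrative).
PROMPT_TEMPLATES = {
    "companyTradesAtStockExchange":
        "Provide the names of the stock exchanges that list "
        "{subject_entity} shares.",
}


def generate_prompt(subject_entity: str, context: str, relation: str) -> str:
    """Instantiate the relation's template and append retrieved context."""
    prompt = PROMPT_TEMPLATES[relation].format(subject_entity=subject_entity)
    if context:  # context is appended only when retrieval returned something
        prompt += " " + context
    return prompt
```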
        <p>The FETCH-KG-CONTEXT function interrogates Wikidata to discover publicly available
information about the subject. SPARQL query templates are found in the relations folder.
For each relation in the training dataset, we developed a specific query, aiming to incorporate as
much expert knowledge as possible. Responses supplied by the SPARQL query are transmitted
as context for the LLM query. In this function, we perform a logical entity alignment task for a
knowledge graph, with Wikidata as the target.</p>
        <p>We ask the LLM to supply textual information in a given format. For prompt templates that
require output examples, we provide the LLM with two question-response pairs, to indicate
how the textual output should be formatted for both existing and non-existing object entities.
This is essential for prompt levels 5 and 6, which prove to work better than all
the previous ones.</p>
        <p>The DISAMBIGUATE function extracts the entities provided by the LLM from its response, by
parsing the JSON output and removing all unnecessary details. The LINK-TO-REAL-WORLD function
checks the Wikidata knowledge base to determine whether the LLM responded with information
available there. If a match is found, it returns the corresponding entity identifier.</p>
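<p>These two post-processing functions can be sketched as follows. The JSON list output format and the in-memory Wikidata lookup table are simplifying assumptions; the real LINK-TO-REAL-WORLD queries Wikidata live:</p>

```python
import json


def disambiguate(llm_response: str) -> list[str]:
    """Extract entity names from the LLM's JSON output, dropping duplicates,
    empty strings, and the literal "None" marker."""
    names = json.loads(llm_response)
    return sorted({n.strip() for n in names if n and n != "None"})


def link_to_real_world(names, wikidata_index):
    """Keep only names that resolve to a Wikidata identifier.
    `wikidata_index` stands in for the live Wikidata lookup."""
    return [wikidata_index[n] for n in names if n in wikidata_index]
```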
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt engineering</title>
        <p>
          In this section we present the prompt templates formatted according to the TELER taxonomy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
Since the prompts were integrated with the research framework provided by the competition,
the LLM might receive additional text based on the system’s setup.
        </p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Level 0 prompts</title>
          <p>Level 0 prompts only provide the LLM with raw data. In our case, this means the model only
receives the name of the subject entity. As shown in Section 4, the performance
for this prompt is worse than that of the baseline model. No context information is supplied,
either from the training dataset or from the real world.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Level 1 prompts</title>
          <p>Level 1 prompts provide a basic one-sentence directive that states a high-level goal for the LLM.
Our results show that this approach performs slightly better than Level 0 prompting, achieving
results comparable to the baseline model provided in the competition.</p>
          <p>Provide the names of the stock exchanges that list {subject_entity} shares.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Level 2 prompts</title>
          <p>Level 2 prompts include sub-tasks that need to be executed by the LLM in order to achieve
the high-level goal. The results indicate that incorporating a chain-of-thought approach can
improve LLM performance by guiding the model through a clear, logical sequence of steps and
reducing the ambiguity of the initial goal.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state {None}.
Perform the task in distinct steps: extract relevant financial data about
{subject_entity}, verify the stock exchanges where its shares are traded from
reliable sources, and cross-check the information for accuracy. Ensure your response
is clear, accurate, and concise.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Level 3 prompts</title>
          <p>
            Level 3 prompts extend the level 2 prompts by providing the LLM with a structured list of
sub-tasks rather than a paragraph of instructions. Up to this level, the prompts fulfill the
Zero-shot prompting paradigm, as presented in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Level 4 prompts</title>
          <p>Level 4 prompts build upon level 3 prompts, adding guidelines on how the answer will be
evaluated or expected answer examples.</p>
          <p>Here, we move towards a few-shot prompting paradigm, more precisely to In-Context
Learning, by adding relevant examples of a solution to the given task. For each relation type
we provided two examples of how the LLM should respond when prompted about a specific
subject entity, one where a right answer exists and one where it does not.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded
from reliable sources. 3. Cross-check the information for accuracy. 4. Ensure
your response is clear, accurate, and concise. Follow the format of the provided
examples: Example 1: “Provide a comma-separated list of all stock exchanges where
shares of Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss
Exchange.” Example 2: “Provide a comma-separated list of all stock exchanges where
shares of a private company are traded. Response: None.”</p>
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.6. Level 5 prompts</title>
          <p>Level 5 prompts include information supplied in level 4 prompting, with the addition of context
gathered via retrieval-based techniques from Wikidata. As shown in the results section, the
predictive performance of the LLM improves significantly.</p>
          <p>To retrieve relevant context from Wikidata, we created five SPARQL queries, each tailored to
a specific relation type. Regardless of whether entities are found, the results are appended to the
end of the prompt template within the script that runs the model. The template sentence is
structured as follows: “Query results for subject “{subject_entity}” and property
“{relation_type}” on the Wikidata Knowledge Graph: {query_result}”. If no
information is returned, query_result will be the string ”None”.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise. Follow the format of the provided examples:
Example 1: “Provide a comma-separated list of all stock exchanges where shares of
Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares of
a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list.</p>
        </sec>
        <sec id="sec-3-3-7">
          <title>3.3.7. Level 6 prompts</title>
          <p>Level 6 includes all elements of the level 5 prompts, with an explicit statement asking the LLM
to explain its own output. From the obtained results, we notice that the performance of the
LLM does not improve.</p>
          <p>Provide a comma-separated list of all stock exchanges where shares of
{subject_entity} are traded. Include only the names, and if none, state “None”.
Perform the task in distinct steps: 1. Extract relevant financial data about
{subject_entity}. 2. Verify the stock exchanges where its shares are traded from
reliable sources. 3. Cross-check the information for accuracy. 4. Ensure your
response is clear, accurate, and concise. Follow the format of the provided examples:
Example 1: “Provide a comma-separated list of all stock exchanges where shares of
Apple Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares of
a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list. Justify your response with a detailed explanation of
the sources and reasoning process. If stating ’None’, explain why the information
is unavailable or not applicable.</p>
        </sec>
        <sec id="sec-3-3-8">
          <title>3.3.8. Final prompt template</title>
          <p>As level 5 prompts prove to be the best, we selected them for the final solution. Figure 8
illustrates the selected prompt template. Additionally, the template assigns a specific role to the LLM for
each task and gives clear instructions on how the response should be formatted, to avoid extraneous
tokens in the answers. We also removed extraneous tokens such as numbers or list bullets,
as the text is sent to the LLM as a single paragraph either way.</p>
          <p>You are a financial expert. Provide a comma-separated list of all stock exchanges
where shares of {subject_entity} are traded. Include only the names, and if none,
state “None”. Perform the task in distinct steps: extract relevant financial data
about {subject_entity}, verify the stock exchanges where its shares are traded from
reliable sources and cross-check the information for accuracy. Ensure your response
is clear, accurate, and concise. Follow the format of the provided examples: Example
1: “Provide a comma-separated list of all stock exchanges where shares of Apple
Inc. are traded. Response: NASDAQ, Frankfurt Stock Exchange, Swiss Exchange.”
Example 2: “Provide a comma-separated list of all stock exchanges where shares
of a private company are traded. Response: None.” Utilize additional information
fetched through reliable information retrieval techniques to confirm the accuracy
of the stock exchanges list.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The lower prompt levels yield modest values for the macro-F1 score, similar to the results of
the baseline model supplied in the competition. This is expected, as the LLM does not get any
additional information from the real world or about the specific topic under discussion. When
we add context to the prompt, we notice that the object retrieval performance rises above 0.9
for the macro-F1 score. Here we notice that only in half of the cases does the LLM produce
responses taken exactly from the context; thus we anticipate that the LLM brings added value
in providing the response.</p>
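<p>For reference, the macro-F1 score used throughout treats each (subject, relation) pair as a set prediction. The sketch below is our reading of the metric, not the official challenge scoring script:</p>

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1 for one (subject, relation) pair; correctly predicting
    the absence of any object entity counts as a perfect match."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # true positives: correctly predicted objects
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0


def macro_f1(pairs) -> float:
    """Average the per-pair F1 over all (predicted, gold) test pairs."""
    scores = [f1(p, g) for p, g in pairs]
    return sum(scores) / len(scores)
```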
      <p>
        With the randomization scheme presented in the methodology, we are able to compute the
confidence interval for the macro-F1 score. Fig. 4 presents the Macro-F1 scores derived from the
50 experiments. As fig. 4 shows, most experiments achieved a Macro-F1 score between 0.91 and
0.94. There were also two experiments whose Macro-F1 scores were below 0.91, highlighting
the issue of inconsistency in the responses provided by language models, which is in line
with existing literature [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The 95% confidence interval ranges from 0.927 to 0.932, with an average Macro-F1 score of
0.929. This indicates that the model performs well and consistently, despite variations in the
training datasets.</p>
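<p>The reported interval follows from the usual normal approximation over the 50 per-experiment macro-F1 scores; a sketch, with synthetic scores for illustration:</p>

```python
import math


def confidence_interval_95(scores):
    """Mean and 95% confidence interval (normal approximation) of the
    per-experiment macro-F1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)  # half-width of the 95% interval
    return mean, mean - half, mean + half
```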
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>As observed in our experiments, the ability of a pre-trained language model to accurately
predict information about a specific entity is limited by the dataset it was trained on. Gradually
optimizing the prompts proved to be a crucial step in improving an LLM’s predictive performance.
Although there were a few exceptions, our model achieved a narrow confidence interval, suggesting
low variation between the responses provided.</p>
      <p>Future research should investigate the entity alignment task between two KGs, which could
drastically improve prediction performance; how an LLM chooses between multiple conflicting
sources of information; the reasoning process behind its answers; and ways to control the
generated answers.
The source files of our system can be accessed through GitHub.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Zhang,</surname>
          </string-name>
          <article-title>Knowledge base construction from pre-trained language models 2022, in: Semantic Web Challenge on Knowledge Base Construction from Pre-trained Language Models, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>K. K. Santu</surname>
          </string-name>
          , D. Feng,
          <article-title>TELeR: A general taxonomy of LLM prompts for benchmarking complex tasks</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
          </string-name>
          et al. (Ed.),
          <source>Findings of the ACL: EMNLP</source>
          <year>2023</year>
          , Singapore,
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>14197</fpage>
          -
          <lpage>14203</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.946.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying Large Language Models and knowledge graphs: A roadmap</article-title>
          ,
          <source>CoRR abs/2306</source>
          .08302 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2306.08302.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. Zhang,</surname>
          </string-name>
          <article-title>LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities</article-title>
          ,
          <source>CoRR abs/2305</source>
          .13168 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2305.13168.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Towards reasoning in large language models: A survey</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          , Toronto, Canada, July 9-
          <issue>14</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>1049</fpage>
          -
          <lpage>1065</lpage>
          . doi:10.18653/v1/2023.findings-acl.67.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Siren's song in the AI ocean: A survey on hallucination in large language models</article-title>
          ,
          <source>CoRR abs/2309</source>
          .01219 (
          <year>2023</year>
          ). doi:10.48550/arXiv.2309.01219.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Iga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Silaghi</surname>
          </string-name>
          ,
          <article-title>Assessing LLMs suitability for knowledge graph completion</article-title>
          ,
          <source>CoRR abs/2405.17249</source>
          (
          <year>2024</year>
          ). doi:10.48550/arXiv.2405.17249.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>AlKhamissi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <article-title>A review on language models as knowledge bases</article-title>
          ,
          <source>CoRR abs/2204.06031</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2204.06031.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Machine knowledge: Creation and curation of comprehensive knowledge bases</article-title>
          ,
          <source>Foundations and Trends in Databases</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>108</fpage>
          -
          <lpage>490</lpage>
          . doi:10.1561/1900000064.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>CoRR abs/1909.01066</source>
          (
          <year>2019</year>
          ). doi:10.48550/arXiv.1909.01066.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <source>CoRR abs/2303.18223</source>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Elazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravfogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravichander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Measuring and improving consistency in pretrained language models</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>1012</fpage>
          -
          <lpage>1031</lpage>
          . doi:10.1162/tacl_a_00410.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>