<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>KGC-RAG: Knowledge Graph Construction from Large Language Model Using Retrieval-Augmented Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thin Prabhong</string-name>
          <email>thin.pra2013@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <email>natt@jaist.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Areerat Trongratsameethong</string-name>
          <email>areerat.t@cmu.ac.th</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chiang Mai University</institution>
          ,
          <addr-line>Chiang Mai</addr-line>
          ,
          <country country="TH">Thailand</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Japan Advanced Institute of Science and Technology</institution>
          ,
          <addr-line>Ishikawa</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The construction of Knowledge Graphs (KGs) has become increasingly important due to their ability to integrate and represent complex relationships across various domains, making them essential for applications like information retrieval and semantic search. Recently, Large Language Models (LLMs) have been utilized to enhance KG creation by leveraging their advanced capabilities in understanding and generating human-like text. The Large Language Models for Knowledge Engineering (LLMKE) pipeline was introduced to combine knowledge probing with Wikidata entity mapping for knowledge engineering. Nevertheless, this approach has a limitation: it primarily relies on retrieval-augmented context drawn from the first paragraph and Wikipedia Infobox of the subject entity's page. This narrow focus can lead to incomplete knowledge representations, as relevant information is often spread throughout the text and linked pages. To address this issue, we propose the Knowledge Graph Construction from Large Language Model using Retrieval-Augmented Generation method (KGC-RAG). This method leverages web scraping to retrieve documents from the subject entity's Wikipedia page and to extend the search to include linked pages, thereby increasing the likelihood of capturing comprehensive and contextually rich information. We further enhance this approach by using LLMs in conjunction with cosine similarity to filter out irrelevant content, ensuring that only the most pertinent data are included in the relevant contexts. We conducted an experiment on datasets from the ISWC 2024 LM-KBC Challenge and applied the meta-llama/Meta-Llama-3-8B-Instruct model as our pre-trained large language model, along with all-MiniLM-L6-v2 as our vector embedding model. We set a relevant score threshold of 0.5 to filter Wikipedia URLs. Our approach achieved macro average F1-scores of 0.695 and 0.698 on the validation and test sets, respectively. 
The implementation is available at https://github.com/jaejeajay/LM-KBC2024.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval-Augmented</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Knowledge graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] store information in a Subject-Predicate-Object format, providing more
efficient semantic data storage than relational databases. They are more easily understood
by computers, offering flexibility and the ability to integrate diverse information, making them
crucial for applications such as question answering and recommendation systems. A notable
example of a significant knowledge graph repository is Wikidata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an open-source platform
by the Wikimedia Foundation for storing factual data about the world.
      </p>
      <p>
        Constructing knowledge graphs (KGs) is highly challenging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] due to the need for access to
vast and diverse information sources. Large Language Models (LLMs) are gaining popularity
for their ability to perform a wide range of tasks, such as answering questions, providing
information, translating languages, and engaging in conversations. Since LLMs are trained
on extensive datasets, they serve as comprehensive repositories of knowledge. Extracting
knowledge from LLMs can yield valuable information for various applications, including the
construction of knowledge graphs, where LLMs can provide foundational data and relationships
needed for building these complex structures.
      </p>
      <p>
        However, challenges in extracting knowledge from LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] include the generation of
fabricated answers and outdated information. Over time, facts can change or be disproven, but
the data used to train LLMs remain factual only as of the time of training. Thus, optimizing
LLMs for accurate and reliable responses is crucial.
      </p>
      <p>
        Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] offers an effective solution to these challenges.
By combining retrieval and generation processes, RAG enhances the accuracy of LLM outputs
by retrieving up-to-date information from external databases or sources, thereby providing the
necessary context for generating more factual and reliable responses.
      </p>
      <p>
        In this study, we explore the construction of knowledge graphs using knowledge extracted
from LLMs. We optimize LLM performance with RAG by improving the quality of relevant
context through web scraping and web crawling. The relevance score is determined by the path
names in the Wikipedia URLs. We utilized datasets from the ISWC 2024 LM-KBC Challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
where the task limited LLM parameters to 10B. We selected the Llama-3-8B-Instruct model for
our study, and the results demonstrate that our approach outperforms the baseline.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Works</title>
      <p>2.1. Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata</p>
      <p>
        In recent research, Zhang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] explored the use of LLMs for knowledge engineering tasks
within the context of the ISWC 2023 LM-KBC Challenge. They utilized pre-trained LLMs
to generate relevant objects in string format from given subject-relation pairs sourced from
Wikidata, subsequently linking these objects to their respective Wikidata QIDs. The developed
pipeline, known as Large Language Models for Knowledge Engineering (LLMKE), combines
knowledge probing with Wikidata entity mapping and incorporates retrieval-augmented context
to enhance predictions. This context is derived from external sources, such as Wikipedia, to
refine the model’s responses by comparing and integrating this information with the model’s
initial predictions.
      </p>
      <p>Nonetheless, LLMKE operates under the assumption that the most relevant documents are
primarily found in the first paragraph (Introduction) and the Wikipedia Infobox, which may
not always be the case, as related information can be dispersed throughout various sections.</p>
      <p>To address this limitation, we propose a method to enhance content extraction from Wikipedia
by employing web scraping techniques to gather information from paragraphs, Infoboxes, and
Wikitables on the Wikipedia page of the subject entity. Also, this approach extends the search to
other linked Wikipedia pages, thereby capturing more comprehensive and relevant information.</p>
      <p>
        One challenge of extending the search beyond the subject entity’s Wikipedia page is the
increased volume of documents, which can slow down the process. To mitigate this, we
implemented a solution using LLM and cosine similarity scores to filter and prioritize relevant
Wikipedia pages and documents, ensuring efficient and effective information retrieval.
      </p>
      <p>2.2. Retrieval-Augmented Generation for Large Language Models</p>
      <p>
        Yunfan Gao et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] reviewed the RAG paradigm, categorizing it into three types: Naïve,
Advanced, and Modular RAG. Naïve RAG involves a simple two-phase process of retrieving
documents and generating responses based on them. Advanced RAG improves this by optimizing
query modification and refining the retrieved context for better LLM performance. Modular
RAG further breaks down the process into independent modules, allowing easier updates but
requiring more resources for development.
      </p>
      <p>
        This study focuses on utilizing the Naïve RAG process to explore its application in constructing
knowledge graphs through knowledge extraction from LLMs, using the ISWC 2024 LM-KBC
Challenge dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The research integrates LLMs into the retrieval stage to aid in filtering
relevant documents from web scraping, rather than limiting their role to the generation phase.
Given that LLMs are trained on vast knowledge, the hypothesis is that incorporating them in
the Retrieval process may outperform using relevance scores alone.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
      <p>The Knowledge Graph Construction from Large Language Model using Retrieval-Augmented
Generation method (KGC-RAG) provides an overview as shown in Figure 1. This method is
composed of two processes: RAG and Entity Mapping. In the RAG process, relevant contexts are
identified and used to enhance the LLM’s knowledge extraction. In the entity mapping process,
the answers from the LLM are used to construct knowledge graphs by mapping through API
functions from Wikidata, resulting in completed knowledge tuples. The details of each process
are as follows:</p>
      <sec id="sec-4-1">
        <title>3.1. Retrieval-Augmented Generation (RAG)</title>
        <p>The overview of the RAG process is presented in Figure 2. The goal of this process is to extract
knowledge from the LLM using RAG as an LLM optimization method. The outcome of this process
will be LLM-generated answers, which will then be used in the subsequent entity mapping process.</p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Entity ID Query</title>
          <p>The entity ID query step involves locating the Wikipedia URLs of the subject entities using the
provided SubjectEntityID, which is essential for the web scraping process. The SubjectEntityID
is used to search for the URLs in Wikidata through the SPARQLWrapper API
(https://sparqlwrapper.readthedocs.io/en/latest/main.html), with a preference for English URLs.
The outcome of this step is the subject entity’s Wikipedia URL, which is used to access the
subject entity’s Wikipedia page, referred to as the main page in
this study. For example, if SubjectEntity is “Uruguay”, SubjectEntityID is “Q77”, and relation is
“countryLandBordersCountry”, after using SubjectEntityID to search via SPARQLWrapper, the
resulting Wikipedia URL for SubjectEntity would be https://en.wikipedia.org/wiki/Uruguay.</p>
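          <p>The lookup above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and while the pipeline issues the query through SPARQLWrapper, only the query construction and the parsing of an already-fetched JSON result are shown here, so no network access is needed.</p>

```python
# Hypothetical sketch of the entity ID query step. build_sitelink_query
# and extract_wikipedia_url are illustrative names; the actual pipeline
# sends the query to the Wikidata endpoint via SPARQLWrapper.

def build_sitelink_query(subject_entity_id):
    """SPARQL query for the English Wikipedia article about a QID."""
    return f"""
    SELECT ?article WHERE {{
      ?article schema:about wd:{subject_entity_id} ;
               schema:isPartOf <https://en.wikipedia.org/> .
    }}
    """

def extract_wikipedia_url(results):
    """Pull the first article URL out of a SPARQL JSON result."""
    bindings = results["results"]["bindings"]
    return bindings[0]["article"]["value"] if bindings else None

# Mocked endpoint response for SubjectEntityID "Q77" (Uruguay):
mock = {"results": {"bindings": [
    {"article": {"value": "https://en.wikipedia.org/wiki/Uruguay"}}]}}
print(extract_wikipedia_url(mock))  # https://en.wikipedia.org/wiki/Uruguay
```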
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Web Scraping</title>
          <p>In the web scraping step, the focus is on scraping data from the Wikipedia pages. The data targeted
for scraping includes three tags: p (Paragraph), Infobox, and Wikitable. The outputs of this step
are the scraped data from Wikipedia, which will later be assessed for relevance in the relevant
identification step. The prompt templates for web scraping (Table 1) are:</p>
          <p>awardWonBy: “Who has won ” + SubjectEntity + “ ?”
companyTradesAtStockExchange: “Which stock exchange does ” + SubjectEntity + “ trade on ?”
countryLandBordersCountry: “Which country share land border with ” + SubjectEntity + “ ?”
personHasCityOfDeath: “What is the city of death of ” + SubjectEntity + “ ?”
seriesHasNumberOfEpisodes: “How many episodes does series ” + SubjectEntity + “ has ?”</p>
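          <p>The paragraph extraction can be sketched with the standard library alone. This is a simplified stand-in, not the authors' code: a real scraper would likely use a dedicated HTML library, and Infobox and Wikitable extraction would follow the same pattern by matching class attributes.</p>

```python
# Minimal sketch of scraping <p> tag text from a Wikipedia page's HTML
# using only the standard library. ParagraphScraper is an illustrative
# name; it collects the visible text of each top-level <p> element.
from html.parser import HTMLParser

class ParagraphScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []   # collected <p> texts
        self._depth = 0        # >0 while inside a <p> tag
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buf).strip()
                if text:
                    self.paragraphs.append(text)
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

scraper = ParagraphScraper()
scraper.feed("<html><p>Uruguay is a country in South America.</p></html>")
print(scraper.paragraphs)  # ['Uruguay is a country in South America.']
```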
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.3. Web Crawling</title>
          <p>The web crawling step is designed to broaden the scope of web scraping by extending the
search from the main page to other linked Wikipedia pages, thereby increasing the chances
of identifying relevant context. This is achieved by identifying &lt;a&gt; tags within the main
page. The result of this step is a collection of Wikipedia URLs linked from the main page.
For example, from the Wikipedia page of the SubjectEntity “Uruguay” at https://en.wikipedia.
org/wiki/Uruguay, examples of linked Wikipedia pages include https://en.wikipedia.org/wiki/
Economy_of_Uruguay, https://en.wikipedia.org/wiki/Religion_in_Uruguay, and so on.</p>
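          <p>Collecting the &lt;a&gt; tags can be sketched as below. The class name and the filtering rule (keeping only /wiki/ article paths and skipping namespaced pages) are illustrative assumptions, not the paper's exact crawler.</p>

```python
# Sketch of the web crawling step: gather internal Wikipedia article
# links from a page's HTML. WikiLinkCollector is an illustrative name.
from html.parser import HTMLParser

class WikiLinkCollector(HTMLParser):
    def __init__(self, base="https://en.wikipedia.org"):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep only article links; skip special pages like "File:" or "Help:".
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append(self.base + href)

collector = WikiLinkCollector()
collector.feed('<a href="/wiki/Economy_of_Uruguay">Economy</a>'
               '<a href="/wiki/File:Flag.svg">flag</a>')
print(collector.links)  # ['https://en.wikipedia.org/wiki/Economy_of_Uruguay']
```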
        </sec>
        <sec id="sec-4-1-4">
          <title>3.1.4. Relevant Scoring</title>
          <p>The primary aim of the relevant scoring step is to filter out unnecessary linked URLs from the web
crawling process by using a Relevant score, calculated via cosine similarity. This approach helps
streamline the RAG process. The filtering method involves calculating the cosine similarity
between the path name of each URL linked to the main page and the Prompt for web scraping,
utilizing the vector model all-MiniLM-L6-v2. The templates of the prompt for web scraping
are shown in Table 1. For instance, from the URL https://en.wikipedia.org/wiki/Economy_of_
Uruguay, only the part “Economy of Uruguay” is used to calculate the Relevant Score. URLs
with a cosine similarity score above the threshold are selected for web scraping. The outcome
of this step is a refined list of linked URLs for web scraping.</p>
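          <p>The threshold filter can be sketched as follows. In the pipeline the vectors come from the all-MiniLM-L6-v2 embedding model; here small hand-made vectors stand in for real embeddings so that only the filtering logic is shown, and the function names are illustrative.</p>

```python
import math

# Sketch of the relevant scoring step: cosine similarity between each
# URL path-name embedding and the web scraping prompt embedding, with a
# 0.5 threshold. The vectors below are toy stand-ins for model output.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_urls(url_vectors, prompt_vector, threshold=0.5):
    """Keep URLs whose path-name embedding is similar to the prompt."""
    return [url for url, vec in url_vectors.items()
            if cosine_similarity(vec, prompt_vector) >= threshold]

# Hypothetical embeddings for two linked pages and the prompt:
urls = {
    "https://en.wikipedia.org/wiki/Economy_of_Uruguay": [0.9, 0.1, 0.2],
    "https://en.wikipedia.org/wiki/Religion_in_Uruguay": [0.1, 0.9, 0.1],
}
prompt = [0.8, 0.2, 0.3]
print(filter_urls(urls, prompt))
# ['https://en.wikipedia.org/wiki/Economy_of_Uruguay']
```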
        </sec>
        <sec id="sec-4-1-5">
          <title>3.1.5. Relevant Identification</title>
          <p>The relevant identification step focuses on filtering scraped data from Wikipedia by leveraging
LLM responses to specific questions. The result is a list of relevant documents, which will be
used in the subsequent step for determining the relevant score.</p>
          <p>We begin by filtering each paragraph, Infobox, and Wikitable data using the
Meta-Llama3-8B-Instruct Model through question-based evaluation. The format of question is: Is this
information “[Paragraph/Infobox/Wikitable]” able to answer the question: “[Prompt for web
scraping]”? If the LLM determines that the information can answer the question, it returns a
response of 1; otherwise, it returns 0. Items for which the LLM returns 1 are added to
the list of relevant documents.</p>
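          <p>The question-based filter can be sketched as below. The helper names are hypothetical, and a stub stands in for the Meta-Llama-3-8B-Instruct call (which in the pipeline runs with max_new_tokens=1 so it can only emit 0 or 1).</p>

```python
# Sketch of the relevant identification step with a stub LLM.

def build_filter_question(snippet, scraping_prompt):
    """Question format from the paper, filled with a data snippet."""
    return (f'Is this information "{snippet}" able to answer the '
            f'question: "{scraping_prompt}"?')

def identify_relevant(snippets, scraping_prompt, llm):
    """Keep only snippets for which the LLM answers 1."""
    return [s for s in snippets
            if llm(build_filter_question(s, scraping_prompt)).strip() == "1"]

# Stub LLM: pretends any question mentioning "argentina" is answerable.
def stub_llm(question):
    return "1" if "argentina" in question.lower() else "0"

docs = ["Uruguay borders Argentina and Brazil.",
        "Uruguay's flag has nine stripes."]
print(identify_relevant(
    docs, "Which country share land border with Uruguay ?", stub_llm))
# ['Uruguay borders Argentina and Brazil.']
```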
        </sec>
        <sec id="sec-4-1-6">
          <title>3.1.6. Relevant Score Ranking</title>
          <p>In relevant score ranking step, the goal is to retrieve the relevant context from the list of
relevant documents by applying the same method used in the relevant scoring step, combined
with relevant score ranking. Each document is compared to the web scraping prompt by
calculating cosine similarity using the vector model all-MiniLM-L6-v2, as in the relevant scoring
step, to determine its relevant score. The documents are then ranked from highest to lowest,
and the Top-k documents are selected to form the relevant context, which will be crucial for
knowledge extraction. The key difference between this step and the relevant scoring step lies
in their objectives. In relevant scoring, the focus is on identifying relevant linked URLs using a
threshold-based filtering technique, while this step selects relevant documents to create the
relevant context through a Top-k filtering approach. The Top-k method is not used in the
relevant scoring step to avoid restricting the number of links, thereby allowing broader web
scraping and increasing the chances of finding relevant documents.</p>
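          <p>The ranking can be sketched as follows; as before, toy vectors stand in for all-MiniLM-L6-v2 embeddings, the function names are illustrative, and k is set to 2 for brevity (the experiments use k = 20).</p>

```python
import math

# Sketch of the relevant score ranking step: score each document against
# the web scraping prompt and join the top-k into a single context.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_context(doc_vectors, prompt_vector, k=20):
    """Rank documents by similarity and join the best k into one context."""
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine_similarity(doc_vectors[d], prompt_vector),
                    reverse=True)
    return "\n".join(ranked[:k])

docs = {
    "Uruguay borders Argentina and Brazil.": [0.9, 0.1],
    "Uruguay's capital is Montevideo.":      [0.4, 0.6],
    "The flag has nine stripes.":            [0.1, 0.9],
}
prompt = [1.0, 0.0]
print(top_k_context(docs, prompt, k=2))
```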
        </sec>
        <sec id="sec-4-1-7">
          <title>3.1.7. Knowledge Extraction</title>
          <p>The purpose of the knowledge extraction step is to extract knowledge from the LLM by using the relevant
context. The result of this step will be LLM-generated answers that are ready for entity mapping.
For example, for the question “Which countries share land borders with Uruguay with country
name only with comma? If None, answer None.” the answer generated by the LLM is “Argentina,
Brazil”.</p>
          <p>To ask questions using the prompt for knowledge extraction, as shown in Table 2, and the
obtained relevant context, the input message format consists of three parts:
1. Relevant Context: The first message provides the relevant context to the LLM in the
following format:
{“role”: “system”, “content”: “Using this context to answer the question: ” +
[Relevant Context]}.
2. Behavior Setting: The second message sets the behavior of the LLM chatbot to ensure
responses are in the required format:
{“role”: “system”, “content”: “You are a chatbot who always responds an answer
in english with comma and no explanation. If u don’t know the answer, answer
None”}.
3. Question Prompt: The third message is used to ask the question and is formatted as:
{“role”: “user”, “content”: [Prompt for Knowledge Extraction]}.</p>
          <p>This structured approach ensures that the LLM has the necessary context and guidance to
provide accurate and concise answers.</p>
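          <p>The three-part input can be sketched as a list of role/content messages, as used by common chat-completion APIs; the builder function name is illustrative, and the behavior-setting string is kept verbatim from the paper.</p>

```python
# Sketch of the three-part chat input used for knowledge extraction.

def build_messages(relevant_context, question):
    return [
        {"role": "system",
         "content": "Using this context to answer the question: " + relevant_context},
        {"role": "system",
         "content": ("You are a chatbot who always responds an answer in "
                     "english with comma and no explanation. If u don't know "
                     "the answer, answer None")},
        {"role": "user", "content": question},
    ]

msgs = build_messages(
    "Uruguay borders Argentina and Brazil.",
    "Which countries share land borders with Uruguay with country name "
    "only with comma? If None, answer None.")
print([m["role"] for m in msgs])  # ['system', 'system', 'user']
```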
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Entity Mapping</title>
        <p>The goal of the entity mapping process is to construct a knowledge graph using the answers from the
LLM obtained in the previous process. The result of this process will be completed knowledge
tuples based on the given subject entity and relation. For example, after receiving the answer
from the question “Which countries share land borders with Uruguay with country name only with
comma? If None, answer None.” from the LLM as “Argentina, Brazil”, we map the answer using
wbsearchentities. The final result is a completed tuple: {“SubjectEntity”: “Uruguay”, “Relation”:
“countryLandBordersCountry”, “ObjectEntitiesID”: [“Q414”, “Q155”]}.</p>
        <p>We begin by mapping the answers to create a knowledge graph using wbsearchentities, an
API function provided by the Wikidata Query Service 2. This function searches for entities in
Wikidata using labels or keywords and returns a list of matching entities ranked by relevance.
The method used to select entities for linking as objects is “Choose First” for all relations. After
selecting an entity, its validity is verified by checking the range of each relation. As shown in
Table 3, if the entity’s “instance of” is within the same range of relations as the subject, the
entity is selected for mapping.</p>
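        <p>The lookup and the "Choose First" selection can be sketched as follows. wbsearchentities is a real Wikidata API action, but the helper names are illustrative and only the request-parameter construction and the selection over an already-fetched response are shown, so no network call is made; the range check against Table 3 is omitted.</p>

```python
# Sketch of the entity mapping step against the Wikidata
# wbsearchentities action (request building + "Choose First" only).
from urllib.parse import urlencode

def wbsearchentities_url(label):
    """Build the API request URL for an entity-label search."""
    params = {"action": "wbsearchentities", "search": label,
              "language": "en", "format": "json"}
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

def choose_first(response):
    """'Choose First' strategy: take the top-ranked matching entity."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None

# Mocked response for the label "Argentina":
mock = {"search": [{"id": "Q414", "label": "Argentina"},
                   {"id": "Q4580", "label": "Argentina (disambiguation)"}]}
print(choose_first(mock))  # Q414
```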
        <p>In cases where responses from the LLM exhibit ambiguity, it may be due to the LLM relying
too heavily on context. For instance, in the relation companyTradesAtStockExchange, stock
exchange information for a company on Wikipedia is often presented in the format “Traded as”,
such as “West Japan Railway Company traded as TYO: 9021”, where TYO is the stock exchange
name and 9021 is the stock exchange code. The LLM might provide an answer like “TYO: 9021”.</p>
        <p>2: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service</p>
        <p>To address this ambiguity, we employ hard-coded rules to filter and refine the responses from the
LLM, thereby improving the accuracy and clarity of the output.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>4.1. Datasets</title>
        <p>
          The datasets used in this study were obtained from ISWC 2024 LM-KBC Challenge [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Details
of the datasets are shown in Table 4. Each dataset consists of 378 unique subject-entities and 5
relations, with object entities referenced from Wikidata. Additionally, each relation has special
features that describe the characteristics of the object entities, which vary by relation. For
example, the special feature of the relation countryLandBordersCountry is the possibility of Null
values. For instance, New Zealand, being an island nation with no land borders, would have a
Null value for the object entity corresponding to the given subject entity “New Zealand” and
the given relation countryLandBordersCountry.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Experiment Settings</title>
        <p>The LLM used in this study is Llama-3-8B-Instruct. Its parameter count also falls within the
required limit of the task. Wikipedia was the primary data source, and the vector model chosen
for cosine similarity calculations was all-MiniLM-L6-v2. This model was selected for its compact
size, speed, and reasonable performance, particularly given the resource-intensive nature of
web scraping and crawling, which may involve processing numerous linked URLs and a large
volume of documents. This model helps manage that challenge efficiently. The relevant score threshold
for filtering Wikipedia URLs is set at 0.5 to select URLs similar to the web scraping prompt,
which functions as a user query. A cosine similarity of 0.5 reflects a moderate level of similarity,
balancing relevance and coverage to avoid overly filtering out pertinent URLs. Each document
was divided into segments of 4,500 tokens, and a Top-k approach, with k = 20, was applied to
combine documents into a single context within the relevant score ranking pipeline. In the
relevant identification step, we used max_new_tokens=1, temperature=0.1, and top_p=0.9.
max_new_tokens=1 was chosen because this step of the pipeline only returns a 0 or 1 for
filtering. In the knowledge extraction step, we set max_new_tokens=3000, temperature=0.1,
and top_p=0.9, with max_new_tokens=3000 being an estimated value for the maximum length
of the LLM’s responses. The choice of k = 20, temperature=0.1, and top_p=0.9 was
based on preliminary experiments, which demonstrated better results compared to not setting
these values. Moreover, the values of temperature=0.1 and top_p=0.9 are suitable for tasks
requiring logical consistency.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Baseline</title>
        <p>Our study used the baseline of meta-llama/Meta-Llama-3-8B-Instruct from the ISWC 2024
LM-KBC Challenge. The baseline was derived from the use of a Masked Language Model, an
Autoregressive Language Model, and Llama-3 chat models.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Evaluation Metrics</title>
        <p>In our study, we used three evaluation metrics: Macro-Precision (Macro-P), Macro-Recall
(Macro-R), and Macro-F1. Macro-P is the average precision score across all classes, calculated
by determining the precision for each class and averaging the values. Macro-R represents the
average recall score across all classes, computed similarly by averaging the recall for each class.
Finally, Macro-F1 is the average F1 score across all classes, derived by calculating the F1 score
for each class and averaging the results.</p>
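        <p>These metrics can be sketched over predicted and gold object-entity sets as follows. The handling of empty sets (scoring 1.0 when both prediction and gold are empty) is an assumption for illustration, not necessarily the challenge's exact convention.</p>

```python
# Sketch of macro-averaged precision, recall, and F1 over
# (predicted set, gold set) pairs, one pair per instance.

def prf(pred, gold):
    """Precision, recall, F1 for one instance of predicted/gold sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else (1.0 if not gold else 0.0)
    r = tp / len(gold) if gold else 1.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_scores(pairs):
    """Average P, R, F1 over all instances."""
    scores = [prf(p, g) for p, g in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

pairs = [({"Q414", "Q155"}, {"Q414", "Q155"}),   # perfect prediction
         ({"Q414"}, {"Q414", "Q155"})]           # one missed object
print(macro_scores(pairs))
```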
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Results</title>
        <p>We conducted experiments on our system using the validation set, with the results presented in
Table 5. The experimental results demonstrated that our approach outperformed the baseline,
achieving an average F1-score of 0.695. For the test set, we submitted the results to the ISWC
2024 LM-KBC Challenge. The test results, shown in Table 6, indicate an average F1-score of
0.698, which is consistent with the performance on the validation set.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <sec id="sec-6-1">
        <title>5.1. Data Source Quality</title>
        <p>
          The source of context information plays a critical role in determining the coverage score of
the context [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. If a data source provides limited information, the coverage score will be lower,
which subsequently affects the responses generated by the LLM [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The ISWC 2024 LM-KBC
Challenge this year features five distinctive relations, most notably the awardWonBy
and companyTradesAtStockExchange relations. For the awardWonBy relation, some subjects on
Wikipedia do not display the award winners on the main award page, but instead on a separate
“List of laureates” page. Similarly, for the companyTradesAtStockExchange relation, information
about stock exchanges may not be directly provided or may be absent from the Wikipedia page.
Therefore, utilizing effective data sources and retrieval methods is crucial.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Knowledge Discrepancy</title>
        <p>While performing the task, we observed that although the responses generated by the LLM were
consistent with the context derived from Wikipedia, they did not align with the information
referenced in Wikidata. This discrepancy affected the system’s performance. For example, in the
training data for the subject: Love &amp; Hip Hop: New York, relation: seriesHasNumberOfEpisodes,
the LLM predicted that the series had 143 episodes, based on Wikipedia. However, Wikidata
indicated that the series had 82 episodes, which was outdated. Although both Wikipedia and
Wikidata are open-source platforms where users can update information, the discrepancy in
information between them still exists. Helping to update information or reporting issues to the
community could help reduce this discrepancy.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Relevant Context</title>
        <p>
          High-quality context can improve the performance of question-answering in LLMs [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We
employed LLMs and relevant scores to filter the quality of documents obtained from web
scraping and web crawling. After consolidating documents into a single context, we evaluate
the quality of the relevant context using the coverage score. The coverage score indicates how
well the relevant context contains substrings that are object entities from the given subject
entity and relation. It is calculated by dividing the number of object entity substrings found
in the relevant context by the total number of object entities for that given subject entity and
relation. The coverage scores for each relation on the validation set are shown in Table 7.
        </p>
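        <p>The coverage score described above can be sketched as a simple substring check; the function name is illustrative, and returning 1.0 when there are no gold object entities is an assumption for the edge case.</p>

```python
# Sketch of the coverage score: the fraction of gold object-entity
# surface forms that appear as substrings of the relevant context.

def coverage_score(context, object_entities):
    if not object_entities:
        return 1.0  # assumed convention for Null-object relations
    found = sum(1 for obj in object_entities if obj in context)
    return found / len(object_entities)

context = "Uruguay shares land borders with Argentina and Brazil."
print(coverage_score(context, ["Argentina", "Brazil", "Chile"]))
# 0.6666666666666666
```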
        <p>
          It was observed that for certain relations, particularly awardWonBy, the coverage score was
not high. This may be due to the fact that Wikipedia often presents award winners in tables
that include additional information. Although the award winner’s name is present, the presence
of other noise in the data can impact the relevant score [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. When comparing the coverage
score with the macro F1 score for each relation in Table 5, it was found that the macro F1 score
is generally lower or similar. This implies that increasing the coverage score of the context
could potentially enhance the quality of responses generated by the LLM.
        </p>
        <p>Additionally, comparing the difference in coverage score between using web crawling and
not using it reveals that web crawling generally improves the coverage score for most relations.
However, in some cases, it does not. This may be due to the presence of noise in the documents
obtained from web scraping. Expanding the scope of web scraping increases the likelihood
of encountering noisy data, which can affect the cosine similarity score. As a result, some
documents with higher cosine similarity scores may be selected over those that actually contain
the correct answers to the query.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>
        This study aims to construct KGs from LLMs by using RAG to optimize knowledge extraction
and by involving the LLM in finding relevant documents, using the ISWC
2024 LM-KBC Challenge datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We achieved an F1 score of 0.695 for the validation set
and 0.698 for the test set. The results demonstrate that the quality of context is crucial for
optimizing LLM performance in answering questions. Additionally, the LLM can effectively
assist in screening relevant documents, which is a key factor in constructing an accurate and
high-quality knowledge graph. For future investigations, it is recommended to explore the
implementation of automatic relevant document retrieval instead of relying solely on
question answering combined with relevant scores.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>This work was supported by JSPS Grant-in-Aid for Early-Career Scientists (Grant Number
24K20834).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          ,
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A free collaborative knowledge base</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>van Hirtum</surname>
          </string-name>
          ,
          <article-title>Constructing knowledge graphs from text: A survey of methods and tools</article-title>
          ,
          <source>Journal of Computer Science and Technology</source>
          <volume>35</volume>
          (
          <year>2020</year>
          )
          <fpage>1001</fpage>
          -
          <lpage>1020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Binns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Veitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shadbolt</surname>
          </string-name>
          ,
          <article-title>Evaluating the reliability of large language models for knowledge extraction</article-title>
          ,
          <source>in: Proceedings of the ACM Conference on Information and Knowledge Management (CIKM)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Knowledge base construction from pre-trained language models 2022</article-title>
          ,
          <source>in: Semantic Web Challenge on Knowledge Base Construction from Pre-trained Language Models, CEUR-WS</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reklos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          ,
          <article-title>Using large language models for knowledge engineering (LLMKE): A case study on Wikidata</article-title>
          ,
          <source>in: Proceedings of the ISWC 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Enhancing context coverage for question answering with multiple knowledge sources</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2340</fpage>
          -
          <lpage>2350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskenazi</surname>
          </string-name>
          ,
          <article-title>Understanding and mitigating the impact of data source limitations on llm performance</article-title>
          ,
          <source>in: Proceedings of NeurIPS</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Redfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Edward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          , et al.,
          <article-title>Natural questions: a benchmark for question answering research</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>453</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. M.</surname>
          </string-name>
          ,
          <article-title>Analyzing the effects of noise on neural network performance in natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2892</fpage>
          -
          <lpage>2900</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>