LLM Store: Leveraging Large Language Models as Sources of Wikidata-Structured Knowledge

Marcelo Machado, João M. B. Rodrigues, Guilherme Lima, Sandro Rama Fiorini and Viviane T. da Silva
IBM Research Brazil, Rio de Janeiro, Brazil

Abstract

The Knowledge Integration Framework (KIF) is a Wikidata-based framework for integrating heterogeneous knowledge sources. These can be SPARQL endpoints, SQL endpoints, RDF files, CSV files, etc., and are represented in KIF as knowledge "stores". A KIF store exposes a Wikidata view of the underlying knowledge source by interpreting its content as a set of Wikidata-like statements and allowing it to be queried through a simple but expressive pattern-matching interface. In this paper, we present LLM Store, a KIF store implementation that uses large language models (LLMs) as knowledge sources. Instead of consulting a static knowledge base, when queried, LLM Store uses the underlying LLM to synthesize Wikidata-like statements on-the-fly. The knowledge completion pipeline used by LLM Store can be fully customized and supports strategies that range from simple zero-shot prompts to retrieval-augmented generation (RAG). This paper discusses the design and implementation of LLM Store and presents an evaluation using the test and validation datasets of the LM-KBC Challenge @ ISWC 2024. We analyze the results of the evaluation in light of the results obtained by our submission to the same challenge, which was based on LLM Store and achieved a macro-averaged F1-score of 91%. LLM Store is released as open source and its code is available at https://github.com/IBM/kif-llm-store.

1. Introduction

Knowledge bases are at the core of many AI applications. Two key problems associated with these bases, especially when they are large and general purpose, are accuracy and completeness [1, 2]. It is notoriously hard and expensive to maintain accurate, up-to-date information covering a wide range of topics [3]. (A study from 2018 [4] estimates the cost per triple at about $2 for manually curated triples and $0.01 for automatically extracted ones.) Consider Wikidata [5], one of the largest publicly available knowledge bases. It has been maintained for over 10 years by a small army of volunteers and bots and currently has about 112 million entities and 1.5 billion statements. Despite all this effort and scale, it remains relatively easy to find Wikidata entity pages that are incomplete or contain misleading or false information (sometimes due to vandalism [6]).

This is why pre-trained (large) language models (LMs or, from now on, LLMs) are such a promising technology for unsupervised knowledge base construction/completion (KBC) [7]. Their appeal lies in their potential to offer a much faster and, ideally, more cost-effective alternative to both manual curation and supervised methods [8]. Also, a particularly interesting feature of LLMs is that they can operate in real-time.
That is, in the context of KBC, LLMs can be used to generate statements at the time the knowledge base is queried. This can be done either through zero- or few-shot probing or via retrieval-augmented generation (RAG) [9], by extracting the statements from some textual context (also obtained or provided in real-time).

In this paper, we propose an approach that leverages this flexibility and real-time operation capacity of LLMs to tackle the problem of Wikidata-based KBC. Our approach is implemented in LLM Store (https://github.com/IBM/kif-llm-store), a software component that uses LLMs to generate Wikidata-like statements from user queries on-the-fly. LLM Store itself is implemented as a plugin within KIF (https://github.com/IBM/kif), a Wikidata-based framework for integrating heterogeneous knowledge sources [10]. (KIF stands for Knowledge Integration Framework, which is completely different from the Knowledge Interchange Format.) Both LLM Store and KIF are written in Python and released as open source.

The name "LLM Store" comes from KIF's store abstraction. In KIF, a store is an interface to a Wikidata view of a knowledge source. This can be a SPARQL endpoint, SQL endpoint, RDF file, CSV file, or, in the case of LLM Store, a pre-trained LLM possibly associated with some textual context. The basic store operation is the filter call. It takes a pattern of the form Filter(𝑠, 𝑝, 𝑣), where 𝑠 is the subject, 𝑝 is the property, and 𝑣 is the value, and returns all statements with matching 𝑠, 𝑝, and 𝑣 in the underlying knowledge source. Importantly, in KIF the patterns consumed and the statements produced by all stores must follow the syntax of Wikidata [11]. In the case of LLM Store, when a filter pattern using Wikidata-like entities and values is given, the store must query the underlying LLM in real-time to generate matching Wikidata-like statements.

The filter pattern evaluation pipeline used by LLM Store consists of three steps: (context-assisted) knowledge extraction, entity resolution, and statement construction. In the knowledge extraction step, the input pattern is converted into a prompt, potentially augmented with context information, and sent to the LLM. The model's response is parsed and then, in the entity resolution step, named entities occurring in it are resolved against the target knowledge base (usually, but not necessarily, Wikidata). Finally, in the statement construction step, the resolved entities are used to instantiate the input pattern and construct corresponding statements following Wikidata's syntax.

To evaluate LLM Store's approach, we compare different methods for the implementation of each step of the pipeline above. We then analyze the results of these experiments in light of the results obtained by the system we submitted to the LM-KBC Challenge @ ISWC 2024 [12], which is based on LLM Store. This system [13] achieved a macro-averaged F1-score of 91% on the test dataset of the challenge using the Llama3-8B-Instruct model (https://ai.meta.com/blog/meta-llama-3/).

The rest of the paper is organized as follows. Section 2 gives a concise introduction to KIF and discusses relevant related work. Section 3 presents the filter evaluation pipeline used by LLM Store in detail and discusses the various customization options supported by the plugin. Section 4 presents and discusses the results of an experimental evaluation of LLM Store, including the results obtained in the LM-KBC Challenge @ ISWC 2024. Finally, Section 5 presents our conclusions and future work.
2. Background

KIF [10] is a knowledge integration framework based on Wikidata. It uses Wikidata to standardize the syntax and possibly the vocabulary of the underlying knowledge sources. Users can then query the sources through a pattern language described in terms of the Wikidata data model. The integration done by KIF is virtual, i.e., the syntax and vocabulary translations happen at query time. Before detailing KIF itself, we briefly recap the Wikidata data model.

2.1. Wikidata Data Model

The Wikidata data model [11] consists of entities and statements about entities. Figure 1 shows the page of entity Q2270, which stands for benzene (the chemical compound). Every entity has a label, a description, and zero or more aliases. These are shown at the header of the page.

Figure 1: Part of Wikidata's entity page of benzene. (Adapted from [14].)

Below the header comes the "Statements" section, which groups the statements about the entity being described. A statement consists of two parts: subject and snak. The subject is the entity about which the statement is made. The snak is the statement's claim. It associates a property with either a specific value (value snak), some unspecified value (some-value snak), or no value (no-value snak). In this paper, we are mainly concerned with value snaks. So, whenever we speak of a statement we mean one which can be decomposed into subject, property, and value. (Note that KIF supports all three kinds of snaks.) Figure 1 depicts two statements which can be read as follows:

1. "benzene has an LD50 of 4,699–4,701 milligrams per kilogram"
2. "benzene has an LD50 of 87–89 milligrams per kilogram"

LD50 (or median lethal dose) is a toxicity measure: the dose of a substance that is required to kill half of the members of the tested population. Here the subject of both statements (1) and (2) is the same, "benzene" (Q2270). Their snak is a value snak (of the form property–value). The property of both is "median lethal dose (LD50)" (P2240). The value of (1) is the structured quantity "4,700 ± 1 mg/kg", and the value of (2) is "88 ± 1 mg/kg".

The data model of Wikidata also supports the notions of qualifiers and references associated with statements. Qualifiers are extra snaks that qualify the statement. In Figure 1, the qualifiers just below statement (1), highlighted in blue, indicate that the statement holds when the route of administration is "oral" and the taxon is "laboratory mouse". References are sets of snaks that keep provenance information. Figure 1 shows two references for statement (1), highlighted in red. The first one indicates that the statement was obtained from PubChem [15] (a popular base of chemical information) on April 12, 2024, and carries additional snaks that identify the subject (benzene) in PubChem. The second reference contains a single snak pointing to an external page of the CDC (a public health agency in the US).

In KIF, qualifiers and references are treated as annotations and are manipulated through specific methods of the store API. One important use of references is to distinguish between statements produced by different stores. This is done by instructing KIF to attach a specific (unseen) reference to all statements produced by a given store. Using this technique, applications can distinguish statements produced by LLM Store from statements produced by more authoritative sources, like PubChem or Wikidata.
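To illustrate, here is a minimal sketch of how such a distinguishing reference might be attached. The store option extra_references and the property-call syntax for building snaks are our assumptions here, not an API confirmed by the text; see the KIF documentation for the actual mechanism:

    from kif_lib import ReferenceRecord, Store
    from kif_lib.vocabulary import wd

    kb = Store('sparql', 'https://query.wikidata.org/sparql')

    # Attach a fixed provenance reference to every statement this store
    # produces.  P248 is "stated in"; Q2013 is the Wikidata item for
    # Wikidata itself.  (`extra_references` is an assumed option name.)
    kb.extra_references = [ReferenceRecord(wd.P(248)(wd.Q(2013)))]

With such a reference in place, downstream code can inspect a statement's annotations to tell which store produced it.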
2.2. KIF Stores and Filters

The core abstraction of KIF is the store. A store is an interface to a Wikidata view of a knowledge source. The prototypical store is the SPARQL store, which exposes a Wikidata view of a SPARQL endpoint. Here is how we create a SPARQL store pointing to WDQS, the public SPARQL query service of Wikidata:

    1 from kif_lib import Store
    2 kb = Store('sparql', 'https://query.wikidata.org/sparql')

At line 1, we import from the namespace of KIF the Store constructor. At line 2, we use it to create a new SPARQL store kb pointing to WDQS. Not incidentally, because WDQS adopts the Wikidata encoding of RDF, it can be queried directly by the SPARQL store. If that were not the case, for instance, if we wanted to point the SPARQL store to PubChem's query interface (which is not Wikidata-compatible), at line 2 we would have to provide a mapping object to be used to translate the queries and results to the syntax of Wikidata. (See [10] for details.)

With store kb created, we can now read statements from it as follows:

    4 from kif_lib.vocabulary import wd
    5 it = kb.filter(subject=wd.benzene, property=wd.median_lethal_dose)
    6 print(next(it))
    7 # Statement(Item(IRI('...Q2270')), ValueSnak(Property(IRI('...P2240'), QuantityDatatype()), Quantity(4700, Item(IRI('...Q21091747')), 4699, 4701)))

The kb.filter() call on line 5 searches in kb for statements with subject "benzene" and property "median lethal dose". More specifically, when kb.filter() is called, its arguments are used to create a Filter(𝑠, 𝑝, 𝑣) pattern, which is compiled by the store into a SPARQL query. This query is then evaluated over the target endpoint and the results are used to construct a (lazy) iterator it with the matching statements. At line 6, we print the first statement in it. The result is shown on line 7 and corresponds to statement (1) of Figure 1.

Note that we used the vocabulary module of KIF (line 4) to avoid writing the full IRIs of Wikidata entities. That is, instead of wd.benzene, we could have written wd.Q(2270) or even Item('http://www.wikidata.org/entity/Q2270'). The same applies to the property wd.median_lethal_dose. That said, whenever possible we will use symbolic names in wd instead of numeric ids or IRIs.

Fingerprints (indirect ids). In the previous example, we used direct ids (wd.benzene and wd.median_lethal_dose) to identify the subject and property of the desired statements. Sometimes, however, we might need to specify an entity indirectly by giving not its id but a property it satisfies. In KIF, this can be done through a fingerprint (indirect id):

    9  it = kb.filter(
    10     subject=wd.official_language(wd.Portuguese) & wd.continent(wd.South_America),
    11     value=wd.Argentina)
    12 print(next(it))
    13 # Statement(Item(IRI('...Q155')), ValueSnak(Property(IRI('...P47'), ItemDatatype()), Item(IRI('...Q414'))))

This filter searches for statements such that: the subject's official language is Portuguese and its continent is South America (line 10); the property is anything (unrestricted); and the value is Argentina (line 11). The result, shown in line 13, is the statement "Brazil (Q155) shares border with (P47) Argentina (Q414)". Notice that Brazil matches the desired subject fingerprint.

As we will see later, in LLM Store fingerprints are used to constrain the universe of predicted answers to certain kinds of entities or values. For instance, we can set the value parameter of filter() to the fingerprint wd.instance_of(wd.human) to restrict LLM Store's answers to statements whose value is a person.
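For example, the following sketch restricts the values of an awardWonBy-style query to humans. It works against any store kb, including an LLM Store; Q38104 (assumed here to be the item for the Nobel Prize in Physics) and the symbolic name wd.human (Q5) are our illustrative assumptions, while P1346 ("winner") appears in Table 1 later in the paper:

    # Who won the Nobel Prize in Physics?  The value fingerprint filters
    # out organizations and other non-human entities.
    it = kb.filter(
        subject=wd.Q(38104),              # Nobel Prize in Physics (assumed id)
        property=wd.P(1346),              # winner
        value=wd.instance_of(wd.human))   # value must be an instance of human
    for stmt in it:
        print(stmt)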
2.3. Related Work

In 2019, Petroni et al. [16] proposed the LAMA dataset, a seminal benchmark designed to evaluate the ability of LLMs to retrieve relational facts using cloze-style prompts. This work spurred numerous follow-ups, including [17, 18, 19, 20, 21]. Among these, the work of Zhang et al. [18], which introduced LLMKE, bears the closest resemblance to our proposal.

LLMKE was the winning system of track 2 of the LM-KBC Challenge @ ISWC 2023 [22]. The task was to predict the value part (𝑣) of incomplete triples of the form (𝑠, 𝑝, 𝑣), where the subject 𝑠 is a Wikidata entity and the property 𝑝 is an abstract property, such as bandHasMember, which may or may not correspond to a single property in Wikidata. Depending on the relation, the predicted value 𝑣 might be a Wikidata entity or a data value (quantity, string, etc.).

The LLMKE system operates in two main stages: probing and disambiguation. During the probing stage, the system generates prompts for each input 𝑠-𝑝 pair, guiding the LLM to produce the label of the Wikidata entity or data value that completes the relation. The disambiguation stage then maps entity labels to the ids of the correct Wikidata entities. For the challenge submission, LLMKE used rules tailored to each of the 21 abstract relations. These rules customized the prompts and the textual context used during the probing stage. In some cases, rules were also used during the disambiguation process to map specific keywords and labels to the corresponding data values and entities in Wikidata.

Although the design of LLM Store was inspired by LLMKE, in particular by the approach described in [18], it differs in many aspects:

1. LLM Store is not a standalone system designed for one specific purpose. Instead, it is a component of a larger framework (KIF) and as such can be combined with other components (stores) and reused by different applications.

2. LLM Store is not tied to a particular LLM platform. It provides a common interface to access these platforms and comes with built-in support for IBM's Big AI Model (BAM), Hugging Face, and OpenAI. Support for new platforms can be added as needed.

3. LLM Store is not limited to a specific set of relations. It works out-of-the-box with any Wikidata relation and can handle filter patterns other than (𝑠, 𝑝, *), including patterns with fingerprints whose evaluation goes beyond simple value prediction.

4. LLM Store allows users to customize every step of the evaluation pipeline. This includes the prompt templates, textual contexts, and methods of response parsing and entity resolution used.

5. LLM Store statements can have annotations and can be integrated with statements of other KIF stores. Annotations can be used to carry extra information, such as the model used to generate the statements, the textual context, etc. Also, because LLM Store is a KIF store, the statements it produces can be seamlessly integrated with those produced by other KIF stores using, for example, a mixer store (see [10] for details).

3. LLM Store

LLM Store is a KIF store that uses an LLM as a knowledge source. Instead of searching for the given filter pattern in an existing knowledge base, LLM Store converts the pattern into a prompt, sends it to the underlying LLM, and converts the model's response back into one or more statements matching the pattern.

Figure 2: The LLM Store pipeline.
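Before detailing the pipeline, here is a minimal usage sketch. The store name 'llm' and the parameter names below are illustrative assumptions (Section 4 mentions an llm_name parameter with values such as hf for Hugging Face); see the LLM Store documentation for the actual constructor signature:

    from kif_lib import Store
    from kif_lib.vocabulary import wd

    # Create an LLM Store backed by a Hugging Face model.  The store
    # name 'llm' and the parameters llm_name and model_id are assumed
    # for illustration.
    kb = Store('llm', llm_name='hf',
               model_id='meta-llama/Meta-Llama-3-8B-Instruct')

    # Evaluating a filter prompts the model and yields Wikidata-like
    # statements matching the pattern.
    for stmt in kb.filter(subject=wd.Brazil, property=wd.official_language):
        print(stmt)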
The LLM Store pipeline, illustrated in Figure 2, consists of three main steps: (context-assisted) knowledge extraction, entity resolution, and statement construction. (In Figure 2, the dashed arrows represent optional paths.) We detail each step next.

3.1. Knowledge Extraction

The knowledge extraction step involves creating and evaluating a prompt over the underlying LLM. The prompt is created by instantiating a cloze-style prompt template from the input filter pattern. The default prompt template has the form:

    Fill in the gap to complete the relation: {{subject}} {{property}} {{value}}

When instantiated with the filter pattern:

    Filter(wd.Brazil, wd.official_language)

it produces the cloze-style prompt:

    Fill in the gap to complete the relation: Brazil official language ___

Similarly, the same template instantiated with a filter containing a value fingerprint:

    Filter(wd.Brazil, wd.official_language, wd.instance_of(wd.sign_language))

produces the prompt:

    Fill in the gap to complete the relation: Brazil official language ___ where ___ is an instance of sign language

The "where" clause above was generated from the value fingerprint in the filter and specifies a type constraint on the answers to be predicted.

In addition to describing the task, the prompt must instruct the model on how to format the response. The complete default prompt template is actually the following:

    [SYSTEM]
    You are a helpful and honest assistant that resolves a TASK. Please, respond concisely, with no further explanation, and truthfully.

    [USER]
    TASK: Fill in the gap to complete the relation: {{subject}} {{property}} {{value}}
    The output should be only a list containing the answers, such as ["answer_1", "answer_2", ..., "answer_n"]. Do not provide any further explanation and avoid false answers. Return an empty list, such as [], if no information is available.

The default prompt template is accompanied by a default parsing function that parses the LLM output into a list of strings. Both the prompt template and the result-parsing function can be customized using LLM Store's prompt_template and prompt_parser attributes.

The result of the knowledge extraction step is a list of strings matching the blanks in the query. Once such a list of strings is obtained, there are two possibilities: either (1) the list is sent to the entity resolution step to be resolved into a list of Wikidata entity ids; or (2) the list is sent directly to the statement construction step. By default, the decision is made based on the datatype of the property used in the input pattern. In Wikidata, every property has a datatype which determines the range of its possible values. If the datatype is "item" or "property", LLM Store sends the knowledge extraction results to the entity resolution step. Otherwise, if the datatype is a data value type (string, quantity, etc.), LLM Store sends the results directly to the statement construction step. The default behavior can be overridden by setting the disambiguation attribute of LLM Store.

Textual context. We just described the context-free knowledge extraction process. LLM Store also supports the use of textual contexts in the knowledge extraction step. Textual contexts tend to reduce hallucinations and are at the basis of more advanced prompting techniques, such as retrieval-augmented generation (RAG) [9]. The LLM Store attribute context, which by default is undefined, instructs the store to use the assigned text as textual context. In practice, setting context makes LLM Store include in the default prompt the system instruction "Use the CONTEXT to support the answer" and the following text under the "[USER]" section:

    CONTEXT: {{context}}

For other attributes related to textual contexts, see LLM Store's documentation.
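The following sketch illustrates this kind of customization, assuming prompt_template, prompt_parser, and context are plain store attributes and that the parser receives the raw model output as a string (both assumptions; the question-style template is the one used for seriesHasNumberOfEpisodes in Section 4):

    import json

    # Swap in a question-style template (see Section 4).
    kb.prompt_template = 'How many episodes does the series {{subject}} have?'

    # Custom parser: assume the model answers with a JSON-style list of
    # strings; fall back to an empty list otherwise.
    def parse_response(response: str) -> list[str]:
        try:
            answers = json.loads(response)
        except ValueError:
            return []
        return [str(a) for a in answers] if isinstance(answers, list) else []

    kb.prompt_parser = parse_response

    # Ground the answers in a fixed textual context (RAG-style).
    kb.context = 'The series ran for 62 episodes over five seasons.'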
Context generation. Writing textual contexts by hand or obtaining them manually quickly becomes impractical. Because of that, LLM Store comes with a context generator to obtain a textual context for the input filter pattern automatically. The context generator is based on Wikidata: more specifically, on the site-links and external-ids sections of Wikidata pages, which contain links to Wikipedia and other external sources describing the pages' subjects. Given a Wikidata entity 𝑄 described in the pattern (usually, but not necessarily, the subject), the context generator (1) retrieves all site-links and external ids from 𝑄's Wikidata page; (2) passes these links through a series of scraping plugins to extract text chunks from the target HTML pages; and (3) ranks the extracted chunks by embedding similarity with a textual version of the input pattern. The best-ranked chunk is then used as context in the knowledge extraction step. Note that the context generator is disabled in the default configuration.

3.2. Entity Resolution

The goal of the entity resolution step is to resolve the input entity labels into Wikidata entity ids. For instance, given the label "Brazilian Portuguese", it should produce the item id Q750553, which identifies Brazilian Portuguese in Wikidata. LLM Store currently supports three entity resolution methods. Given an input label, all three methods use the Wikidata REST API's search function to obtain a list 𝐶 of candidate entities plus their descriptions. The three methods differ only in how 𝐶 is processed. The first method, baseline, is the default one; it simply picks the first entity in 𝐶. The second method, similarity-based, ranks the descriptions by embedding similarity with the task (and context) text and picks the entity with the best-ranked description. The third method, LLM-based, is similar to the second but, instead of computing embedding similarity, it uses an LLM to rank the descriptions, by asking which description best matches the task (and context) text. The resolution method to be used can be customized via LLM Store's disambiguation_method attribute.
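As an illustration, here is a minimal sketch of the baseline method. It substitutes the wbsearchentities module of Wikidata's Action API for the REST search endpoint used by LLM Store (both return ranked candidates with descriptions); everything else follows the description above:

    import requests

    WD_API = 'https://www.wikidata.org/w/api.php'

    def resolve_baseline(label: str) -> str | None:
        # Fetch candidate entities (with descriptions) for the label.
        resp = requests.get(WD_API, params={
            'action': 'wbsearchentities',
            'search': label,
            'language': 'en',
            'format': 'json'})
        candidates = resp.json().get('search', [])
        # Baseline method: simply pick the first candidate, if any.
        return candidates[0]['id'] if candidates else None

    print(resolve_baseline('Brazilian Portuguese'))  # expected: Q750553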
3.3. Statement Construction

The last step of LLM Store's pipeline is statement construction. This step takes a list of Wikidata entities or data values as input and instantiates each of these in the input filter pattern, one at a time. The result is a sequence of output KIF statements. For instance, for a filter pattern Filter(wd.Brazil, wd.shares_border_with), in which the value part is missing, and a list of entities [wd.Argentina, wd.Paraguay, wd.Uruguay], the statement construction step outputs three KIF statements: (1) "Brazil shares border with Argentina"; (2) "Brazil shares border with Paraguay"; (3) "Brazil shares border with Uruguay".

4. Evaluation

To evaluate LLM Store, we used the validation and test datasets of the LM-KBC Challenge @ ISWC 2024 [12]. As in the 2023 edition, the datasets of the 2024 challenge consist of incomplete (𝑠, 𝑝, 𝑣) triples with the value part (𝑣) missing. The task is to predict, for each incomplete triple, zero or more values 𝑣 which, depending on the relation, can be Wikidata entities or data values (numeric quantities, strings, etc.). The subjects 𝑠 are Wikidata entities and the properties 𝑝 are abstract relations that may or may not correspond to a single Wikidata relation. In the 2024 edition of the challenge, only five abstract relations are considered (see Table 1).

For the evaluation, we converted each subject-property pair (𝑠, 𝑝) in the datasets into a corresponding KIF filter pattern. For instance, the pair (Q155, countryLandBordersCountry) was converted into the pattern:

    Filter(wd.Q(155), wd.shares_border_with, wd.instance_of(wd.country_))

(Notice the use of a value fingerprint here to restrict the results to things which are countries; the rationale is that, by itself, Wikidata's wd.shares_border_with does not capture the precise meaning of the abstract relation countryLandBordersCountry.) Table 1 shows the filter pattern associated with each abstract relation of the 2024 challenge.

Table 1: Mapping between abstract relations, Wikidata properties, and KIF filter patterns.

    Abstract Relation              Wikidata Property            Filter Pattern
    awardWonBy                     winner (P1346)               Filter(𝑠, wd.P(1346))
    companyTradesAtStockExchange   stock exchange (P414)        Filter(𝑠, wd.P(414))
    countryLandBordersCountry      shares border with (P47)     Filter(𝑠, wd.P(47), wd.instance_of(wd.country_))
    personHasCityOfDeath           place of death (P20)         Filter(𝑠, wd.P(20), wd.instance_of(wd.city))
    seriesHasNumberOfEpisodes      number of episodes (P1113)   Filter(𝑠, wd.P(1113))

With the entries of the datasets converted to KIF filters, we then proceeded to evaluate each filter using LLM Store. We tested four different configurations:

Triple. LLM Store with the default prompt template and no textual context (see Section 3).

Triple-Context. LLM Store with the default prompt template and a custom textual context obtained using the context generator. In this case, the context generator was configured as in our submission to the LM-KBC Challenge @ ISWC 2024, i.e., with specific scraping plugins for each abstract relation. For instance, for awardWonBy, we picked the ner-extract plugin, which searches for entity names in the scraped text. (See [13] for the precise configuration used for each relation.)

Question. LLM Store with a custom question-based prompt template and no textual context. By a custom question-based template, we mean one in which the task is framed as a question about the subject. For instance, for the abstract relation seriesHasNumberOfEpisodes, instead of the default template, obtained from the filter pattern, we used the question template "How many episodes does the series {{subject}} have?".

Question-Context. LLM Store with the question-based prompt template of Question and the custom textual context of Triple-Context. We used this configuration in our submission [13] to the LM-KBC Challenge @ ISWC 2024.

The LM-KBC'24 Challenge limits model size to 10 billion parameters, ensuring that no participating team can outperform others by monetary investment. Therefore, first, we used the Llama3-8B-Instruct model to evaluate both the validation and test datasets (the latter being the final result submitted to the challenge in [13]). We then assessed the validation dataset with a larger model (Llama3-70B-Instruct) to showcase our solution's capability to handle different models and to compare the results with those of the smaller one. Both models were accessed through the IBM Big AI Model (BAM) platform, built by IBM Research as a test bed and incubator for helping accelerate generative AI research and its transition into IBM products (https://bam.res.ibm.com/). However, the results should remain consistent if using the same models from Hugging Face or platforms like Ollama, an open-source tool for running, creating, and sharing LLMs locally (https://ollama.com/). LLM Store supports both, with the llm_name parameter set to hf or rest, respectively.
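To make the dataset-to-filter conversion of Table 1 concrete, here is a minimal sketch. The dictionary of pattern builders is our own illustration; each builder maps a subject to the keyword arguments of kb.filter():

    from kif_lib.vocabulary import wd

    # One pattern builder per abstract relation, following Table 1.
    PATTERNS = {
        'awardWonBy':
            lambda s: dict(subject=s, property=wd.P(1346)),
        'companyTradesAtStockExchange':
            lambda s: dict(subject=s, property=wd.P(414)),
        'countryLandBordersCountry':
            lambda s: dict(subject=s, property=wd.P(47),
                           value=wd.instance_of(wd.country_)),
        'personHasCityOfDeath':
            lambda s: dict(subject=s, property=wd.P(20),
                           value=wd.instance_of(wd.city)),
        'seriesHasNumberOfEpisodes':
            lambda s: dict(subject=s, property=wd.P(1113)),
    }

    # E.g., evaluate the pair (Q155, countryLandBordersCountry):
    kwargs = PATTERNS['countryLandBordersCountry'](wd.Q(155))
    for stmt in kb.filter(**kwargs):
        print(stmt)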
Table 2 summarizes the results obtained by each of the four configurations using: (1) Llama3-8B-Instruct over the validation and test datasets; and (2) Llama3-70B-Instruct over the validation dataset.

Table 2: Results for the validation and test datasets using the Llama3-8B-Instruct model (P = precision, R = recall, F1 = F1-score).

                                           Validation Dataset                                                         Test Dataset
                                   Triple             Triple-Context      Question            Question-Context    Question-Context
    Relation                       P     R     F1     P     R     F1     P     R     F1     P     R     F1     P     R     F1
    awardWonBy                     0.27  0.03  0.06   0.50  0.26  0.30   0.29  0.08  0.12   0.50  0.30  0.32   0.60  0.67  0.62
    companyTradesAtStockExchange   0.26  0.59  0.25   0.85  0.70  0.66   0.48  0.72  0.42   0.93  0.79  0.77   0.95  0.86  0.85
    countryLandBordersCountry      0.83  0.96  0.85   0.98  0.98  0.98   0.85  0.99  0.88   0.99  0.98  0.98   0.98  0.93  0.94
    personHasCityOfDeath           0.22  0.67  0.22   0.88  0.87  0.83   0.29  0.63  0.26   0.90  0.88  0.84   0.96  0.97  0.93
    seriesHasNumberOfEpisodes      0.12  0.12  0.12   0.82  0.82  0.82   0.13  0.12  0.12   0.85  0.85  0.85   0.95  0.95  0.95
    All Relations                  0.31  0.54  0.31   0.86  0.82  0.80   0.40  0.57  0.37   0.90  0.85  0.84   0.95  0.92  0.91

Results for the validation dataset using the Llama3-70B-Instruct model.

                                   Triple             Triple-Context      Question            Question-Context
    Relation                       P     R     F1     P     R     F1     P     R     F1     P     R     F1
    awardWonBy                     0.36  0.12  0.17   0.50  0.32  0.32   0.33  0.12  0.17   0.56  0.29  0.32
    companyTradesAtStockExchange   0.45  0.60  0.37   0.95  0.80  0.77   0.73  0.77  0.61   0.93  0.78  0.74
    countryLandBordersCountry      0.84  0.99  0.85   0.99  0.99  0.99   0.92  0.99  0.93   0.99  0.99  0.99
    personHasCityOfDeath           0.41  0.72  0.39   0.93  0.88  0.87   0.51  0.71  0.42   0.94  0.87  0.86
    seriesHasNumberOfEpisodes      0.30  0.30  0.30   0.84  0.84  0.84   0.46  0.42  0.42   0.85  0.85  0.85
    All Relations                  0.47  0.61  0.44   0.91  0.85  0.84   0.62  0.68  0.56   0.91  0.85  0.84

Runtime. Figure 3 depicts the average execution time for each of the relations, highlighting the share of context generation within the total execution time, as this is a key feature of the pipeline. The execution time for the awardWonBy relation is significantly higher than for the other relations, because its processing was divided into several prompting stages. In general, context generation is the most time-consuming part. However, for the countryLandBordersCountry relation, the average context generation time was close to zero, due to the use of a cache: we mitigated the execution time of the context generation process by caching the content of the accessed site-links. This solution is especially beneficial when the system is used with a repetitive set of entities.

Figure 3: Comparison of context generation time and total execution time.

Discussion. Using Llama3-8B-Instruct, the Triple configuration achieved an average macro F1-score of 31%. The best-performing relation was countryLandBordersCountry, with an F1-score of 85%, while the worst was awardWonBy, with only 6%. The average score improved slightly, to 37%, with the question-based configuration (Question), as it mitigated some task formulation errors.
However, the overall F1-score remained low, the primary bottleneck being the knowledge extraction process. In particular, queries involving very specific information are probably not covered by the LLM's training data. The best results for Llama3-8B-Instruct were obtained when contextual information was incorporated into the prompt, whether using triple or question prompt templates. In the validation dataset, the average macro F1-score of Triple-Context was 80% and that of Question-Context was 84%. The latter configuration achieved the best average macro F1-score, 91%, in the test dataset.

In all configurations, the relation awardWonBy had the worst score. This is primarily due to the difficulty of generating relevant context for this relation. When reviewing the dataset entries, we realized that some entries lacked external links on their Wikidata pages (e.g., Q38215093). Consequently, the context generator simply did not work in these cases. (In the test dataset, most of these entities have external links, which explains the improved score of 62%.) Also, even when external links were available, extracting direct answers was often problematic. The structure of the linked HTML pages varied a lot, making it impossible to design a custom scraping plugin that worked well for all the various cases.

With Llama3-70B-Instruct, we expected to mitigate the problems in the knowledge extraction process. While the results improved without the use of additional context, the model still struggles to provide accurate answers due to the specificity of the domain. Interestingly, with the larger model, in the settings that use context, the way we instruct the LLM to perform the task is no longer a differentiating factor. And, aside from context acquisition issues in certain relations, the remaining errors are likely due to the entity resolution process and inaccuracies within the dataset.

5. Concluding Remarks

In this paper, we presented LLM Store, a KIF plugin that queries an LLM and responds with Wikidata-structured statements. We detailed how LLM Store works, walking through each of its pipeline components. We then built a system that uses LLM Store to tackle the LM-KBC task, showing its effectiveness. Results showed that our approach achieved a macro-averaged F1-score of 91% on the test dataset of the LM-KBC'24 Challenge. The key to achieving this result was our proposed context generation module.

When no context is used, however, the results may vary, since the knowledge extraction process has a large impact on accuracy. Very specific relations will likely yield poor results. Moreover, even when using the context generation module, our approach has limitations. The reliance on specialized plugins that scrape textual context from specific parts of HTML pages introduces a dependency on the stability of those pages, which can easily change and break the plugins. Thus, although LLM Store may be used as a source of knowledge, it should be used with caution.

Although our system was evaluated on the LM-KBC'24 Challenge, the approach used by LLM Store is not limited to the challenge's specific relations. LLM Store can address any question defined as a pattern involving Wikidata entities. Moreover, we claim that, as a highly configurable tool, our solution has the potential to accommodate other proposals such as those presented in previous years of LM-KBC.
In future work, we aim to enhance the capabilities of LLM Store by introducing some key improvements and extensions. We will enable more expressive KIF filters, supporting logical AND, OR, and negation. Since the question-based prompt template yielded the best results, we aim to make it the default by developing a module that automatically generates questions from input filter patterns. We also aim to improve the LLM-based entity resolution method to make it the default approach: while the baseline performed better in the Challenge, it is flawed in that it does not use any context during disambiguation. Additionally, leveraging KIF's ability to create mixed stores (see [10]), we will explore combining multiple LLM Stores for complementary usage. This approach will involve assigning specific roles to each LLM Store within a mixer, along with rules governing their functions. For example, one LLM Store might serve as a judge, reflecting on the responses of the others and determining the final answer.

References

[1] S. Razniewski, H. Arnaout, S. Ghosh, F. Suchanek, Completeness, recall, and negation in open-world knowledge bases: A survey, ACM Computing Surveys 56 (2024) 1–42.

[2] M. Machado, G. Lima, E. Soares, V. Nascimento, R. Brandao, M. Moreno, An extensible approach for query-driven multimodal knowledge graph completion, in: International Semantic Web Conference, CEUR-WS, 2022.

[3] J. Chicaiza, P. Valdiviezo-Diaz, A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions, Information 12 (2021) 232.

[4] H. Paulheim, How much is a triple? Estimating the cost of knowledge graph creation, in: M. van Erp, M. Atre, V. Lopez, K. Srinivas, C. Fortuna (Eds.), Proc. ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, co-located with ISWC 2018, Monterey, USA, October 8–12, 2018. URL: https://ceur-ws.org/Vol-2180/ISWC_2018_Outrageous_Ideas_paper_10.pdf.

[5] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Commun. ACM 57 (2014) 78–85. doi:10.1145/2629489.

[6] S. Heindorf, M. Potthast, B. Stein, G. Engels, Vandalism detection in Wikidata, in: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 327–336.

[7] B. Veseli, S. Singhania, S. Razniewski, G. Weikum, Evaluating language models for knowledge base completion, in: C. Pesquita, E. Jimenez-Ruiz, J. McCusker, D. Faria, M. Dragoni, A. Dimou, R. Troncy, S. Hertling (Eds.), The Semantic Web, Springer Nature Switzerland, Cham, 2023, pp. 227–243.

[8] A. Ratner, C. Ré, Knowledge base construction in the machine-learning era: Three critical design points: Joint-learning, weak supervision, and new representations, Queue 16 (2018) 79–90. doi:10.1145/3236386.3243045.

[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.

[10] G. Lima, J. M. B. Rodrigues, M. Machado, E. Soares, S. R. Fiorini, R. Thiago, L. G. Azevedo, V. T. da Silva, R. Cerqueira, KIF: A Wikidata-based framework for integrating heterogeneous knowledge sources, 2024. URL: https://arxiv.org/abs/2403.10304. arXiv:2403.10304.

[11] MediaWiki Team, Wikibase/DataModel, 2024. URL: https://www.mediawiki.org/wiki/Wikibase/DataModel, last accessed September 30, 2024.
[12] J.-C. Kalo, T.-P. Nguyen, S. Razniewski, B. Zhang, LM-KBC: Knowledge base construction from pre-trained language models, in: Semantic Web Challenge @ ISWC, CEUR-WS, 2024. URL: https://lm-kbc.github.io/challenge2024.

[13] M. Machado, J. Rodrigues, G. Lima, V. Silva, LLM Store: A KIF plugin for Wikidata-based knowledge base completion via LLMs, in: Semantic Web Challenge @ ISWC, CEUR-WS, 2024. URL: https://lm-kbc.github.io/challenge2024.

[14] J. Odell, M. Lemus-Rojas, L. Brys, Wikidata for Scholarly Communication Librarianship, IUPUI University Library, Indianapolis, 2022. doi:10.7912/9Z4E-9M13.

[15] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E. E. Bolton, PubChem 2023 update, Nucleic Acids Res. 51 (2023) D1373–D1380. doi:10.1093/nar/gkac956.

[16] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, S. Riedel, Language models as knowledge bases?, arXiv preprint arXiv:1909.01066 (2019).

[17] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022).

[18] B. Zhang, I. Reklos, N. Jain, A. M. Peñuela, E. Simperl, Using large language models for knowledge engineering (LLMKE): A case study on Wikidata, in: CEUR Workshop Proceedings, volume 3577, CEUR-WS, 2023.

[19] T. Li, W. Huang, N. Papasarantopoulos, P. Vougiouklis, J. Z. Pan, Task-specific pre-training and prompt decomposition for knowledge graph population with language models, arXiv preprint arXiv:2208.12539 (2022).

[20] G. Qin, J. Eisner, Learning how to ask: Querying LMs with mixtures of soft prompts, arXiv preprint arXiv:2104.06599 (2021).

[21] Z. Zhong, D. Friedman, D. Chen, Factual probing is [MASK]: Learning vs. learning to recall, arXiv preprint arXiv:2104.05240 (2021).

[22] S. Singhania, J.-C. Kalo, S. Razniewski, J. Z. Pan, LM-KBC: Knowledge base construction from pre-trained language models, in: Semantic Web Challenge @ ISWC, CEUR-WS, 2023. URL: https://lm-kbc.github.io/challenge2023.