=Paper=
{{Paper
|id=Vol-3894/paper21
|storemode=property
|title=Prompt-Time Ontology-Driven Symbolic Knowledge Capture with Large Language Models
|pdfUrl=https://ceur-ws.org/Vol-3894/paper21.pdf
|volume=Vol-3894
|authors=Tolga Çöplü,Arto Bendiken,Andrii Skomorokhov,Eduard Bateiko,Stephen Cobb
}}
==Prompt-Time Ontology-Driven Symbolic Knowledge Capture with Large Language Models==
Tolga Çöplü*, Arto Bendiken, Andrii Skomorokhov, Eduard Bateiko and Stephen Cobb
Haltia, Inc.
Abstract
In applications such as personal assistants, large language models (LLMs) must consider the user’s personal information and
preferences. However, LLMs lack the inherent ability to learn from user interactions. This paper explores capturing personal
information from user prompts using ontology and knowledge-graph approaches. We use a subset of the KNOW ontology,
which models personal information, to train the language model on these concepts. We then evaluate the success of knowledge
capture using a specially constructed dataset. Our code and datasets are publicly available at https://github.com/HaltiaAI/paper-PTODSKC
Keywords
ontology-driven symbolic knowledge capture, KNOW ontology, symbolic representation, knowledge graphs, large language
models, fine-tuning
KiL’24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
*Corresponding author.
tolga@haltia.ai (T. Çöplü); arto@haltia.ai (A. Bendiken); andriy@haltia.ai (A. Skomorokhov); eduard@haltia.ai (E. Bateiko); steve@haltia.ai (S. Cobb)
ORCID: 0009-0004-9414-0588 (T. Çöplü); 0009-0002-0725-4874 (A. Bendiken); 0000-0002-5696-6723 (A. Skomorokhov); 0000-0002-6729-5611 (E. Bateiko); 0009-0004-0476-6000 (S. Cobb)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Currently, many generative artificial intelligence (AI) applications, particularly personal assistants, strive to offer users personalized experiences. To achieve this, AI applications must learn personal information and preferences from user interactions (knowledge capture) and use this learned knowledge in future conversations (knowledge utilization). Implementing this fundamental personal AI approach depends on addressing several complex sub-problems, such as discerning which user prompt information is personal, extracting it, determining whether the extracted information is a duplicate, and associating it with other personal data.

These challenges have been the focus of extensive research within the AI field for many years. However, the emergence of neurosymbolic approaches through the collaboration between large language models (LLMs) and symbolic AI has provided researchers with new perspectives [1, 2, 3, 4]. LLMs’ capabilities in natural language processing can be integrated with the representational and factual reasoning abilities of knowledge graphs, enhanced by the structure, rules, and inference mechanisms offered by an ontology. For targeted personal AI applications, this ontology approach presents several benefits:

• Ontology schemas enable language models to determine which personal information will be captured and how it will be associated with other captured knowledge.
• Ontology rules can help identify inconsistencies in the captured knowledge, allowing for validation before storage.
• Ontology relationships allow the extraction of implicit information from captured knowledge, effectively enabling automatic inference that expands the knowledge graph.
• A robust, personalized knowledge graph forms a reliable foundation for facilitating personalized interactions with the application through language models.

In this paper, we address a specific aspect of the AI personalization challenge by focusing on prompt-time, ontology-driven symbolic knowledge capture using language models. We explore the extraction from user prompts of subject-predicate-object triples¹ that conform to a specified ontology. We have investigated various methods to enable the underlying language model to comprehend a pre-defined ontology, ensuring effective symbolic knowledge capture. By utilizing a specially designed dataset, we evaluate the effectiveness of these methods, emphasizing their strengths and identifying potential areas for improvement.

¹ https://www.w3.org/TR/rdf12-concepts/

The structure of this paper is as follows: Section 2 discusses in-context learning and fine-tuning approaches for ontology-driven symbolic knowledge capture and focuses on the details of the fine-tuning approach. Section 3 describes the experimental setup by presenting the development framework, the language model selection, and the ontology and dataset creation process. Section 4
outlines our performance evaluation framework and the test results. Finally, Section 5 concludes the paper and suggests future directions.

2. Ontology-Driven Symbolic Knowledge Capture

In the literature, language models have demonstrated their capability to transform unstructured text into a knowledge graph [5, 6, 7, 8, 9]. However, the process of populating a knowledge graph from user prompts in alignment with a pre-defined ontology has been explored only marginally [10, 11, 12, 13, 14]. Research typically centers on in-context learning, which relies heavily on prompt engineering. A significant limitation of this approach is the requirement to incorporate the entire custom ontology into the prompt. This necessity not only slows down the knowledge capture process because of the high token overhead, but also restricts the use of larger ontologies due to the constraint on context-window length. Given these constraints, in-context learning methods do not provide a scalable solution for ontology-driven symbolic knowledge capture.

An alternative approach involves training a language model with a pre-defined ontology so that the model internalizes it. There are two strategies to consider: pre-training the LLM on the ontology or fine-tuning it. This paper does not explore pre-training due to its extensive data, computational-resource, energy, and time requirements. Additionally, pre-training does not offer a flexible response to ongoing changes or expansions in the ontology. Therefore, this paper focuses on fine-tuning as a method to train language models on personal ontologies, highlighting its advantages in feasibility and maintainability.

2.1. Ontology-Driven Knowledge Capture with Fine-Tuning

Fine-tuning is a process whereby a pre-trained language model is further trained on a specific dataset to tailor its capabilities to a particular task. In our study, the language model is expected to learn the classes, object properties, and data properties defined in an ontology, and to use them to populate a knowledge graph from user prompts. The first step involves preparing a fine-tuning dataset, which includes user prompts, system prompts, and expected model responses for each concept in the ontology. This dataset is used to fine-tune the language model, which is then evaluated by testing it with new prompts to assess the effectiveness of the knowledge capture operation. We define a system prompt for this task with the requirement of maintaining the model’s generality across other tasks.

The following points highlight the key aspects of ontology fine-tuning:

• The training dataset’s coverage and diversity are vital for successful fine-tuning. These characteristics greatly influence the LLM’s ability to effectively capture knowledge. Details about the dataset and how it is constructed are discussed in Section 3.4.
• The training dataset must include a variety of examples for each element of the predefined ontology. This approach avoids the scalability issues typically associated with in-context learning and ensures comprehensive learning coverage.
• If the LLM encounters a user prompt that is not relevant to the predefined ontology concepts, it should not attempt to capture knowledge. Therefore, the dataset should also contain sufficient out-of-context samples to enable the LLM to distinguish between relevant and irrelevant information for capture.

3. Experimental Setup

This section explores the components of our experimental setup.

3.1. Development Framework

The methods suggested in this paper have been implemented using the Apple MLX framework [15]. MLX is a specialized array framework designed for machine-learning applications, akin to NumPy, PyTorch, or JAX, with the distinction of being exclusive to Apple silicon.

Ontology fine-tuning has been conducted using the parameter-efficient QLoRA approach [16] on our custom dataset, comprising randomly selected, non-overlapping sets of training, validation, and test samples.

3.2. Language Model

The methods we have developed here do not have a structural dependency on a particular underlying foundation model. The key factors guiding our language model selection were its proven effectiveness across diverse domains in community benchmarks and its prevalence in the field. Owing to its performance on the Hugging Face Open LLM Leaderboard [17] and its robust ecosystem, Mistral-7B-Instruct-v0.2 [18], based on the Llama 2 [19] architecture, was selected for our research. We ran all examples, tests, and benchmarks on the MLX 4-bit quantized version of this model so that the tests could be run on personal laptops.
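To make the dataset-preparation step concrete, the following minimal Python sketch shows one plausible shape for a fine-tuning record (system prompt, user prompt, expected Turtle response, including an out-of-context sample) and a random, non-overlapping train/validation/test split. This is not the authors’ released code: the field names, the system-prompt wording, and the placeholder namespace IRI are illustrative assumptions.

```python
import json
import random

# Hypothetical record layout for one fine-tuning sample; field names
# ("system", "prompt", "response") and the prompt wording are assumptions.
SYSTEM_PROMPT = (
    "Extract personal information from the user's message as RDF triples "
    "conforming to the KNOW ontology. If nothing is relevant, return an "
    "empty graph."
)

samples = [
    {
        "system": SYSTEM_PROMPT,
        "prompt": "My sister Alice lives nearby.",
        # Namespace IRI elided; actual terms depend on the ontology in use.
        "response": '_:me know:sister _:alice .\n_:alice know:name "Alice" .',
    },
    {
        "system": SYSTEM_PROMPT,
        "prompt": "What is the tallest mountain on Earth?",  # out-of-context
        "response": "",  # the model should capture nothing here
    },
]

def split_dataset(samples, train=0.8, valid=0.1, seed=42):
    """Randomly partition samples into non-overlapping train/valid/test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, test_set = split_dataset(samples)
# Serialize the training portion as JSONL, one record per line.
train_jsonl = "\n".join(json.dumps(s) for s in train_set)
```

Because the three slices are taken from one shuffled copy, every sample lands in exactly one split, matching the non-overlapping requirement stated in Section 3.1.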
3.3. Predefined Ontology

Our study is inspired by KNOW [20] – the Knowledge Navigator Ontology for the World – and utilizes it for representing personal information. KNOW is introduced as a pioneering framework designed to capture everyday knowledge to enhance language models in real-world generative AI applications such as personal AI assistants. The ontology focuses on human life, encompassing everyday concerns and significant milestones, and limits its initial scope to established human universals, including spacetime (places, events) and social dimensions (people, groups, organizations). This pragmatic approach emphasizes universality and utility, contrasting with previous works like Schema.org [21] and Cyc [22] by building on language models’ inherent encoding of salient commonsense knowledge.

Due to the requirement that each element in the ontology be associated with a diverse set of prompt and response samples within the training dataset, our research focuses on a specific subset of the KNOW ontology. This subset concentrates on core family relationships with four ontology classes, eleven object properties, and one data property. A visual depiction of this subset is presented in Figure 1.

… we included 32 generic user prompts in the dataset. The composition of this dataset, which consists of 175 user prompts, is illustrated in Figure 2. Concepts not associated with the ontology are labeled under the ’none’ legend in the figure. As each sample prompt typically contains multiple modeled concepts, the chart shows a total number of concept occurrences greater than the number of prompts.

[Figure 2: bar chart over the concepts Person, Name, Sex, Child, Father, Mother, Sibling, Sister, Brother, Spouse, Partner, Knows, and None; x-axis: Number of Occurrences (0–300).]
Figure 2: Occurrences of ontology concepts in the prepared dataset.
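Per-concept counts like those charted in Figure 2 can be produced by tallying concept occurrences across dataset samples. The sketch below assumes, purely for illustration, that each sample carries a "concepts" list (with "none" for out-of-context prompts); this structure is hypothetical, not the paper’s actual dataset schema.

```python
from collections import Counter

# Toy samples with a hypothetical "concepts" annotation per prompt.
samples = [
    {"prompt": "My brother Tom and I visited our mother.",
     "concepts": ["Person", "Name", "Brother", "Mother"]},
    {"prompt": "My wife Eve knows my colleague.",
     "concepts": ["Person", "Name", "Spouse", "Knows"]},
    {"prompt": "What time is it in Tokyo?",
     "concepts": ["none"]},  # out-of-context prompt
]

# Tally concept occurrences across all prompts.
counts = Counter(c for s in samples for c in s["concepts"])

# One prompt can exercise several concepts, so total occurrences
# exceed the number of prompts, as noted for Figure 2.
for concept, n in counts.most_common():
    print(f"{concept:10s} {n}")
```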
[Figure 1: visual depiction of the KNOW ontology subset; relationships shown include spouse, partner, knows, child, parent, sibling, sister, brother, mother, father.]

The Turtle format was chosen for serializing the ontology population in our research because of its straightforward structure, readability, and prevalent use in existing …
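One use of the ontology subset at capture time is rejecting model outputs whose predicates fall outside it (the validation-before-storage benefit noted in Section 1). The sketch below is a minimal, stdlib-only check, not a full Turtle parser and not the paper’s implementation: the `know:` prefix handling is simplified, and the predicate local names follow the relationships visible in Figure 1 rather than the complete KNOW subset.

```python
import re

# Illustrative allow-list based on the relationship names shown in
# Figure 1 plus the single data property; not the authoritative subset.
ALLOWED_PREDICATES = {
    "name",                        # data property
    "spouse", "partner", "knows",  # object properties
    "child", "parent",
    "sibling", "sister", "brother",
    "mother", "father",
}

def invalid_predicates(turtle_text):
    """Return know: predicate local names not covered by the allow-list."""
    used = set(re.findall(r"\bknow:(\w+)", turtle_text))
    return used - ALLOWED_PREDICATES

captured = """
_:me know:sister _:alice .
_:alice know:name "Alice" .
_:alice know:employer _:acme .
"""
print(invalid_predicates(captured))  # {'employer'} -> reject before storage
```

A non-empty result signals an inconsistency with the predefined ontology, so the captured triples can be rejected or repaired before the knowledge graph is updated.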