CEUR Workshop Proceedings, Vol-3894, paper 21: Prompt-Time Ontology-Driven Symbolic Knowledge Capture with Large Language Models. https://ceur-ws.org/Vol-3894/paper21.pdf
                                Prompt-Time Ontology-Driven Symbolic Knowledge
                                Capture with Large Language Models
                                Tolga Çöplü* , Arto Bendiken, Andrii Skomorokhov, Eduard Bateiko and Stephen Cobb
                                Haltia, Inc.


                                                Abstract
                                                In applications such as personal assistants, large language models (LLMs) must consider the user’s personal information and
                                                preferences. However, LLMs lack the inherent ability to learn from user interactions. This paper explores capturing personal
                                                information from user prompts using ontology and knowledge-graph approaches. We use a subset of the KNOW ontology,
                                                which models personal information, to train the language model on these concepts. We then evaluate the success of knowledge
capture using a specially constructed dataset. Our code and datasets are publicly available at https://github.com/HaltiaAI/paper-PTODSKC.

                                                Keywords
                                                ontology-driven symbolic knowledge capture, KNOW ontology, symbolic representation, knowledge graphs, large language
                                                models, fine-tuning



CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
KiL’24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
* Corresponding author.
Email: tolga@haltia.ai (T. Çöplü); arto@haltia.ai (A. Bendiken); andriy@haltia.ai (A. Skomorokhov); eduard@haltia.ai (E. Bateiko); steve@haltia.ai (S. Cobb)
ORCID: 0009-0004-9414-0588 (T. Çöplü); 0009-0002-0725-4874 (A. Bendiken); 0000-0002-5696-6723 (A. Skomorokhov); 0000-0002-6729-5611 (E. Bateiko); 0009-0004-0476-6000 (S. Cobb)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Currently, many generative artificial intelligence (AI) applications, particularly personal assistants, strive to offer users personalized experiences. To achieve this, AI applications must learn personal information and preferences from user interactions (knowledge capture) and use this learned knowledge in future conversations (knowledge utilization). Implementing this fundamental personal AI approach depends on addressing several complex sub-problems, such as discerning which user-prompt information is personal, extracting it, determining whether the extracted information is a duplicate, and associating it with other personal data.

These challenges have been the focus of extensive research within the AI field for many years. However, the emergence of neurosymbolic approaches through the collaboration between large language models (LLMs) and symbolic AI has provided researchers with new perspectives [1, 2, 3, 4]. LLMs’ capabilities in natural language processing can be integrated with the representational and factual reasoning abilities of knowledge graphs, enhanced by the structure, rules, and inference mechanisms offered by an ontology. For targeted personal AI applications, this ontology approach presents several benefits:

    • Ontology schemas enable language models to determine which personal information will be captured and how it will be associated with other captured knowledge.
    • Ontology rules can help identify inconsistencies in the captured knowledge, allowing for validation before storage.
    • Ontology relationships allow the extraction of implicit information from captured knowledge, effectively enabling automatic inference that expands the knowledge graph.
    • A robust, personalized knowledge graph forms a reliable foundation for facilitating personalized interactions with the application through language models.

In this paper, we address a specific aspect of the AI personalization challenge by focusing on prompt-time, ontology-driven symbolic knowledge capture using language models. We explore the extraction from user prompts of subject-predicate-object triples¹ that conform to a specified ontology. We have investigated various methods to enable the underlying language model to comprehend a pre-defined ontology, ensuring effective symbolic knowledge capture. By utilizing a specially designed dataset, we evaluate the effectiveness of these methods, emphasizing their strengths and identifying potential areas for improvement.

¹ https://www.w3.org/TR/rdf12-concepts/

The structure of this paper is as follows: Section 2 discusses in-context learning and fine-tuning approaches for ontology-driven symbolic knowledge capture and focuses on the details of the fine-tuning approach. Section 3 describes the experimental setup by presenting the development framework, the language model selection, and the ontology and dataset creation process. Section 4
outlines our performance evaluation framework and the test results. Finally, Section 5 concludes the paper and suggests future directions.


2. Ontology-Driven Symbolic Knowledge Capture

In the literature, language models have demonstrated their capability to transform unstructured text into a knowledge graph [5, 6, 7, 8, 9]. However, the process of populating a knowledge graph from user prompts in alignment with a pre-defined ontology has been explored only marginally [10, 11, 12, 13, 14]. Research typically centers on in-context learning, which heavily relies on prompt engineering. A significant limitation of this approach is the requirement to incorporate the entire custom ontology into the prompt. This necessity not only slows down the knowledge capture process, because of the high token overhead, but also restricts the use of larger ontologies due to the constraint on context-window length. Given these constraints, in-context learning methods do not provide a scalable solution for ontology-driven symbolic knowledge capture.

An alternative approach involves training a language model with a pre-defined ontology, so that the model internalizes it. There are two strategies to consider: pre-training the LLM on the ontology or fine-tuning it. This paper does not explore pre-training due to its extensive data, computational-resource, energy, and time requirements. Additionally, pre-training does not offer a flexible response to ongoing changes or expansions in the ontology. Therefore, this paper will focus on fine-tuning as a method to train language models on personal ontologies, highlighting advantages in feasibility and maintainability.

2.1. Ontology-Driven Knowledge Capture with Fine-Tuning

Fine-tuning is a process whereby a pre-trained language model is further trained on a specific dataset to tailor its capabilities to a particular task. In our study, the language model is expected to learn the classes, object properties, and data properties defined in an ontology, and to use them to populate a knowledge graph from user prompts. The first step involves preparing a fine-tuning dataset, which includes user prompts, system prompts, and expected model responses for each concept in the ontology. This dataset is used to fine-tune the language model, which is then evaluated by testing it with new prompts to assess the effectiveness of the knowledge capture operation. We define a system prompt for this task with the requirement of maintaining the model’s generality across other tasks.

The following points highlight the key aspects of ontology fine-tuning:

    • The training dataset’s coverage and diversity are vital for successful fine-tuning. These characteristics greatly influence the LLM’s ability to effectively capture knowledge. Details about the dataset and how it is constructed are discussed in Section 3.4.
    • The training dataset must include a variety of examples for each element of the predefined ontology. This approach avoids scalability issues typically associated with in-context learning and ensures comprehensive learning coverage.
    • If the LLM encounters a user prompt that is not relevant to the predefined ontology concepts, it should not attempt to capture knowledge. Therefore, the dataset should also contain sufficient out-of-context samples to enable the LLM to distinguish between relevant and irrelevant information for capture.


3. Experimental Setup

This section explores the components of our experimental setup.

3.1. Development Framework

The methods suggested in this paper have been implemented using the Apple MLX framework [15]. MLX is a specialized array framework designed for machine-learning applications, akin to NumPy, PyTorch, or JAX, with the distinction of being exclusive to Apple silicon.

Ontology fine-tuning has been conducted using the parameter-efficient QLoRA approach [16] on our custom dataset, comprising randomly selected, non-overlapping sets of training, validation, and test samples.

3.2. Language Model

The methods we have developed here do not have a structural dependency on a particular underlying foundation model. The key factors guiding our language-model selection were its proven effectiveness across diverse domains in community benchmarks and its prevalence in the field. Owing to its performance on the Hugging Face Open LLM Leaderboard [17] and its robust ecosystem, Mistral-7B-Instruct-v0.2 [18], based on the Llama 2 [19] architecture, was selected for our research. We ran all examples, tests, and benchmarks on the MLX 4-bit quantized version of this model to be able to run the tests on personal laptops.
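As described in Section 2.1, each fine-tuning sample pairs a system prompt and a user prompt with the symbolic response the model is expected to emit. A hypothetical record might look like the following sketch; the chat/JSONL layout, prompt wording, prefix IRI, and property names are our illustrative assumptions, not the released dataset format:

```python
import json

# Hypothetical fine-tuning record: a system prompt, a user prompt, and
# the Turtle triples the model should emit for that prompt.
# Field names, prompt text, and the prefix IRI are illustrative only.
record = {
    "messages": [
        {
            "role": "system",
            "content": (
                "Extract personal facts from the user's message as Turtle "
                "triples conforming to the predefined ontology. If the "
                "message contains no relevant fact, return nothing."
            ),
        },
        {"role": "user", "content": "My sister Alice just moved back to town."},
        {
            "role": "assistant",
            "content": (
                '@prefix know: <https://example.org/know#> .\n'
                '[] a know:Person ;\n'
                '   know:sister [ a know:Person ; know:name "Alice" ] .'
            ),
        },
    ]
}

# Each record is serialized as a single JSON line of the training file.
line = json.dumps(record, ensure_ascii=False)
```

Out-of-context samples (the third bullet above) would use the same layout with an empty assistant response, teaching the model when not to capture.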
3.3. Predefined Ontology

Our study is inspired by KNOW [20], the Knowledge Navigator Ontology for the World, and utilizes it for representing personal information. KNOW is introduced as a pioneering framework designed to capture everyday knowledge to enhance language models in real-world generative AI applications such as personal AI assistants. The ontology focuses on human life, encompassing everyday concerns and significant milestones, and limits its initial scope to established human universals, including spacetime (places, events) and social dimensions (people, groups, organizations). This pragmatic approach emphasizes universality and utility, contrasting with previous works like Schema.org [21] and Cyc [22] by building on language models’ inherent encoding of salient commonsense knowledge.

Due to the requirement that each element in the ontology be associated with a diverse set of prompt and response samples within the training dataset, our research focuses on a specific subset of the KNOW ontology. This subset concentrates on core family relationships, with four ontology classes, eleven object properties, and one data property. A visual depiction of this subset is presented in Figure 1.

[…] we included 32 generic user prompts in the dataset. The composition of this dataset, which consists of 175 user prompts, is illustrated in Figure 2. Concepts not associated with the ontology are labeled under the ’none’ legend in the figure. As each sample prompt typically contains multiple modeled concepts, the chart shows a total number of concept occurrences greater than the number of prompts.

[Figure 2: bar chart of occurrence counts (roughly 0 to 300) for the concepts Person, Name, Sex, Child, Father, Mother, Sibling, Sister, Brother, Spouse, Partner, Knows, and None.]
Figure 2: Occurrences of ontology concepts in the prepared dataset.
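Because capture is restricted to this subset, a captured triple can be checked against the subset's vocabulary before it is stored in the knowledge graph. The sketch below illustrates such a validation step; the property names are read off the figure labels, and the exact KNOW identifiers (including the eleventh object property, which is not legible in the figure residue) are our assumptions:

```python
# Minimal sketch of validating captured triples against the ontology
# subset: only triples whose predicate belongs to the subset are kept.
# Property names are assumed from the figure labels, for illustration.
OBJECT_PROPERTIES = {
    "spouse", "partner", "knows", "child", "parent",
    "sister", "brother", "sibling", "mother", "father",
}
DATA_PROPERTIES = {"name"}
ALLOWED_PREDICATES = OBJECT_PROPERTIES | DATA_PROPERTIES

def filter_triples(triples):
    """Keep only subject-predicate-object triples whose predicate is
    defined in the predefined ontology subset."""
    return [t for t in triples if t[1] in ALLOWED_PREDICATES]

captured = [
    ("user", "sister", "person1"),
    ("person1", "name", "Alice"),
    ("user", "favoriteColor", "blue"),  # outside the subset: dropped
]
valid = filter_triples(captured)
```

In a full pipeline this check would operate on parsed Turtle output rather than bare tuples, but the principle is the same: predicates outside the predefined ontology are rejected before storage.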
[Figure 1: diagram of the KNOW ontology subset; its labeled object properties include spouse, child, parent, partner, knows, sister, mother, sibling, brother, and father.]

The Turtle format was chosen for serializing the ontology population in our research because of its straightforward structure, readability, and prevalent use in existing