=Paper=
{{Paper
|id=Vol-3894/paper16
|storemode=property
|title=Towards Explainable and Ontologically Grounded Language Models
|pdfUrl=https://ceur-ws.org/Vol-3894/paper16.pdf
|volume=Vol-3894
|authors=Walid S. Saba
|dblpUrl=https://dblp.org/rec/conf/kil/Saba24
}}
==Towards Explainable and Ontologically Grounded Language Models==
Walid S. Saba1
1 Institute for Experiential AI, Northeastern University, 100 Fore St, Portland, ME 04101 USA
Abstract
We argue that the relative success of large language models (LLMs) is not a reflection on the symbolic vs. subsymbolic debate but a reflection of employing an appropriate bottom-up strategy of reverse engineering language at scale. However, due to their subsymbolic nature, whatever knowledge these systems acquire about language will always be buried in millions of weights, none of which is meaningful on its own, rendering such systems utterly unexplainable. Furthermore, due to their stochastic nature, LLMs will often fail to make the correct inferences in various linguistic contexts that require reasoning in intensional, temporal, or modal settings. To remedy these shortcomings we suggest employing the successful bottom-up strategy of LLMs but in a symbolic setting, resulting in explainable, language-agnostic, and ontologically grounded language models.

Keywords
Large language models, ontology, bottom-up reverse engineering
1. Introduction

To arrive at a scientific explanation there are generally two approaches we can adopt, a top-down approach or a bottom-up approach (Salmon, 1989). However, for a top-down approach to work, there must be a set of established general principles that one can start with, which is clearly not the case when it comes to language and how our minds externalize our thoughts in language. In retrospect, therefore, it is not surprising that decades of top-down work in natural language processing (NLP) failed to produce satisfactory results, since most of this work was inspired by theories that made questionable assumptions where, for example, an innate universal grammar was assumed (Chomsky, 1957), or that we metaphorically build our linguistic competence based on a set of idealized cognitive models (Lakoff, 1987), or that natural language could be formally described using the tools of formal logic (Montague, 1973). In a similar vein, it is perhaps for the same reason that decades of top-down work in ontology and knowledge representation (Lenat and Guha, 1990; Sowa, 1995) also faltered, since most of this work amounted to pushing, in a top-down manner, metaphysical theories of how the world is supposedly structured and represented in our minds, again without any agreed-upon general principles to start with. On the other hand, unprecedented progress has been made in only a few years of NLP work that employed a data-driven bottom-up strategy, as exemplified by recent advances in large language models (LLMs) that are essentially a massive experiment of a bottom-up reverse engineering of language at scale (e.g., ChatGPT and GPT-4)2.

1.1. Issues with LLMs

Despite their relative success, LLMs do not tell us anything about how language works, since these models are not really models of language but are statistical models of regularities found in language3. In
KiL'24: Workshop on Knowledge-infused Learning co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain.
∗ Corresponding author.
w.saba@northeastern.edu (W. Saba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 GPT stands for 'Generative Pre-trained Transformer', an architecture that OpenAI built on top of the transformer architecture (Vaswani et al., 2017).
3 In looking inside the neural network (NN) of an LLM one does not find concepts, meanings, linguistic structures, etc., but weights associated with neural connections, which is exactly what one will find in an object-recognition or any other NN.
fact, and due to their subsymbolic nature, whatever 'knowledge' these models acquire about language will always be buried in millions of weights (microfeatures), none of which is meaningful on its own, rendering these models utterly unexplainable (Guizzardi and Guarino, 2024). Besides unexplainability, LLMs are also oblivious to truth (Borji, 2023), since for LLMs all text, factual or non-factual, is treated equally. Finally, while LLMs have been shown to do poorly in a number of tasks that require high-level reasoning, such as planning (Valmeekam et al., 2023), analogies (Lewis and Mitchell, 2024) and formal reasoning (Arkoudas, 2023), what concerns us here is the failure of LLMs to make the right inferences in various linguistic contexts. As an illustration of the kinds of failures in deep language understanding we consider here three linguistic contexts involving copredication, intension and propositional attitudes.

Example 1. Show the entities and the relations that are implicit in the following text: "I threw away the newspaper I was reading because they fired my favorite columnist".

Example 2. Since Madrid is the capital of Spain, can I replace one for the other in the following: "Maria thinks Madrid was not always the capital of Spain"?

Example 3. Suppose Devon knows that if someone is a client, then s/he is a student, and suppose that Olga is a client. Then what does Devon know?

The first example involves a phenomenon called copredication (see Asher and Pustejovsky, 2005), which occurs when the same entity is used in the same context to refer to more than one semantic (ontological) type. All LLMs tested4 failed to recognize that 'newspaper' in the text is used to simultaneously refer to three entities: (i) the physical object I threw away; (ii) the content of the newspaper I was reading; and (iii) the 'editorial board' of the newspaper that did the firing of the columnist. Note that the failure of the LLMs was more acute when they were asked to draw a graph showing all entities and relations implied by the text, since to show all the relations in the text all the different types of entities must be extracted. Here all LLMs tested showed the same newspaper (physical) object doing the firing of the columnist.

In example 2 all LLMs we tested approved replacing 'the capital of Spain' by 'Madrid', resulting in 'Maria thinks that Madrid was not always Madrid'. It is worth noting that the LLMs tested were consistently oblivious to intension. For example, in 'Perhaps Socrates was not the tutor of Alexander the Great', 'Socrates' and 'the tutor of Alexander the Great' were also deemed replaceable (since they are extensionally equal), resulting in 'Perhaps Socrates was not Socrates'. These results were expected, since neural networks (deep or otherwise), which are the computing architecture behind all LLMs, are purely extensional models based on the 'empiricist theory of abstraction', where their similarity semantics has no notion of 'object identity' (Lopes, 2023).

Finally, example 3 illustrates failures of LLMs in making the correct inferences in modal (belief) contexts: the response of the LLMs tested was that 'Devon knows that Olga is a student', which is clearly the wrong inference, since inferring K(Devon, student(Olga)) from K(Devon, client(Olga) → student(Olga)) requires K(Devon, client(Olga)), i.e., it requires Devon knowing that Olga is a client. We have collected many other tests that, for the sake of saving space, we make available elsewhere.5

1.2. LLMs: A Glass Half Empty, Half Full

So where do we stand now? On one hand, LLMs have clearly proven that one can get a handle on syntax and quite a bit of semantics in a bottom-up reverse engineering of language at scale; on the other hand, what we have are unexplainable models that do not shed any light on how language actually works. Moreover, it would seem that, due to their purely extensional and statistical nature, LLMs will always fail to make the correct inferences in many linguistic contexts. Since we believe the relative success of LLMs is not a reflection on the symbolic vs. subsymbolic debate but a reflection of a successful bottom-up reverse engineering strategy, we think that combining the advantages of symbolic and ontologically grounded representations with a bottom-up reverse engineering strategy is a worthwhile effort. In fact, the idea that word meaning can be extracted from how words are actually used in language is not exclusive to linguistic work in the empirical tradition; it can be traced back to Frege.

In the rest of the paper we will (i) argue that the word embeddings that are the genesis of modern-day large language models can be constructed in a symbolic setting instead of being the result of statistical co-occurrences; (ii) show that symbolic vectors perform better than current embeddings on a well-known word similarity benchmark; and (iii) discuss how our symbolic
4 Our experiments were conducted on GPT-4o (chat.openai.com).
5 https://shorturl.at/ejmH8
vectors can be used to discover the ontological structure that is implicit in our ordinary language.

2. Concerning 'the Company a Word Keeps'

The genesis of modern LLMs is the distributional semantics hypothesis, which states that the more semantically similar words are, the more they tend to occur in similar contexts – or, similarity in meaning is similarity in linguistic distribution (Harris, 1954). This is usually summarized by a saying attributed to the British linguist John R. Firth that "you shall know a word by the company it keeps". When processing a large corpus, this idea can be used by analyzing co-occurrences and contexts of use to approximate word meanings by word embeddings (vectors or tensors), which are essentially points in multidimensional space. Thus, at the root of LLMs is a bottom-up reverse engineering of language strategy which, unlike top-down approaches, "reverse engineers the process and induces semantic representations from contexts of use" (Boleda, 2020). But nothing precludes this idea from being carried out in a symbolic setting. In other words, the 'company a word keeps' can be measured in several ways other than the correlational and statistical measures that underlie modern word embeddings.

2.1. Symbolic Dimensions of Meaning

In discussing possible models of the world that can be employed in computational linguistics, Hobbs (1985) once suggested that there are two alternatives: (i) on one extreme we could attempt building a "correct" theory that would entail a full description of the world, something that would involve quantum physics and all the sciences; (ii) on the other hand, we could have a promiscuous model of the world that is isomorphic to the way we talk about it in natural language (emphasis is ours). Since the first option is a project that is most likely impossible to complete, what Hobbs is clearly suggesting here is a reverse engineering of language to discover how we actually use language to talk about the world we live in. This is also not much different from Frege's Context Principle, which suggests "never ask for the meaning of words in isolation" (Dummett, 1981), but that a word gets its meaning from analyzing all the contexts in which the word can appear (Milne, 1986). Again, what this suggests is that the meaning of words is embedded (to use a modern terminology) in all the ways we use these words in how we talk about the world. While Hobbs' and Frege's observations might be a bit vague, the proposal put forth by Fred Sommers (1963) was very specific. Again, Sommers suggests that "to know the meaning of a word is to know how to formulate some sentences containing the word", and this would lead, as in Frege's case, to the conclusion that a complete knowledge of some word w would be all the ways w can be used. For Sommers, the process of understanding the meaning of some word w starts by analyzing all the properties P that can sensibly be said of w. Thus, for example, [delicious Thursday] is not sensible while [delicious apple] is, regardless of the truth or falsity of the predication. Moreover, since [delicious cake] is also sensible, there must be a common type (perhaps food?) that subsumes both apple and cake. This idea is similar to the idea of type checking in strongly typed polymorphic programming languages. For example, the types in an expression such as 'x + 3' will only unify (or the expression will only 'make sense') if/when x is an object of type number (as opposed to a tuple, for example). As suggested in (Saba, 2007), this type of analysis can thus be used to 'discover' the ontology that seems to be implicit in the language, as will be discussed below. First, however, we describe how a bottom-up reverse engineering of language can be done in a symbolic setting.

2.2. Symbolic Reverse Engineering of Language

The procedure we have in mind assumes a Platonic universe where all concepts, physical or abstract, including states, activities, properties (tropes) (Moltmann, 2013), processes, events, etc. are considered entities that can be defined by a number of language-agnostic primitives (Smith, 2005) that we call the 'dimensions of meaning'. We consider here the following dimensions: AGENTOF, OBJECTOF, HASPROP, INSTATE, PARTOF, INPROCESS, and OFTYPE. For every word w in the language, and for every dimension D, a reverse-engineering process is conducted to compute a set wD = {(x, t) | D(w, x)}, where t is a weight in [0, 1]. Here are example sets computed for 'book' along five dimensions of meaning, along with the masking prompt that queries what an LLM has 'learned' about how we talk about books:

book . HASPROP
Everyone likes to read a [MASK] book.
=> {(popular, 0.9), (educational, 0.8), (famous, 0.8), ... }

book . OBJECTOF
Everyone I know enjoyed [MASK] 'The Prince'.
=> {(reading, 0.9), (writing, 0.8), (editing, 0.8), ... }

book . AGENTOF
Das Kapital has [MASK] many people over the years.
=> {(influenced, 0.9), (inspired, 0.8), (changed, 0.8), ... }

book . PARTOF
Hamlet should be part of every [MASK].
=> {(collection, 0.9), (archive, 0.8), (library, 0.8), ... }

book . INSTATE
I was told that my book is now in [MASK].
=> {(print, 0.9), (circulation, 0.8), (review, 0.8), ... }
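The masked-prompt procedure above can be sketched in code. What follows is a minimal sketch, not the authors' implementation: the masked language model is stubbed with a hypothetical lookup table (using the prompts and weights from the 'book' examples above), and the 0.5 cutoff is an assumed threshold; with a real fill-mask model, only the stub would change.

```python
# Sketch of computing a symbolic dimension set wD = {(x, t) | D(w, x)}.
# The masked-LM scorer is stubbed with a small lookup table; in practice
# it would be a fill-mask model scoring candidate fillers for [MASK].

# Hypothetical stub standing in for a masked language model.
MASKED_LM_SCORES = {
    "Everyone likes to read a [MASK] book.":
        [("popular", 0.9), ("educational", 0.8), ("famous", 0.8), ("the", 0.3)],
    "Everyone I know enjoyed [MASK] 'The Prince'.":
        [("reading", 0.9), ("writing", 0.8), ("editing", 0.8), ("it", 0.2)],
}

def fill_mask(prompt):
    """Return (filler, weight) candidates for the [MASK] token (stubbed)."""
    return MASKED_LM_SCORES.get(prompt, [])

def dimension_set(prompt, threshold=0.5):
    """Compute wD for the dimension probed by the prompt: keep fillers
    whose weight t clears the threshold, dropping low-content fillers."""
    return {(x, t) for (x, t) in fill_mask(prompt) if t >= threshold}

book_hasprop = dimension_set("Everyone likes to read a [MASK] book.")
book_objectof = dimension_set("Everyone I know enjoyed [MASK] 'The Prince'.")
print(sorted(book_hasprop))
```

Running the sketch keeps content-bearing fillers such as ('popular', 0.9) in book.HASPROP while filtering out function words such as 'the' that a real model would also rank among its candidates.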
What the above says is the following: (i) in ordinary spoken language we speak of a 'book' that is popular, educational, famous, etc.; (ii) we speak of reading, writing, editing, etc. a 'book'; (iii) we speak of a 'book' that may change, influence, inspire, etc.; (iv) we speak of a 'book' that is part of a collection, an archive, or a library; and (v) a book can be in review, in print, in circulation, etc. The nominalization process can be conducted using the copular 'is' as shown in table 1. For example, 'John is famous' can be restated as 'John has the property of fame'; 'Jim is sad' as 'Jim is in a state of sadness'; etc. (see [Smith, 2005] for more on the relationship between the copular and abstract entities and [Moltmann, 2013] for more on abstract objects.) What should be noted here is that even with the simple conceptual structure discovered thus far one can generate plausible text, such as the following:

(1) enjoyed the interesting reading of the new book
(2) completed a boring reading of a controversial book

Table 1: From propositions to relations and entities

The sensible (and meaningful) fragment in (1) can be generated because a book can be 'read' and described by 'new', and readings can be 'interesting' and the object of enjoyment; and similarly for (2), where a reading of a controversial book can be boring and the object of a completion, etc. Note, however, that text generation in this case is not a function of 'predicting' the most likely continuation, but a function of plausibly filling in subjects, objects, agents, descriptions, etc. to any propositional structure.

Figure 1: We speak of a 'book' (i) that may influence, change, convince; (ii) that is edited, read, written; (iii) that can be popular, controversial, famous; (iv) that is part of a library, an archive, etc.
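This kind of generation by plausible filling-in can be illustrated with a toy sketch. The dimension sets and the sentence template below are illustrative assumptions modeled on examples (1) and (2), not the authors' system; the point is only that every filler is licensed by a dimension set rather than predicted as a likely continuation.

```python
# Toy sketch: generate a fragment by filling a fixed propositional
# structure, checking each filler against the dimension set that
# licenses it. All sets below are illustrative values.

book_objectof = {"reading": 0.9, "writing": 0.8, "editing": 0.8}
book_hasprop = {"new": 0.7, "controversial": 0.7}
reading_hasprop = {"interesting": 0.8, "boring": 0.7}
reading_objectof = {"enjoyed": 0.8, "completed": 0.7}

def fragment(verb, act_prop, book_prop, act="reading"):
    """Fill 'VERB the PROP ACT of the PROP book', licensing each slot."""
    assert verb in reading_objectof      # readings can be enjoyed/completed
    assert act_prop in reading_hasprop   # readings can be interesting/boring
    assert act in book_objectof          # books can be read
    assert book_prop in book_hasprop     # books can be new/controversial
    return f"{verb} the {act_prop} {act} of the {book_prop} book"

print(fragment("enjoyed", "interesting", "new"))
# -> enjoyed the interesting reading of the new book
```

An ill-typed request such as fragment("enjoyed", "delicious", "new") would be rejected, mirroring the sensibility checks ([delicious Thursday] vs. [delicious apple]) discussed in section 2.1.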
2.3. Symbolic Embeddings

The process we have described thus far results in symbolic word embeddings such as the ones shown in figure 2 below. In figure 2(a) we show the symbolic embeddings of 'boy' and 'lad' along the HASPROP dimension. Thus, in ordinary spoken language it is sensible to speak of a 'handsome boy' and a 'funny boy' as well as a 'clever lad' and a 'talented lad'. We note here that in this process generic descriptions are removed using a function that computes the information content of adjectives, where the information content of an adjective adj is inversely proportional to the set of types that adj can sensibly be applied to. For example, 'beautiful' will have a low information content score since 'beautiful' can sensibly be said of many concepts, both physical and abstract (e.g., car, movie, poem, night, girl, …), while 'tasty' can sensibly be said of 'food' and just a few others. The symbolic embeddings in figure 2(b) are those of 'automobile' and 'car' along the OBJECTOF dimension. Note now that word similarity along these symbolic dimensions can be computed using cosine similarity as well as weighted Jaccard similarity, where max and min can be used as fuzzy union and fuzzy intersection. We are currently experimenting with the optimal number of dimensions using a number of word similarity benchmarks, including the WordSim353 dataset (Finkelstein et al., 2001)6.

6 https://kaggle.com/datasets/julianschelb/wordsim353-crowd

3. The Ontology of the Language of Thought?

The reverse engineering process we have described above would result in symbolic embeddings along
various dimensions, as the ones shown in figure 2. As a result of this, we could then analyze the subset relations between these embeddings to discover the ontological structure that seems to be implicit in our ordinary language. To illustrate, consider the following:
(3) car . OBJECTOF = {(driving, 0.9), (repairing, 0.8), (buying, 0.8), ... }
(4) book . OBJECTOF = {(reading, 0.9), (writing, 0.8), (buying, 0.8), ... }
(5) person . AGENTOF = {(reading, 0.9), (writing, 0.8), (driving, 0.8), ... }
(6) person . HASPROP = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }
(7) car . HASPROP = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }
(8) book . HASPROP = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }
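The overlap analysis over sets like (3)–(8), and the weighted Jaccard similarity mentioned in section 2.3, can be sketched as follows. This is an illustrative sketch, not the authors' code: the sets reuse the example values above, and the min/max fuzzy-overlap formulation is one standard way to realize a weighted Jaccard measure.

```python
# Weighted Jaccard over symbolic dimension sets, using min as fuzzy
# intersection and max as fuzzy union, plus a simple shared-filler
# check that hints at a common supertype (e.g., car/book as 'artifact').

CAR_OBJECTOF = {"driving": 0.9, "repairing": 0.8, "buying": 0.8}
BOOK_OBJECTOF = {"reading": 0.9, "writing": 0.8, "buying": 0.8}

def weighted_jaccard(a, b):
    """Sum of min weights over sum of max weights, across all fillers."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return inter / union if union else 0.0

def shared_fillers(a, b):
    """Fillers licensed for both words: evidence of a common supertype."""
    return sorted(set(a) & set(b))

print(round(weighted_jaccard(CAR_OBJECTOF, BOOK_OBJECTOF), 3))  # -> 0.19
print(shared_fillers(CAR_OBJECTOF, BOOK_OBJECTOF))  # -> ['buying']
```

Here 'buying' appears in both OBJECTOF sets, which is the kind of shared-filler evidence that suggests car and book sit under a common parent type.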
Note that a car can be the object of 'buying' and so can a book, and this means that car and book must, at some level of abstraction, share the same parent (perhaps 'artifact'?). Note also that a car, as well as a book and a person, can be popular. An analysis along these lines would result in the following:

(9) read(person, book)
(10) write(person, book)
(11) buy(person, T1 = car ⊔ book ⊔ …)
(12) drive(person, car)
(13) beautiful(T2 = person ⊔ car ⊔ book ⊔ …)

What the above says is the following: in ordinary spoken language we speak of people reading and writing books (9 and 10); we speak of people buying cars and books, and thus of buying objects that are of some type that subsumes both cars and books (11); we speak of people driving cars (12); and we speak of beautiful people, cars, and books, and thus 'beautiful' seems to be a property that can sensibly be said of concepts that are at a very high level of generality (13). As suggested by Sommers (1963), this type of analysis, which can be fully automated with the help of LLMs, can help us discover what he called 'the Tree of Language' – which is essentially the ontology that seems to lie underneath our ordinary language. This might also be what Hobbs (1985) was seeking when he suggested building a model of the world that is isomorphic to the way we talk about it in natural language.

Figure 2: (a) the symbolic embeddings of 'boy' and 'lad' along the HASPROP dimension (with a weighted Jaccard similarity of 0.876) and (b) those of 'automobile' and 'car' along the OBJECTOF dimension (the weighted Jaccard similarity is 0.91)
4. Concluding Remarks

Large language models (LLMs) have shown impressive capabilities that pioneers in artificial intelligence and natural language processing would marvel at. However, we believe that LLMs are not the answer to the language understanding problem, nor to reasoning in general and commonsense reasoning in particular. Due to their paradigmatic unexplainability, LLMs will also not shed any light on how language works and how we externalize our thoughts in language. Since, in our opinion, the relative success of LLMs is not due to their subsymbolic nature but due to applying a successful bottom-up reverse engineering strategy, we suggested here applying the same strategy but in a symbolic setting, something that has been argued for by logicians dating back to Frege. By combining the successful bottom-up strategy with symbolic and ontological methods we arrive at explainable and ontologically grounded language models that can be used in problems requiring commonsense reasoning. We are still in the early stage of this work, but we currently have the tools to realize the dream of Frege and Sommers and perhaps shed some light on the 'language of thought' (Fodor, 1998) – the internal language that we use to construct and process our thoughts.
References

[1] Nicholas Asher and James Pustejovsky. 2005. Word Meaning and Commonsense Metaphysics. In: Course Materials for Type Selection and the Semantics of Local Context, ESSLLI 2005.
[2] Gemma Boleda. 2020. Distributional Semantics and Linguistic Theory. Annual Review of Linguistics, 6, pp. 213-234.
[3] Ali Borji. 2023. A Categorical Archive of ChatGPT Failures. Available online at https://arxiv.org/abs/2302.03494
[4] Noam Chomsky. 1957. Syntactic Structures. Mouton de Gruyter, NY.
[5] Michael Dummett. 1981. Frege: Philosophy of Language. Harvard University Press.
[6] Lev Finkelstein et al. 2001. Placing Search in Context: The Concept Revisited. In Proceedings of the 10th International Conference on the World Wide Web. ACM.
[7] Jerry Fodor. 1998. Concepts: Where Cognitive Science Went Wrong. Oxford University Press.
[8] Giancarlo Guizzardi and Nicola Guarino. 2024. Semantics, Ontology, and Explanation. Data & Knowledge Engineering (to appear).
[9] Zellig S. Harris. 1954. Distributional Structure. Word 10, pp. 146-162.
[10] Jerry Hobbs. 1985. Ontological Promiscuity. In Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp. 61-69.
[11] George Lakoff. 1987. Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. University of Chicago Press.
[12] Doug Lenat and R. V. Guha. 1990. Building Large Knowledge-Based Systems: Representation and Inference in the CYC Project. Addison-Wesley.
[13] Martha Lewis and Melanie Mitchell. 2024. Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in LLMs. https://arxiv.org/abs/2402.08955
[14] Jesse Lopes. 2023. Can Deep CNNs Avoid Infinite Regress/Circularity in Content Constitution? Minds and Machines, vol. 33, pp. 507-524.
[15] Friederike Moltmann. 2013. Abstract Objects and the Semantics of Natural Language. Oxford University Press.
[16] Peter Milne. 1986. Frege's Context Principle. Mind, Vol. 95, No. 380, pp. 491-495.
[17] Richard Montague. 1973. The Proper Treatment of Quantification in Ordinary English. In: Kulas, J., Fetzer, J.H., Rankin, T.L. (eds) Philosophy, Language, and Artificial Intelligence. Studies in Cognitive Systems, vol 2.
[18] Walid Saba. 2020. Language, Knowledge, and Ontology: Where Formal Semantics Went Wrong, and How to Go Forward, Again. Journal of Knowledge Structures and Systems, 1(1): 40-62.
[19] Walid Saba. 2007. Language, Logic and Ontology: Uncovering the Structure of Commonsense Knowledge. International Journal of Human-Computer Studies, 65(7): 610-623.
[20] Wesley Salmon. 1989. Four Decades of Scientific Explanation. In: P. Kitcher & W. Salmon (Eds), Minnesota Studies in the Philosophy of Science, Vol. XIII. University of Minnesota Press, pp. 3-21.
[21] Barry Smith. 2005. Against Fantology. In J. Marek and E. M. Reicher (eds.), Experience and Analysis, pp. 153-170.
[22] Fred Sommers. 1963. Types and Ontology. Philosophical Review, 72(3), pp. 327-363.
[23] John Sowa. 1995. Knowledge Representation: Logical, Philosophical and Computational Foundations. PWS Publishing Company, Boston.
[24] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the Planning Abilities of Large Language Models - A Critical Investigation. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
[25] A. Vaswani, N. Shazeer, et al. 2017. Attention is All You Need. Available online at https://arxiv.org/abs/1706.03762