Towards Explainable and Ontologically Grounded Language Models

Walid S. Saba¹

¹ Institute for Experiential AI, Northeastern University, 100 Fore St, Portland, ME 04101 USA
Abstract
We argue that the relative success of large language models (LLMs) is not a reflection on the symbolic vs. subsymbolic debate but a reflection of employing an appropriate bottom-up strategy of reverse engineering language at scale. However, due to their subsymbolic nature, whatever knowledge these systems acquire about language will always be buried in millions of weights, none of which is meaningful on its own, rendering such systems utterly unexplainable. Furthermore, due to their stochastic nature, LLMs will often fail to make the correct inferences in linguistic contexts that require intensional, temporal, or modal reasoning. To remedy these shortcomings we suggest employing the successful bottom-up strategy of LLMs in a symbolic setting, resulting in explainable, language-agnostic, and ontologically grounded language models.

Keywords
Large language models, ontology, bottom-up reverse engineering



1. Introduction

To arrive at a scientific explanation there are generally two approaches we can adopt: a top-down approach or a bottom-up approach (Salmon, 1989). However, for a top-down approach to work, there must be a set of established general principles that one can start with, which is clearly not the case when it comes to language and how our minds externalize our thoughts in language. In retrospect, therefore, it is not surprising that decades of top-down work in natural language processing (NLP) failed to produce satisfactory results, since most of this work was inspired by theories that made questionable assumptions: for example, that there is an innate universal grammar (Chomsky, 1957), that we metaphorically build our linguistic competence from a set of idealized cognitive models (Lakoff, 1987), or that natural language could be formally described using the tools of formal logic (Montague, 1973). In a similar vein, and perhaps for the same reason, decades of top-down work in ontology and knowledge representation (Lenat and Guha, 1990; Sowa, 1995) also faltered, since most of this work amounted to pushing, in a top-down manner, metaphysical theories of how the world is supposedly structured and represented in our minds, again without any agreed-upon general principles to start with. On the other hand, unprecedented progress has been made in only a few years of NLP work that employed a data-driven bottom-up strategy, as exemplified by recent advances in large language models (LLMs), which are essentially a massive experiment in a bottom-up reverse engineering of language at scale (e.g., ChatGPT and GPT-4).²

1.1. Issues with LLMs

Despite their relative success, LLMs do not tell us anything about how language works, since these models are not really models of language but statistical models of regularities found in language.³


KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain.
Corresponding author: w.saba@northeastern.edu (W. Saba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
² GPT stands for 'Generative Pre-trained Transformer', an architecture that OpenAI built on top of the transformer architecture (Vaswani et al., 2017).
³ In looking inside the neural network (NN) of an LLM one does not find concepts, meanings, linguistic structures, etc., but weights associated with neural connections, which is exactly what one will find in an object recognition or any other NN.




In fact, and due to their subsymbolic nature, whatever 'knowledge' these models acquire about language will always be buried in millions of weights (microfeatures), none of which is meaningful on its own, rendering these models utterly unexplainable (Guizzardi and Guarino, 2024). Besides unexplainability, LLMs are also oblivious to truth (Borji, 2023), since for LLMs all text, factual or non-factual, is treated equally. Finally, while LLMs have been shown to do poorly in a number of tasks that require high-level reasoning, such as planning (Valmeekam et al., 2023), analogies (Lewis and Mitchell, 2024) and formal reasoning (Arkoudas, 2023), what concerns us here is the failure of LLMs in making the right inferences in various linguistic contexts. As an illustration of the kinds of failures in deep language understanding we consider here three linguistic contexts involving copredication, intension and propositional attitudes.

Example 1. Show the entities and the relations that are implicit in the following text: "I threw away the newspaper I was reading because they fired my favorite columnist".

Example 2. Since Madrid is the capital of Spain, can I replace one for the other in the following: "Maria thinks Madrid was not always the capital of Spain"?

Example 3. Suppose Devon knows that if someone is a client, then s/he is a student, and suppose that Olga is a client. Then what does Devon know?
The first example involves a phenomenon called copredication (see Asher and Pustejovsky, 2005), which occurs when the same entity is used in the same context to refer to more than one semantic (ontological) type. All LLMs tested⁴ failed in recognizing that 'newspaper' in the text is used to simultaneously refer to three entities: (i) the physical object I threw away; (ii) the content of the newspaper I was reading; and (iii) the 'editorial board' of the newspaper that did the firing of the columnist. The failure of the LLMs was even more acute when they were asked to draw a graph showing all entities and relations implied by the text, since to show all the relations in the text all the different types of entities must be extracted. Here all LLMs tested showed the same newspaper (physical) object doing the firing of the columnist.
⁴ Our experiments were conducted on GPT-4o (chat.openai.com).
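To make the target of Example 1 concrete, a correct answer would distinguish the three readings of 'newspaper' as three entities and attach each relation to the right one. The following is only an illustrative sketch; the entity and relation labels are ours and are not the output of any system:

# Illustrative sketch of the entity/relation structure a correct answer to
# Example 1 should contain; the labels are ours, not the output of any system.
entities = {
    "speaker":           "person",
    "newspaper_copy":    "physical object",      # what was thrown away
    "newspaper_content": "information content",  # what was being read
    "newspaper_org":     "organization",         # who fired the columnist
    "columnist":         "person",
    "newspaper":         "newspaper (underspecified)",
}
relations = [
    ("threw_away",  "speaker",           "newspaper_copy"),
    ("was_reading", "speaker",           "newspaper_content"),
    ("fired",       "newspaper_org",     "columnist"),
    ("favorite_of", "columnist",         "speaker"),
    # the three 'newspaper' entities are facets of one and the same newspaper
    ("facet_of",    "newspaper_copy",    "newspaper"),
    ("facet_of",    "newspaper_content", "newspaper"),
    ("facet_of",    "newspaper_org",     "newspaper"),
]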
In example 2 all LLMs we tested approved replacing 'the capital of Spain' by 'Madrid', resulting in 'Maria thinks that Madrid was not always Madrid'. It is worth noting that the LLMs tested were consistently oblivious to intension. For example, in 'Perhaps Socrates was not the tutor of Alexander the Great', 'Socrates' and 'the tutor of Alexander the Great' were also deemed replaceable (since they are extensionally equal), resulting in 'Perhaps Socrates was not Socrates'. These results were expected since neural networks (deep or otherwise), which are the computing architecture behind all LLMs, are purely extensional models based on the 'empiricist theory of abstraction', where their similarity semantics has no notion of 'object identity' (Lopes, 2023).
Finally, example 3 illustrates failures of LLMs in making the correct inferences in modal (belief) contexts: the response of the LLMs tested was that 'Devon knows that Olga is a student', which is clearly the wrong inference, since inferring K(Devon, student(Olga)) from K(Devon, client(Olga) → student(Olga)) requires K(Devon, client(Olga)), i.e., it requires Devon knowing that Olga is a client. We have collected many other tests that we make available elsewhere for the sake of saving space.⁵
⁵ https://shorturl.at/ejmH8
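To make the logical point of Example 3 explicit, the following minimal sketch (for illustration only; representing what is known as a set of strings is our simplification) closes an agent's knowledge under modus ponens. The conditional alone licenses nothing; the antecedent must also be known:

# Minimal sketch of why the inference in Example 3 fails: within a knowledge
# context, modus ponens may only be applied when BOTH the conditional and its
# antecedent are themselves known.

def known_closure(known_facts, known_rules):
    """Close an agent's knowledge under modus ponens.

    known_facts: set of atomic propositions the agent knows, e.g. {"client(Olga)"}
    known_rules: set of (antecedent, consequent) conditionals the agent knows.
    """
    closure = set(known_facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in known_rules:
            if antecedent in closure and consequent not in closure:
                closure.add(consequent)
                changed = True
    return closure

# K(Devon, client(Olga) -> student(Olga)) alone:
print(known_closure(set(), {("client(Olga)", "student(Olga)")}))
# -> set(): student(Olga) is NOT derivable, contrary to what the LLMs answered.

# Only with the additional premise K(Devon, client(Olga)):
print(known_closure({"client(Olga)"}, {("client(Olga)", "student(Olga)")}))
# -> {'client(Olga)', 'student(Olga)'}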
1.2. LLMs: A Glass Half Empty, Half Full

So where do we stand now? On one hand, LLMs have clearly proven that one can get a handle on syntax and quite a bit of semantics in a bottom-up reverse engineering of language at scale; yet on the other hand what we have are unexplainable models that do not shed any light on how language actually works. Moreover, it would seem that due to their purely extensional and statistical nature, LLMs will always fail in making the correct inferences in many linguistic contexts. Since we believe the relative success of LLMs is not a reflection on the symbolic vs. subsymbolic debate but a reflection of a successful bottom-up reverse engineering strategy, we think that combining the advantages of symbolic and ontologically grounded representations with a bottom-up reverse engineering strategy is a worthwhile effort. In fact, the idea that word meaning can be extracted from how words are actually used in language is not exclusive to linguistic work in the empirical tradition; it can in fact be traced back to Frege.
In the rest of the paper we will (i) first argue that the word embeddings that are the genesis of modern-day large language models can be constructed in a symbolic setting instead of being the result of statistical co-occurrences; (ii) show that symbolic vectors perform better than current embeddings on a well-known word similarity benchmark; and (iii) discuss how our symbolic vectors can be used to discover the ontological structure that is implicit in our ordinary language.

2. Concerning 'the Company a Word Keeps'

The genesis of modern LLMs is the distributional semantics hypothesis, which states that the more semantically similar words are, the more they tend to occur in similar contexts – or, similarity in meaning is similarity in linguistic distribution (Harris, 1954). This is usually summarized by a saying attributed to the British linguist John R. Firth, that "you shall know a word by the company it keeps". When processing a large corpus, this idea can be put to use by analyzing co-occurrences and contexts of use to approximate word meanings by word embeddings (vectors or tensors), which are essentially points in a multidimensional space. Thus, at the root of LLMs is a bottom-up reverse engineering of language strategy that, unlike top-down approaches, "reverse engineers the process and induces semantic representations from contexts of use" (Boleda, 2020). But nothing precludes this idea from being carried out in a symbolic setting. In other words, the 'company a word keeps' can be measured in several ways other than the correlational and statistical measures that underlie modern word embeddings.
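As a reminder of what the statistical route looks like, the following toy sketch (illustrative only; production embeddings are learned rather than counted) builds word vectors from simple co-occurrence counts and compares them with cosine similarity:

# Toy illustration of distributional embeddings: represent each word by its
# co-occurrence counts with every other word, then compare words by cosine
# similarity. The intuition is "the company a word keeps".
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "she read an interesting book",
    "he read a boring book",
    "she wrote a popular novel",
    "he wrote an interesting novel",
]

window = 2
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

print(cosine(cooc["book"], cooc["novel"]))   # higher: similar contexts of use
print(cosine(cooc["book"], cooc["wrote"]))   # lower: different distributional roles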
2.1. Symbolic Dimensions of Meaning

In discussing possible models of the world that can be employed in computational linguistics, Hobbs (1985) once suggested that there are two alternatives: (i) on one extreme we could attempt building a "correct" theory that would entail a full description of the world, something that would involve quantum physics and all the sciences; (ii) on the other hand, we could have a promiscuous model of the world that is isomorphic to the way we talk about it in natural language (emphasis is ours). Since the first option is a project that is most likely impossible to complete, what Hobbs is clearly suggesting here is a reverse engineering of language to discover how we actually use language to talk about the world we live in. This is also not much different from Frege's Context Principle, which suggests that we should "never ask for the meaning of words in isolation" (Dummett, 1981), but that a word gets its meaning from analyzing all the contexts in which the word can appear (Milne, 1986). Again, what this suggests is that the meaning of words is embedded (to use a modern terminology) in all the ways we use these words when we talk about the world.
While Hobbs' and Frege's observations might be a bit vague, the proposal put forth by Fred Sommers (1963) was very specific. Sommers suggests that "to know the meaning of a word is to know how to formulate some sentences containing the word", and this would lead, as in Frege's case, to the conclusion that a complete knowledge of some word w would be all the ways w can be used. For Sommers, the process of understanding the meaning of some word w starts by analyzing all the properties P that can sensibly be said of w. Thus, for example, [delicious Thursday] is not sensible while [delicious apple] is, regardless of the truth or falsity of the predication. Moreover, since [delicious cake] is also sensible, there must be a common type (perhaps food?) that subsumes both apple and cake. This idea is similar to type checking in strongly typed polymorphic programming languages. For example, the types in an expression such as 'x + 3' will only unify (or the expression will only 'make sense') if/when x is an object of type number (as opposed to a tuple, for example). As suggested in (Saba, 2007), this type of analysis can thus be used to 'discover' the ontology that seems to be implicit in the language, as will be discussed below. First, however, we describe how a bottom-up reverse engineering of language can be done in a symbolic setting.

2.2. Symbolic Reverse Engineering of Language

The procedure we have in mind assumes a Platonic universe where all concepts, physical or abstract, including states, activities, properties (tropes) (Moltmann, 2013), processes, events, etc., are considered entities that can be defined by a number of language-agnostic primitives (Smith, 2005) that we call the 'dimensions of meaning'. We consider here the following dimensions: AGENTOF, OBJECTOF, HASPROP, INSTATE, PARTOF, INPROCESS, and OFTYPE. For every word w in the language, and for every dimension D, a reverse-engineering process is conducted to compute a set wD = {(x, t) | D(w, x)}, where t is a weight in [0, 1]. Here are example sets computed for 'book' along five dimensions of meaning, along with the masking prompt that queries what an LLM has 'learned' about how we talk about books:

book . HASPROP
Everyone likes to read a [MASK] book.
=> {(popular, 0.9), (educational, 0.8), (famous, 0.8), ... }

book . OBJECTOF
Everyone I know enjoyed [MASK] 'The Prince'.
=> {(reading, 0.9), (writing, 0.8), (editing, 0.8), ... }

book . AGENTOF
Das Kapital has [MASK] many people over the years.
=> {(influenced, 0.9), (inspired, 0.8), (changed, 0.8), ... }

book . PARTOF
Hamlet should be part of every [MASK].
=> {(collection, 0.9), (archive, 0.8), (library, 0.8), ... }

book . INSTATE
I was told that my book is now in [MASK].
=> {(print, 0.9), (circulation, 0.8), (review, 0.8), ... }
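Masking prompts of this kind can be answered by any off-the-shelf masked language model. The sketch below shows one way such a set could be computed; the choice of model, the prompt wording, and the use of the model's scores as the weights t are illustrative assumptions, not a description of a finished system:

# Sketch: computing one 'dimension of meaning' set, e.g. book.HASPROP, by
# querying a masked language model. The model choice and the use of raw model
# scores as the weights t are illustrative assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def dimension_set(prompt, top_k=10):
    """Return a list of (filler, weight) pairs for a masking prompt."""
    return [(r["token_str"].strip(), round(r["score"], 3))
            for r in fill(prompt, top_k=top_k)]

# book . HASPROP
print(dimension_set("Everyone likes to read a [MASK] book."))
# e.g. [('good', ...), ('new', ...), ...]; raw scores would still need to be
# normalized and aggregated over many prompts to yield weights in [0, 1].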
What these example sets say is the following: (i) in ordinary spoken language we speak of a 'book' that is popular, educational, famous, etc.; (ii) we speak of reading, writing, editing, etc. a 'book'; (iii) we speak of a 'book' that may change, influence, inspire, etc.; (iv) we speak of a 'book' that is part of a collection, an archive, or a library; and (v) we speak of a book that can be in review, in print, in circulation, etc. The nominalization process can be conducted using the copular 'is' as shown in table 1. For example, 'John is famous' can be restated as 'John has the property of fame', 'Jim is sad' as 'Jim is in a state of sadness', etc. (see [Smith, 2005] for more on the relationship between the copular and abstract entities, and [Moltmann, 2013] for more on abstract objects). What should be noted here is that even with the simple conceptual structure discovered thus far one can generate plausible text, such as the following:

(1) enjoyed the interesting reading of the new book
(2) completed a boring reading of a controversial book

Table 1: From propositions to relations and entities

The sensible (and meaningful) fragment in (1) can be generated because a book can be 'read' and described by 'new', and readings can be 'interesting' and the object of enjoyment; similarly for (2), where a reading of a controversial book can be boring and the object of a completion, etc. Note, however, that text generation in this case is not a function of 'predicting' the most likely continuation, but a function of plausibly filling in subjects, objects, agents, descriptions, etc. in a propositional structure.

Figure 1: We speak of a 'book' (i) that can influence, change, convince; (ii) that is edited, read, written; (iii) that can be popular, controversial, famous; (iv) that is part of a library, an archive, etc.

2.3. Symbolic Embeddings

The process we described thus far results in symbolic word embeddings such as the ones shown in figure 2. In figure 2(a) we show the symbolic embeddings of 'boy' and 'lad' along the HASPROP dimension. Thus, in ordinary spoken language it is sensible to speak of a 'handsome boy' and a 'funny boy' as well as a 'clever lad' and a 'talented lad'. We note here that in this process generic descriptions are removed using a function that computes the information content of adjectives, where the information content of an adjective adj is inversely proportional to the number of types that adj can sensibly be applied to. For example, 'beautiful' will have a low information content score since 'beautiful' can sensibly be said of many concepts, both physical and abstract (e.g., car, movie, poem, night, girl, ...), while 'tasty' can sensibly be said of 'food' and just a few others. The symbolic embeddings in figure 2(b) are those of 'automobile' and 'car' along the OBJECTOF dimension. Note now that word similarity along these symbolic dimensions can be computed using cosine similarity as well as weighted Jaccard similarity, where max and min can be used for fuzzy union and fuzzy intersection. We are currently experimenting with the optimal number of dimensions using a number of word similarity benchmarks, including the WordSim353 dataset (Finkelstein et al., 2001).⁶
⁶ https://kaggle.com/datasets/julianschelb/wordsim353-crowd

Figure 2: (a) The symbolic embeddings of 'boy' and 'lad' along the HASPROP dimension (with a weighted Jaccard similarity of 0.876) and (b) those of 'automobile' and 'car' along the OBJECTOF dimension (with a weighted Jaccard similarity of 0.91).
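The weighted Jaccard similarity reported in figure 2 can be computed as in the following sketch, with the element-wise min playing the role of fuzzy intersection and the element-wise max that of fuzzy union; the embedding sets here are truncated toy versions, not the actual sets behind the figure:

# Sketch of weighted Jaccard similarity over symbolic embeddings: fuzzy
# intersection = element-wise min of the weights, fuzzy union = element-wise
# max. The sets below are toy versions of the 'boy'/'lad' HASPROP embeddings.

def weighted_jaccard(a, b):
    """a, b: dicts mapping fillers to weights in [0, 1]."""
    keys = set(a) | set(b)
    numer = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denom = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numer / denom if denom else 0.0

boy_hasprop = {"handsome": 0.9, "funny": 0.8, "clever": 0.7, "young": 0.9}
lad_hasprop = {"clever": 0.9, "talented": 0.8, "funny": 0.7, "young": 0.9}

# Similarity of the two toy sets (the values in figure 2 come from the full sets).
print(weighted_jaccard(boy_hasprop, lad_hasprop))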
3. The Ontology of the Language of Thought?

The reverse engineering process we have described above would result in symbolic embeddings along various dimensions, such as the ones shown in figure 2. As a result of this, however, we could then analyze the subset relations between these embeddings to discover the ontological structure that seems to be implicit in our ordinary language. To illustrate, consider the following:

(3) car . OBJECTOF
    = {(driving, 0.9), (repairing, 0.8), (buying, 0.8), ... }
(4) book . OBJECTOF
    = {(reading, 0.9), (writing, 0.8), (buying, 0.8), ... }
(5) person . AGENTOF
    = {(reading, 0.9), (writing, 0.8), (driving, 0.8), ... }
(6) person . HASPROP
    = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }
(7) car . HASPROP
    = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }
(8) book . HASPROP
    = {(popularity, 0.9), (fame, 0.8), (beautiful, 0.8), ... }

Note that a car can be the object of 'buying' and so can a book, and this means that car and book must, at some level of abstraction, share the same parent (perhaps 'artifact'?). Note also that a car, as well as a book and a person, can be popular. An analysis along these lines would result in the following:

(9) read(person, book)
(10) write(person, book)
(11) buy(person, T1 = car ⊔ book ⊔ ...)
(12) drive(person, car)
(13) beautiful(T2 = person ⊔ car ⊔ book ⊔ ...)

What the above says is the following: in ordinary spoken language we speak of people reading and writing books (9 and 10); we speak of people buying cars and books, and thus of buying objects that are of some type that subsumes both cars and books (11); we speak of people driving cars (12); and we speak of beautiful people, cars, and books (and thus beautiful seems to be a property that can sensibly be said of concepts that are at a very high level of generality). As suggested by Sommers (1963), this type of analysis, which can be fully automated with the help of LLMs, can help us discover what he called 'the Tree of Language' – which is essentially the ontology that seems to be underneath our ordinary language. This might also be what Hobbs (1985) was seeking when he suggested building a model of the world that is isomorphic to the way we talk about it in natural language.
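The kind of analysis carried out in (3) through (13) can be mechanized by intersecting the dimension sets of different words: whenever two or more words share a filler along some dimension, a common supertype is postulated. A minimal sketch of this idea follows; the data and the threshold are illustrative, not the actual discovery procedure:

# Sketch of deriving candidate supertypes from shared fillers: if several words
# can be the OBJECTOF the same verb (or can sensibly bear the same property),
# postulate a common parent type for them. Data and threshold are illustrative.
from itertools import combinations

objectof = {
    "car":  {"driving": 0.9, "repairing": 0.8, "buying": 0.8},
    "book": {"reading": 0.9, "writing": 0.8, "buying": 0.8},
}
hasprop = {
    "person": {"popularity": 0.9, "fame": 0.8, "beautiful": 0.8},
    "car":    {"popularity": 0.9, "fame": 0.8, "beautiful": 0.8},
    "book":   {"popularity": 0.9, "fame": 0.8, "beautiful": 0.8},
}

def shared_fillers(dimension, threshold=0.5):
    """Yield (word pair, common fillers) pairs that license a common supertype."""
    for w1, w2 in combinations(dimension, 2):
        common = {f for f in dimension[w1]
                  if f in dimension[w2]
                  and min(dimension[w1][f], dimension[w2][f]) >= threshold}
        if common:
            yield (w1, w2), common

print(list(shared_fillers(objectof)))
# car and book share 'buying' -> postulate T1 subsuming both (perhaps 'artifact'),
# as in (11) above.
print(list(shared_fillers(hasprop)))
# person, car and book all share 'popularity', 'fame', 'beautiful' -> a more
# general type T2 subsuming all three, as in (13) above.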
4. Concluding Remarks

Large language models (LLMs) have shown impressive capabilities that pioneers in artificial intelligence and natural language processing would marvel at. However, we believe that LLMs are not the answer to the language understanding problem, nor to reasoning in general and commonsense reasoning in particular. Due to their paradigmatic unexplainability, LLMs will also not shed any light on how language works and how we externalize our thoughts in language. Since, in our opinion, the relative success of LLMs is not due to their subsymbolic nature but due to applying a successful bottom-up reverse engineering strategy, we suggested here applying the same strategy but in a symbolic setting, something that has been argued for by logicians dating back to Frege. By combining the successful bottom-up strategy with symbolic and ontological methods we arrive at explainable and ontologically grounded language models that can be used in problems requiring commonsense reasoning.
We are still in the early stages of this work, but we currently have the tools to realize the dream of Frege and Sommers and perhaps shed some light on the 'language of thought' (Fodor, 1998) – the internal language that we use to construct and process our thoughts.
References

[1] Nicholas Asher and James Pustejovsky. 2005. Word Meaning and Commonsense Metaphysics. In: Course Materials for Type Selection and the Semantics of Local Context, ESSLLI 2005.
[2] Gemma Boleda. 2020. Distributional Semantics and Linguistic Theory. Annual Review of Linguistics, 6, pp. 213-234.
[3] Ali Borji. 2023. A Categorical Archive of ChatGPT Failures. https://arxiv.org/abs/2302.03494
[4] Noam Chomsky. 1957. Syntactic Structures. Mouton de Gruyter, NY.
[5] Michael Dummett. 1981. Frege: Philosophy of Language. Harvard University Press.
[6] Lev Finkelstein et al. 2001. Placing Search in Context: The Concept Revisited. In: Proceedings of the 10th International Conference on the World Wide Web, ACM.
[7] Jerry Fodor. 1998. Concepts: Where Cognitive Science Went Wrong. Oxford University Press.
[8] Giancarlo Guizzardi and Nicola Guarino. 2024. Semantics, Ontology, and Explanation. Data & Knowledge Engineering (to appear).
[9] Zellig S. Harris. 1954. Distributional Structure. Word, 10, pp. 146-162.
[10] Jerry Hobbs. 1985. Ontological Promiscuity. In: Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp. 61-69.
[11] George Lakoff. 1987. Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. University of Chicago Press.
[12] Doug Lenat and R. V. Guha. 1990. Building Large Knowledge-Based Systems: Representation and Inference in the CYC Project. Addison-Wesley.
[13] Martha Lewis and Melanie Mitchell. 2024. Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in LLMs. https://arxiv.org/abs/2402.08955
[14] Jesse Lopes. 2023. Can Deep CNNs Avoid Infinite Regress/Circularity in Content Constitution? Minds and Machines, 33, pp. 507-524.
[15] Friederike Moltmann. 2013. Abstract Objects and the Semantics of Natural Language. Oxford University Press.
[16] Peter Milne. 1986. Frege's Context Principle. Mind, 95 (380), pp. 491-495.
[17] Richard Montague. 1973. The Proper Treatment of Quantification in Ordinary English. In: Kulas, J., Fetzer, J.H., Rankin, T.L. (eds), Philosophy, Language, and Artificial Intelligence. Studies in Cognitive Systems, vol. 2.
[18] Walid Saba. 2020. Language, Knowledge, and Ontology: Where Formal Semantics Went Wrong, and How to Go Forward, Again. Journal of Knowledge Structures and Systems, 1 (1), pp. 40-62.
[19] Walid Saba. 2007. Language, Logic and Ontology: Uncovering the Structure of Commonsense Knowledge. International Journal of Human-Computer Studies, 65 (7), pp. 610-623.
[20] Wesley Salmon. 1989. Four Decades of Scientific Explanation. In: P. Kitcher and W. Salmon (eds), Minnesota Studies in the Philosophy of Science, Vol. XIII. University of Minnesota Press, pp. 3-21.
[21] Barry Smith. 2005. Against Fantology. In: J. Marek and E. M. Reicher (eds), Experience and Analysis, pp. 153-170.
[22] Fred Sommers. 1963. Types and Ontology. Philosophical Review, 72 (3), pp. 327-363.
[23] John Sowa. 1995. Knowledge Representation: Logical, Philosophical and Computational Foundations. PWS Publishing Company, Boston.
[24] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the Planning Abilities of Large Language Models: A Critical Investigation. In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
[25] Ashish Vaswani, Noam Shazeer, et al. 2017. Attention is All You Need. https://arxiv.org/abs/1706.03762