=Paper=
{{Paper
|id=Vol-3322/short6
|storemode=property
|title=An Experiment in Measuring Understanding
|pdfUrl=https://ceur-ws.org/Vol-3322/short6.pdf
|volume=Vol-3322
|authors=Luc Steels,Lara Verheyen,Remi van Trijp
|dblpUrl=https://dblp.org/rec/conf/ijcai/SteelsVT22
}}
==An Experiment in Measuring Understanding==
Luc Steels¹, Lara Verheyen² and Remi van Trijp³

¹ Barcelona Supercomputing Center, Barcelona, Spain
² Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
³ SONY Computer Science Laboratories, 6, rue Amyot, 75005 Paris, France
Abstract

Human-centric AI requires not only data-driven pattern recognition methods but also reasoning. Reasoning requires rich models, and we call the process of coming up with these models understanding. Understanding is hard because in real-world problem situations the input for making a model is often fragmented, underspecified, ambiguous and uncertain, and many sources of knowledge are required, including vision and pattern recognition, language parsing, ontologies, knowledge graphs, discourse models, mental simulation, real-world action and episodic memory.

This paper reports on a way to measure progress in understanding. We frame the problem of understanding in terms of a process of generating questions, reducing questions, and finding answers to questions. We show how meta-level monitors can collect information so that we can quantitatively track the advances in understanding. The paper is illustrated with an implemented system that combines knowledge from language, ontologies, mental simulation and discourse memory to understand a cooking recipe phrased in natural language (English).
''IJCAI 2022: Workshop on semantic techniques for narrative-based understanding, July 24, 2022, Vienna, Austria.''
Contact: steels@arti.vub.ac.be (L. Steels); lara.verheyen@ai.vub.ac.be (L. Verheyen); remi.vantrijp@sony.com (R. van Trijp).
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The current wave of data-driven AI almost exclusively employs reactive intelligence. Deliberative AI, which was the core of knowledge-based systems in the 1970s and 1980s, is nevertheless needed to achieve some of the properties argued to be central to human-centric AI, such as (i) providing explanations comprehensible to humans, (ii) dealing with outliers, (iii) learning by being told, (iv) being verifiable and (v) seamlessly cooperating with humans [1].

Using deliberative AI and integrating it with reactive AI is a realistic target today: reactive AI has advanced enough to be usable in real-world applications, and past decades of AI research have already produced a large number of methods and technologies for deliberative AI. There has been significant research on grounding language and representations in sensory-motor data and behavior-based robotics [2], and technology for symbolic knowledge representation and logical inference is well established. Moreover, there has been considerable growth in computationally accessible knowledge, thanks to the crowdsourcing of encyclopedic knowledge and semantic web technology [3]. However, one key issue remains largely unsolved, namely how to construct the rich models on which deliberative intelligence relies. For example, how to extract from a recipe a model which is detailed enough to cook the recipe, answer questions, or come up with alternatives if ingredients are not available.

A rich model describes the problem situation and possible paths to a solution from multiple perspectives, using categories that are both understandable to humans and a solid basis for reasoning. For example, when cooking a dish from a recipe, understanding means identifying the ingredients and the food manipulations in sufficient detail to effectively cook the recipe, and possibly choosing variations if ingredients are missing, if the cooking process does not quite go the way it is described in the recipe, or if the cook wants to be creative [4]. In the case of historical research, understanding an event such as the French Revolution means constructing a model describing the key actors, their intentions and motivations, the salient events, the causal relations between these events, and the social and governmental changes they cause [5].

Understanding is the process of constructing rich models [6]. Understanding is hard because making sense of data inputs about real-world situations, whether obtained through sensing and measuring or through narrations (texts, images, movies) constructed by other agents to convey their account of events, poses non-trivial epistemological challenges. Typically the data or narrations are sparse,
fragmented, underspecified, ambiguous, sometimes contradictory and almost always uncertain.

Figure 1: Understanding is the process of constructing a rich model for deliberative intelligence from diverse, fragmented, ambiguous, uncertain, and incomplete inputs and using a variety of knowledge sources.

Our human mind counteracts these difficulties by combining contributions from sensory processing and measurement, vision and pattern recognition, language processing, ontologies, semantic memory of facts, discourse memory, action execution, mental simulation and episodic memory (see Figure 1). But each of these knowledge sources is in turn incomplete, uncertain and not necessarily reliable, so results cannot be taken at face value. Moreover, there cannot be a linear progression where one algorithm feeds into another, as is common in the pipelines of data-driven AI, because of a paradox known as the hermeneutic circle: to understand the whole we need to understand the parts, but to understand the parts we need to understand the whole [7].

AI systems that understand need to use every possible bit of information and every possible knowledge source as quickly as possible in order to arrive at the most coherent model that integrates all data and constraints. Because of the hermeneutic circle paradox, understanding typically unfolds as a spiraling process. Starting from an initial examination of some input elements (with a lot of ambiguity, uncertainty and indeterminacy), first hypotheses of the whole are constructed, which then provide top-down expectations to be tested by a more detailed examination of the same or additional input elements, leading to a clearer view of the whole, which then leads back to the examination of additional parts, and so on, until a satisfactory level of understanding, a state known as narrative closure [8], is reached.

This paper builds further on ongoing research into understanding. It does not discuss new technical advances to make understanding feasible by AI systems but focuses instead on developing measures for understanding. We want to define dynamically evolving quantities that increase (or decrease) as the understanding process unfolds to eventually reach narrative closure or exhaustion of all possible avenues. The paper is illustrated with a concrete example of understanding a recipe for preparing almond cookies, worked out by Katrien Beuls and Paul Van Eecke (for a web demo, see [9]). The example recipe goes as follows:

Recipe for almond cookies:

Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract. 4 grams almond extract. 340 grams flour. 112 grams almond flour. 29 grams powdered sugar.

Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.
3. Add the flour and the almond flour.
4. Mix thoroughly.
5. Take generous tablespoons of the dough and roll it into a small ball, about an inch in diameter, and then shape it into a crescent shape.
6. Place onto a parchment paper lined baking sheet.
7. Bake at 175 degrees Celsius for 15-20 minutes.
8. Dust with powdered sugar.

The experiment reported in this paper uses this recipe text as main input and applies language parsing, ontologies, mental simulation and discourse memory to develop a detailed model of the cooking steps. We do not elaborate the technical details of the example as developed by [9]. Neither do we consider the robotic sensori-motor system for actually performing the actions of the recipe (which would be possible along the lines of [4]), nor the visual processing of recipes, which is also an important source of information [10].
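The spiraling question-and-answer dynamics sketched above can be rendered as a simple driver loop. This is only an illustrative Python sketch (the system itself is built in Common Lisp on the Babel tools); the knowledge-source interface (`raise_questions`, `find_answers`) and every other name here are hypothetical.

```python
# Illustrative sketch of the spiraling understanding loop described above.
# All names are hypothetical; the actual system is implemented in Common Lisp.

def understand(inputs, knowledge_sources, max_rounds=10):
    """Alternate between raising questions and answering them until
    narrative closure (no open questions) or exhaustion of all avenues."""
    open_questions, answers = set(), {}
    for _ in range(max_rounds):
        progress = False
        for source in knowledge_sources:
            # Each knowledge source may raise new questions ...
            for q in source.raise_questions(inputs, answers):
                if q not in answers and q not in open_questions:
                    open_questions.add(q)
                    progress = True
            # ... and may answer some of the open ones.
            for q, a in source.find_answers(open_questions, answers):
                open_questions.discard(q)
                answers[q] = a
                progress = True
        if not open_questions:      # narrative closure reached
            return answers, True
        if not progress:            # exhaustion: no avenue left
            break
    return answers, False
```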
2. Narrative networks

As elaborated in [11], we view understanding as a spiraling dialogical process of generating and finding answers to questions. Different inputs and processing achieve four things: (i) they introduce new questions, (ii) they introduce answers to questions, (iii) they introduce and exercise constraints on the answers to questions, and (iv) they shrink the set of questions by revealing that the answers to two different questions are in fact the same.

The main question posed and answered by the Almond Cookies recipe is how to prepare almond cookies. Narrative closure is reached when all the information needed to do so has been found. The main question raises a host of other questions: what utensils are needed (a baking tray, a bowl), where can things be found or put in the kitchen (freezer, pantry), what ingredients are necessary (116 grams of sugar, 4 grams almond extract), which objects need to be prepared (a mix of flour and almond flour, a small ball of dough), which actions need to be performed (add flour, bake), and what are the properties of all these entities and actions.

We operationalize this framework as follows:

1. Questions are operationalized as variables. A variable has a name, a domain of possible values (possibly with probabilities for each value), a value, also called a binding, with an associated degree of certainty, and bookkeeping information about how the value was derived. Following AI custom, the name of a variable is written as ?variable-name, where the variable-name is a symbol that is chosen to be meaningful for us. Variable names typically have subscripts, as in ?bowl-1, ?bowl-2, ..., which are presumably to be bound to specific bowls in the kitchen while cooking a recipe.

2. Answers are operationalized in terms of entities. Entities are objects, events or (reified) concepts. They are also designated with a symbol, but now without a question mark and with angular brackets. They also have a subscript, as in ⟨bowl-1⟩ or ⟨bowl-2⟩. Entities are grounded either in real-world observational data, for example a region in an image or a segment of instrumentation data, as entities that may or may not exist in reality, or as entities in a knowledge graph, in which case we use the URI (Universal Resource Identifier) as unique identifier. Entities may have different states; for example, butter could be solid or become fluid when melted. To represent this, an entity has a persistent id and different temporal existences, marked with additional subscripts. For example, ⟨butter-1-1⟩ with the persistent id ⟨butter-1⟩ might change after heating into ⟨butter-1-2⟩, with the same persistent id but different properties.

3. Constraints are operationalized in terms of frames. In the tradition of frame-based knowledge representation originating in the mid-1970s [12], a frame is a data structure that describes the typical features of a class of objects or events in terms of a set of slots (also called roles) for entities. The slots introduce questions that should be asked about the entities belonging to the class covered by the frame. Following the common convention of object-oriented systems, one slot of a frame, called the self, designates the entity being described by the frame.

When a frame is used to describe a particular entity or set of entities, it is instantiated. Frames and instances of frames are designated by symbols with square brackets. Names of instances have indices. In the recipe example, there is for example a frame for [bowl] with slots for the bowl itself, the contents, the size, the cover, whether the bowl has been used, etc. A specific bowl entity, e.g. ⟨bowl-75⟩, is described by a frame instance, e.g. [bowl-75].¹

Figure 2: Small fragment of a narrative network built up for the Almond Recipe. Frames have square brackets and inheritance links between frames are in red. Frame instances also have square brackets but their names and their slots are in black. Entities are in green and use angular brackets. Binding relationships between variables are in double-lined green, such as between ?self-bowl-75 and ?source-37, and grounding relations are in dashed green, such as between ?self-bowl-75 and the entity ⟨bowl-75⟩.

Frames are organized in multiple inheritance hierarchies. For example, the [bowl] frame inherits from the [coverable-container] frame, which introduces a slot for the cover. This frame itself inherits from the [container] frame, which inherits from the [kitchen-entity] frame. The [bowl] frame also inherits from the [reusable] frame, which introduces a slot for whether the entity has been used (see Figure 2).

A frame also contains default values for its slots, and methods to determine a value from other values, stimulate the instantiation of other frames, or change the certainty or justification of a binding.

¹ All these indices are of course automatically constructed by the understanding system itself.
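The three representational devices just introduced (variables as questions, entities as answers, frames as constraints with inheritable slots) can be illustrated with a small sketch. This is not the actual CLOS implementation; it is a minimal Python rendering under hypothetical names, using the [bowl] inheritance hierarchy from the example.

```python
# Minimal Python sketch of the representational devices described above:
# variables (= questions), entities (= answers) and frames (= constraints).
# The actual system uses CLOS; every name here is hypothetical.

class Variable:
    """A question: ?name with a domain, an optional binding and a certainty."""
    def __init__(self, name, domain=None):
        self.name, self.domain = name, domain      # e.g. "?self-bowl-75"
        self.binding, self.certainty = None, 0.0

class Entity:
    """An answer: an entity with a persistent id and state-dependent properties."""
    def __init__(self, persistent_id, state=1, **properties):
        self.persistent_id, self.state = persistent_id, state
        self.properties = properties

class Frame:
    """A constraint: named slots, multiple inheritance, instantiation."""
    def __init__(self, name, slots=(), parents=()):
        self.name, self.parents = name, list(parents)
        self.own_slots = list(slots)

    def all_slots(self):
        # Slots of all parent frames are added as well (multiple inheritance).
        seen = []
        for parent in self.parents:
            seen += [s for s in parent.all_slots() if s not in seen]
        return seen + [s for s in self.own_slots if s not in seen]

    def instantiate(self, index):
        # Instantiating a frame turns each slot into a question (a variable).
        return {slot: Variable(f"?{slot}-{self.name}-{index}")
                for slot in self.all_slots()}

# The [bowl] frame from the example, with its inheritance links.
kitchen_entity = Frame("kitchen-entity", slots=("self",))
container = Frame("container", slots=("contents", "size"), parents=(kitchen_entity,))
coverable = Frame("coverable-container", slots=("cover",), parents=(container,))
reusable = Frame("reusable", slots=("used",), parents=(kitchen_entity,))
bowl = Frame("bowl", parents=(coverable, reusable))

bowl_75 = bowl.instantiate(75)                # frame instance [bowl-75]
bowl_75["self"].binding = Entity("bowl-75")   # grounded in a bowl entity
```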
The methods associated with frames are activated either by explicitly calling them by name (call by name) or by checking which slots already have values and then triggering the appropriate method (pattern-directed invocation). Frames are symbolic data structures that are matched and merged using unification operators. They can be extracted from large frame inventories such as FrameNet, WordNet or PropBank [13], or they can be learned, either from examples using anti-unification and pro-unification operators or through hypothesize-and-test strategies. For the present example, all frames have been designed by hand.

Frame instances, variables, entities and the links between them form a graph called a narrative network (see Figure 2). Narrative networks quickly get very large, having hundreds of nodes and links even for a short text. The experiment reported here uses a range of AI programming tools for the implementation of frames and narrative networks, based on the standard Common Lisp Object System (CLOS) [14]: the constraint propagation system IRL [15], the Babel architecture for organizing the overall understanding process in terms of tasks [16], and Fluid Construction Grammar [17, 18] for linguistic processing.

3. Knowledge Sources

The understanding process must rely on a wide variety of knowledge sources in order to come up with questions and answers. In the experiment reported here, we focus only on contributions from ontologies, language (lexicon and grammar), discourse memory and mental simulation.

1. An ontology defines the inventory of available frames for describing objects, events, actions and their properties. These frames contribute to the construction of the narrative network by introducing questions for their slots. The slots often have initial or default values, in which case the questions they pose can also be (tentatively) answered. Because frames inherit from one or more other frames, all slots of these parent frames are added as well.

For instance, given the example sentence 'Beat the butter and the sugar together until light and fluffy' (sentence 1 in the instructions of the recipe), lexical processing of the verb 'beat' would find the beat-frame. Consultation of the ontology introduces questions (i.e. variables) from the slots of this frame, namely what tool should be used to beat (by default a whisk), the initial and final kitchen states respectively before and after beating, what container contains the material to be beaten, what the state of this container is after beating, when the beating should stop, and more.

2. After tokenization, lemmatization and part-of-speech tagging, lexical processing performs a mapping from lexical stems to frames, because stems act as frame-invoking elements. These frames are then instantiated and their various slots added as variables to the narrative network under construction.

Grammatical processing can invoke additional frames, for example related to tense, aspect, mood and modality, but, more importantly, it can also link parts of the narrative network together, which means that the variables introduced by separate frame instances are made co-referential. For example, 'Beat the butter and the sugar together' is an example of a resultative construction where the goal of the action is to fuse two substances, butter and sugar, such that they become one. Thanks to this construction we know that the answer to the question 'what should be beaten' is equal to the answers to the questions 'what butter amount is to be used' and 'what sugar amount is to be used'.

3. Mental simulation imagines the sequence of actions over time and records what consequences their execution has on the various objects involved in the action. Mental simulation can either take the form of physical simulation, for example with realistic computer graphics engines, or of qualitative simulation [19]. The present experiment uses only qualitative simulation, implemented through pattern-directed methods associated with frames. These methods become active when some variables have already been bound, and they compute the values of other variables. They also create additional objects and instantiate more frames that are linked into the network.

4. Discourse memory contains information about the way a narrative unfolds. For example, it is well known from the study of pragmatics in linguistics that languages contain various cues that bring entities into the attention span of the listener, so that they suggest referents for pronouns or underspecified descriptions [20]. The present experiment uses only a rudimentary form of discourse memory, namely one which marks entities that have been mentioned directly or indirectly as accessible entities, which can then be referred to by pronouns or general descriptions (such as 'the butter').
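The pattern-directed methods used for qualitative simulation can be sketched as follows. This is a hypothetical Python rendering, not the Common Lisp implementation: a method attached to a frame fires once its trigger slots are bound, derives values for further slots, and firing repeats until no method can apply anymore.

```python
# Sketch of pattern-directed invocation as used for qualitative simulation:
# a method fires once its trigger slots are bound and then computes values
# for further slots. All names below are hypothetical.

class PatternDirectedMethod:
    def __init__(self, triggers, computes, fn):
        self.triggers = triggers    # slots that must already be bound
        self.computes = computes    # slots whose values the method derives
        self.fn = fn                # derivation from current bindings

    def try_fire(self, bindings):
        if all(s in bindings for s in self.triggers) and \
           not all(s in bindings for s in self.computes):
            bindings.update(self.fn(bindings))
            return True
        return False

def simulate(bindings, methods):
    """Keep firing applicable methods until quiescence (a fixpoint)."""
    while any(m.try_fire(bindings) for m in methods):
        pass
    return bindings

# Toy qualitative rule for the beat-frame: beating butter and sugar yields
# one homogeneous mixture and changes the state of the container.
beat_rule = PatternDirectedMethod(
    triggers=("beaten-1", "beaten-2", "container"),
    computes=("result", "container-state"),
    fn=lambda b: {"result": f"mixture-of-{b['beaten-1']}-and-{b['beaten-2']}",
                  "container-state": "light-and-fluffy"})

state = simulate({"beaten-1": "butter", "beaten-2": "sugar",
                  "container": "bowl-75"}, [beat_rule])
```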
4. Measuring progress in understanding

There are many possible ways to measure the progress and quality of understanding. Here are a few examples: coverage - how much of the input is handled; closure - how many open questions are left; fragmentation - how many unconnected subgraphs remain; ambiguity - how many choice points could not be resolved; uncertainty - how much uncertainty is left globally; dissonance - how much of the outcome is incompatible with the frames in the ontology; anchorage - how many non-grounded entities are left.

In this paper we focus only on the increase and decrease in the number of questions that pop up during understanding and the increase in the number of answers that are found. Both the questions and the answers come from different knowledge sources, but we can measure their contributions separately.

To collect data during understanding we use a meta-level facility available in the Babel architecture [16], which allows for the definition of monitors that become active when a triggering condition is detected, for example the addition of a new node or link to the narrative network. The monitors then collect relevant information by observing the state of understanding at that point, including which knowledge source was responsible.

The first experiment considers only a subpart of the recipe, namely the first four ingredients and the first two instructions:

Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract. 4 grams almond extract.
Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.

The graphs display absolute values for both the number of questions and the number of answers. The graph on the left of Figure 3 decomposes the contributions by the different knowledge sources with respect to questions, and the graph next to it decomposes them for answers. At the bottom of the graphs we see the names of the frames or linguistic constructions that made the contribution.

There is a total of 165 questions being posed for this first part of the recipe. Before parsing the first sentence, a complete kitchen state with a baking tray, bowls, ingredients stored in the refrigerator or pantry, etc. is instantiated. The ontology raises the first set of questions and the mental simulation starts to provide the first answers. Parsing of '226 grams butter, room temperature' and consultation of the ontology for the frames triggered by the words in this phrase start raising questions such as what bowl is to be used, what material has to be put in, what is the quantity and unit of measurement, at what temperature does the material have to be, etc. Some of these questions (for example the quantity and measurement unit) are directly answerable from the linguistic input, others require mental simulation, and some are obtained from the ontology. After each set of parsing steps we see a jump in available answers because mental simulation is carried out after each sentence. The discourse model also gets updated and is used to answer some of the questions later on. The discourse model also keeps raising its own questions, namely about what to do with elements that have been introduced but not yet used in the cooking process.

The second experiment (see Figure 4) considers the complete almond cookies recipe and now scales the values for questions and answers with respect to the total number. Values are scaled to become comparable to other cases of understanding. For the complete recipe there is a total of 337 questions (159 triggered by language, 37 by the discourse model and 141 by the ontology). There are 284 answers (77 from language, 25 from the discourse model, 80 from mental simulation and 102 from the ontology). All knowledge sources play an important role. There are remaining questions at the end because the recipe contains no cleaning-up activity, so these questions are about what to do with the bowls that were used. Narrative closure is reached because the baking tray contains the desired almond cookies.

We see in these examples that ontologies and mental simulation of cooking actions play important roles in addition to language. There are still other knowledge sources that have not been incorporated, covering steps that are not explicitly mentioned in language but known from common sense. The most obvious one is to take the baking tray out of the oven, let the cookies cool off and put them in a bowl for later storage or immediate consumption.
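The meta-level monitors just described can be approximated with a few lines of code. This is a hypothetical Python sketch, not the Babel facility itself: a monitor subscribes to a triggering condition and counts, per knowledge source, the events it observes on the narrative network.

```python
# Sketch of the meta-level monitoring idea: a monitor fires on events such as
# the addition of a question or answer to the narrative network and records
# which knowledge source was responsible. Names are hypothetical; the actual
# facility is part of the Babel architecture.

from collections import Counter

class Monitor:
    def __init__(self, trigger):
        self.trigger = trigger                 # e.g. "question-added"
        self.counts = Counter()                # contributions per source

    def notify(self, event, source):
        if event == self.trigger:
            self.counts[source] += 1

class NarrativeNetwork:
    def __init__(self, monitors=()):
        self.monitors = list(monitors)

    def record(self, event, source):
        for m in self.monitors:
            m.notify(event, source)

questions = Monitor("question-added")
answers = Monitor("answer-found")
net = NarrativeNetwork([questions, answers])

# Replay of (event, source) pairs as they might occur during understanding.
for event, source in [("question-added", "ontology"),
                      ("question-added", "language"),
                      ("answer-found", "simulation"),
                      ("question-added", "ontology"),
                      ("answer-found", "ontology")]:
    net.record(event, source)
```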
Figure 3: Fine-grained, unscaled results of the understanding process for part of the recipe. Left: total number of questions with decomposition of question contributions. Right: total number of answers with decomposition of answer contributions. The y-axis maps to specific processing events, namely the application of constructions or the interpretation of the meaning obtained by parsing a phrase. The bars show questions posed and answers obtained, respectively. They are decomposed into sections, with blue sections contributed by language processing, orange ones by mental simulation, green ones by consultations of the discourse model and red ones by the ontology.

Figure 4: Coarse-grained, scaled results of the understanding process for the complete recipe, with decomposition of answer contributions (left) and question contributions (right).

5. Conclusions

We defined understanding as the construction of a rich model of a problem situation based on fragmented, incomplete, uncertain and underspecified sources. We explored a way to measure one central aspect of the understanding process, namely tracking the addition, reduction or answering of questions by different knowledge sources. More concretely, we focused on the use of ontologies, language, discourse models and mental simulation. This work is just one tiny step in building a quantitative infrastructure for tracking and evaluating understanding in AI systems. Having quantitative measures is useful to pin down precisely the contribution of a particular knowledge source, or to provide feedback to the attention mechanism that guides which knowledge sources should preferentially be used or which areas of a narrative network should be the focus of attention. Quantitative measures will also play a role as a feedback signal for improving the efficiency and efficacy of understanding.

Acknowledgments

This paper was funded by the EU Pathfinder project MUHAI and the authors thank the host laboratories for this work: the Venice International University (LS), the VUB AI Lab (LV) and the Sony Computer Science Laboratories Paris (RvT). We thank Paul Van Eecke and Katrien Beuls from the VUB AI Laboratory for laying the basis for the almond cookies case study used here. The experiment is part of the EU Pathfinder project MUHAI on Meaning and Understanding in Human-Centric AI. Lara Verheyen is funded by the 'Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen' of the Flemish Government.

References

[1] A. Nowak, P. Lukowicz, P. Horodecki, Assessing artificial intelligence for humanity: Will AI be our biggest ever advance? Or the biggest threat?, IEEE Technology and Society Magazine 37(4) (2018) 26–34.
[2] L. Steels, M. Hild (Eds.), Language Grounding in Robots, Springer Verlag, New York, 2012.
[3] G. Antoniou, F. van Harmelen, A Semantic Web Primer, The MIT Press, Cambridge, MA, 2008.
[4] M. Beetz, D. Jain, L. Mösenlechner, M. Tenorth, L. Kunze, N. Blodow, D. Pangercic, Cognition-enabled autonomous robot control for the realization of home chore task intelligence, Proceedings of the IEEE 100 (2012) 2454–2471.
[5] R. van Trijp, I. Blin, Narratives in historical sciences, in: L. Steels (Ed.), Foundations for Incorporating Meaning and Understanding in Human-centric AI, MUHAI consortium, 2022.
[6] L. Steels, Conceptual foundations for human-centric AI, in: M. Chetouani, V. Dignum, P. Lukowicz, C. Sierra (Eds.), Advanced Course on Human-Centered AI (ACAI 2021), LNAI Tutorial Lecture Series, Springer Verlag, Berlin, 2022.
[7] H.-G. Gadamer, Hermeneutics and social science, Cultural Hermeneutics 2(4) (1975) 307–316.
[8] N. Carroll, Narrative closure, Philosophical Studies 135 (2007) 1–15.
[9] K. Beuls, P. Van Eecke, Understanding and executing recipes expressed in natural language, 2022. Web demonstration at https://ehai.ai.vub.ac.be/demos/recipe-understanding/.
[10] J. Marin, A. Biswas, F. Ofli, N. Hynes, A. Salvador, Y. Aytar, I. Weber, A. Torralba, Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[11] L. Steels, A framework for understanding, in preparation (2022).
[12] M. Minsky, A framework for representing knowledge, in: P. H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975, pp. 211–277.
[13] M. Palmer, P. Kingsbury, D. Gildea, The Proposition Bank: An annotated corpus of semantic roles, Computational Linguistics 31(1) (2005) 71–106.
[14] G. Kiczales, J. des Rivières, D. Bobrow, The Art of the Metaobject Protocol, The MIT Press, Cambridge, MA, 1991.
[15] M. Spranger, S. Pauw, M. Loetzsch, L. Steels, Open-ended procedural semantics, in: L. Steels, M. Hild (Eds.), Language Grounding in Robots, Springer Verlag, New York, 2012, pp. 159–178.
[16] L. Steels, M. Loetzsch, Babel: A tool for running experiments on the evolution of language, in: S. Nolfi, M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents, Springer Verlag, New York, 2010, pp. 307–313.
[17] L. Steels (Ed.), Computational Issues in Fluid Construction Grammar, volume 7249 of Lecture Notes in Computer Science, Springer Verlag, Berlin, 2012.
[18] L. Steels, Basics of Fluid Construction Grammar, Constructions and Frames 9(2) (2017) 178–225.
[19] B. Kuipers, Qualitative simulation, Artificial Intelligence 29 (1986) 289–338.
[20] K. von Heusinger, P. Schumacher, Discourse prominence: Definition and application, Journal of Pragmatics 154 (2019) 117–127.