
An Experiment in Measuring Understanding

Luc Steels

Lara Verheyen

Remi van Trijp

Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
Barcelona Supercomputing Center, Barcelona, Spain
SONY Computer Science Laboratories, 6, rue Amyot, 75005 Paris, France

IJCAI 2022: Workshop on semantic techniques for narrative-based understanding, July 24, 2022, Vienna, Austria
steels@arti.vub.ac.be (L. Steels); lara.verheyen@ai.vub.ac.be (L. Verheyen); remi.vantrijp@sony.com (R. v. Trijp)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Human-centric AI requires not only data-driven pattern recognition methods but also reasoning. Reasoning requires rich models and we call the process of coming up with these models understanding. Understanding is hard because in real world problem situations, the input for making a model is often fragmented, underspecified, ambiguous and uncertain, and many sources of knowledge are required, including vision and pattern recognition, language parsing, ontologies, knowledge graphs, discourse models, mental simulation, real world action and episodic memory. This paper reports on a way to measure progress in understanding. We frame the problem of understanding in terms of a process of generating questions, reducing questions, and finding answers to questions. We show how meta-level monitors can collect information so that we can quantitatively track the advances in understanding. The paper is illustrated with an implemented system that combines knowledge from language, ontologies, mental simulation and discourse memory to understand a cooking recipe phrased in natural language (English).

1. Introduction

The current wave of data-driven AI almost exclusively employs reactive intelligence but deliberative AI, which was the core of knowledge-based systems in the 1970s and 1980s, is nevertheless needed to achieve some of the properties argued to be central to human-centric AI, such as (i) providing explanations comprehensible for humans, (ii) dealing with outliers, (iii) learning by being told, (iv) being verifiable and (v) seamlessly cooperating with humans [1].

Using deliberative AI and integrating it with reactive AI is a realistic target today because reactive AI has advanced significantly to be usable in real world applications and there is already a large number of methods and technologies for deliberative AI from past decades of AI research. There has been significant research on grounding language and representations in sensory-motor data and behavior-based robotics [2] and technology for symbolic knowledge representation and logical inference is well established. Moreover, there has been a considerable growth in computationally accessible knowledge, thanks to the crowdsourcing of encyclopedic knowledge and semantic web technology [3]. However, there is one key issue which remains largely unsolved, namely how to construct the rich models on which deliberative intelligence relies. For example, how to extract from a recipe a model which is detailed enough to cook the recipe, answer questions, or come up with alternatives if ingredients are not available.

A rich model describes the problem situation and possible paths to a solution from multiple perspectives using categories that are both understandable to humans and a solid basis to support reasoning. For example, when cooking a dish from a recipe, understanding means to identify the ingredients and the food manipulations in sufficient detail to effectively cook the recipe and possibly choose variations if ingredients are missing, the cooking process does not quite go the way it is described in the recipe, or the cook wants to be creative [4]. In the case of historical research, understanding an event such as the French revolution means to construct a model describing the key actors, their intentions and motivations, the salient events, the causal relations between these events and the social and governmental changes they cause [5].

Understanding is the process of constructing rich models [6]. Understanding is hard because making sense of data inputs about real world situations, either obtained through sensing or measuring or through narrations (texts, images, movies) constructed by other agents to convey their account of events, poses non-trivial epistemological challenges. Typically the data or narrations are sparse, fragmented, underspecified, ambiguous, sometimes contradictory and almost always uncertain.

Our human mind counteracts these difficulties by combining contributions from sensory processing and measurement, vision and pattern recognition, language processing, ontologies, semantic memory of facts, discourse memory, action execution, mental simulation and episodic memory (see Figure 1). But each of these knowledge sources is in turn incomplete, uncertain and not necessarily reliable as well, so results cannot be taken at face value. Moreover, there cannot be a linear progression where one algorithm feeds into another, as is common in the pipelines of data-driven AI, because of a paradox known as the hermeneutic circle: To understand the whole we need to understand the parts but to understand the parts we need to understand the whole [7].

AI systems that understand need to use every possible bit of information and every possible knowledge source as quickly as possible in order to arrive at the most coherent model that integrates all data and constraints. Because of the hermeneutic circle paradox, understanding typically unfolds as a spiraling process. Starting from an initial examination of some input elements (with a lot of ambiguity, uncertainty and indeterminacy) the first hypotheses of the whole are constructed, which then provide top-down expectations to be tested by a more detailed examination of the same or additional input elements, leading to a clearer view of the whole, which then leads back to the examination of additional parts, etc., until a satisfactory level of understanding, a state known as narrative closure [8], is reached.

This paper builds further on ongoing research into understanding. It does not discuss new technical advances to make understanding feasible by AI systems but focuses instead on developing measures for understanding. We want to define dynamically evolving quantities that are increasing (or decreasing) as the understanding process unfolds to eventually reach narrative closure or exhaustion of all possible avenues. The paper is illustrated with a concrete example from understanding a recipe for preparing almond cookies worked out by Katrien Beuls and Paul Van Eecke (for a webdemo, see [9]).

The example recipe goes as follows:

Recipe for almond cookies:

Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract. 4 grams almond extract. 340 grams flour. 112 grams almond flour. 29 grams powdered sugar.

Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.
3. Add the flour and the almond flour.
4. Mix thoroughly.
5. Take generous tablespoons of the dough and roll it into a small ball, about an inch in diameter, and then shape it into a crescent shape.
6. Place onto a parchment paper lined baking sheet.
7. Bake at 175 degrees Celsius for 15 - 20 minutes.
8. Dust with powdered sugar.

The experiment reported in this paper uses this recipe text as main input and applies language parsing, ontologies, mental simulation and discourse memory to develop a detailed model of the cooking steps. We do not elaborate the technical details of the example as developed by [9]. Neither do we consider the robotic sensori-motor system for actually performing the actions of the recipe (which would be possible along the lines of [4]) nor consider visual processing of recipes which is also an important source of information [10].

2. Narrative networks

As elaborated in [11] we view understanding as a spiraling dialogical process of generating and finding answers to questions. Different inputs and processing achieve four things: (i) They introduce new questions, (ii) introduce answers to questions, (iii) introduce and exercise constraints on the answers of questions, and (iv) shrink the set of questions by realizing that the answers to two different questions are in fact the same.

The main question posed and answered by the Almond Cookies recipe is how to prepare almond cookies. Narrative closure is reached when all the information is found in order to do so. The main question raises a host of other questions: what utensils are needed (a baking tray, a bowl), where can things be found or put in the kitchen (freezer, pantry), what ingredients are necessary (116 grams of sugar, 4 grams almond extract), which objects need to be prepared (a mix of flour and almond flour, a small ball of dough), which actions need to be performed (add flour, bake), and properties of all these entities and actions.

We operationalize this framework as follows:

1. Questions are operationalized as variables. A variable has a name, a domain of possible values (possibly with probabilities for each value), a value, also called a binding, with an associated degree of certainty, and bookkeeping information about how the value was derived. Following AI custom, the name of a variable is written as ?variable-name where the variable-name is a symbol that is chosen to be meaningful for us. Variable-names typically have subscripts, as in ?bowl-1, ?bowl-2, ... , which are presumably to be bound to specific bowls in the kitchen while cooking a recipe.

2. Answers are operationalized in terms of entities. Entities are objects, events or (reified) concepts. They are also designated with a symbol, but now without a question mark and with angular brackets. They also have a subscript, as in <butter-331> or <bowl-710>. Entities are grounded either in real world observational data, for example a region in an image or a segment of instrumentation data, as entities that may or may not exist in reality, or as entities in a knowledge graph in which case we use the URI (Uniform Resource Identifier) as unique identifier. Entities may have different states, for example butter could be solid or become fluid when melted. To represent this, an entity has a persistent id and different temporal existences, marked with additional subscripts. For example, <butter-331-1> with the persistent id <butter-331> might change after heating into <butter-331-2> with the same persistent id but different properties.

3. Constraints are operationalized in terms of frames. In the tradition of frame-based knowledge representation originating in the mid-1970s [12], a frame is a data structure that describes the typical features of a class of objects or events in terms of a set of slots (also called roles) for entities. The slots introduce questions that should be asked about the entities belonging to the class covered by the frame. Following the common convention of object oriented systems, one slot of a frame, called the self, designates the entity being described by the frame.

When a frame is used to describe a particular entity or set of entities it is instantiated. Frames and instances of frames are designated by symbols with square brackets. Names of instances have indices. In the recipe example, there is for example a frame for [bowl] with slots for the bowl itself, the contents, the size, the cover, whether the bowl has been used, etc. A specific bowl entity, e.g. <bowl-75>, is described by a frame instance, e.g. [bowl-75] (all these indices are of course automatically constructed by the understanding system itself).

Frames are organized in multiple inheritance hierarchies. For example, the [bowl] frame inherits from the [coverable-container] frame, which introduces a slot for the cover. This frame itself inherits from the [container] frame which inherits from the [kitchen-entity] frame. The [bowl] frame also inherits from the [reusable] frame, which introduces a slot whether the entity has been used (see Figure 2).

Figure 2: Small fragment of a narrative network built up for the Almond Recipe. Frames have square brackets and inheritance links between frames are in red. Frame instances also have square brackets but their names and their slots are in black. Entities are in green and use angular brackets. Binding relationships between variables are in double lined green, such as between ?self-bowl-75 and ?source-37, and grounding relations are in dashed green, such as between ?self-bowl-75 and the entity <bowl-75>.

A frame also contains default values for its slots and methods to determine a value from other values, stimulate the instantiation of other frames, or change the certainty or justification of a binding. The methods associated with frames are activated either by explicitly calling them using a name (call by name) or by checking which slots already have values and then triggering the appropriate method (pattern-directed invocation). Frames are symbolic datastructures that are matched and merged using unification operators. They can be extracted from large frame inventories such as FrameNet, Wordnet or Propbank [13], or they can be learned, either from examples using anti-unification and pro-unification operators or through hypothesize-and-test strategies. For the present example, all frames have been designed by hand.

Frame-instances, variables, entities and links between them form a graph called a narrative network (see Figure 2). Narrative networks quickly get very large, having hundreds of nodes and links, even for a short text. The experiment reported here uses a range of AI programming tools for the implementation of frames and narrative networks, based on the standard Common Lisp Object system (CLOS) [14]: the constraint propagation system IRL [15], the BABEL architecture for organizing the overall understanding process in terms of tasks [16] and Fluid Construction Grammar [17, 18] for linguistic processing.

3. Knowledge Sources

The understanding process must rely on a wide variety of knowledge sources in order to come up with questions and answers. In the experiment reported here, we only focus on contributions from ontologies, language (lexicon & grammar), discourse memory and mental simulation.

1. An ontology defines the inventory of available frames for describing objects, events, actions and properties of these. These frames contribute to the construction of the narrative network by introducing questions for their slots. The slots often have initial or default values in which case the questions they pose can also be (tentatively) answered. Because frames inherit from one or more other frames, all slots of these parent frames are added as well.

For instance, given the example sentence ‘Beat the butter and the sugar together until light and fluffy’ (sentence 1 in the instructions of the recipe), lexical processing of the verb ‘beat’ would find the beat-frame. Consultation of the ontology introduces questions (i.e. variables) from the slots of this frame, namely what tool should be used to beat (by default a whisk), the initial and final kitchen state respectively before and after beating, what container contains the material to be beaten, what the state of this container is after beating, when the beating should stop, and more.

2. After tokenization, lemmatization and part of speech tagging, lexical processing performs a mapping from lexical stems to frames, because stems act as frame invoking elements. These frames are then instantiated and their various slots added as variables to the narrative network under construction.

Grammatical processing can invoke additional frames, for example related to tense, aspect, mood and modality, but, more importantly, it can also link parts of the narrative network together, which means that the variables introduced by separate frame-instances are made co-referential. For example: ‘Beat the butter and the sugar together’ is an example of a resultative construction where the goal of the action is to fuse two substances, butter and sugar, such that they become one. Thanks to this construction we know that the answer to the question ‘what should be beaten’ is equal to the answers to the questions ‘what butter amount is to be used’ and ‘what sugar amount is to be used’.

3. Mental simulation imagines the sequence of actions over time and records what consequences their execution has on the various objects involved in the action. Mental simulation can either take the form of physical simulation, for example with realistic computer graphics engines, or qualitative simulation [19]. In this experiment we only look at qualitative simulation, which is implemented through pattern-directed methods associated with frames. These methods become active when some variables have already been bound and compute the values of other variables. They also create additional objects and instantiate more frames that are linked into the network.

4. Discourse memory contains information about the way a narrative unfolds. For example, it is well known from the study of pragmatics in linguistics that languages contain various cues that bring entities into the attention span of the listener so that they suggest referents for pronouns or underspecified descriptions [20]. The present experiment uses only a rudimentary example of discourse memory, namely one which marks entities which have been mentioned directly or indirectly as being accessible entities which can then be referred to by pronouns or general descriptions (such as ‘the butter’).

4. Measuring progress in understanding

There are many possible ways to measure the progress and quality of understanding. Here are a few examples: coverage: how much of the input is handled; closure: how many open questions are left; fragmentation: how many unconnected sub-graphs remain; ambiguity: how many choice points could not be resolved; uncertainty: how much uncertainty is left globally; dissonance: how much of the outcome is incompatible with the frames in the ontology; anchorage: how many non-grounded entities are left.

In this paper we only focus on the increase and decrease in the number of questions that pop up during understanding and the increase in the number of answers that are found. Both the questions and the answers are coming from different knowledge sources but we can measure their contributions separately.

To collect data during understanding we use a meta-level facility available in the BABEL architecture [16] which allows for the definition of monitors that become active when a triggering condition, for example the addition of a new node or link to the narrative network, is detected. The monitors then collect relevant information by observing the state of understanding at that point, including which knowledge source was responsible.

The first experiment considers only a subpart of the recipe, namely the first four ingredients and the first two instructions:

Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract

Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.

The graphs display absolute values both for the number of questions and the number of answers. The graph on the left of Figure 3 decomposes the contributions by the different knowledge sources with respect to questions and the graph next to it decomposes them for answers. At the bottom of the graphs we see the names of the frames or linguistic constructions that made the contribution.

There is a total of 165 questions being posed for this first part of the recipe. Before parsing the first sentence a complete kitchen-state with a baking tray, bowls, ingredients stored in the refrigerator or pantry, etc. is instantiated. The ontology raises the first set of questions and the mental simulation starts to provide the first answers. Parsing of ‘226 grams butter, room temperature’ and consultations of the ontology for the frames triggered by the words in this phrase start triggering questions such as what bowl is to be used, what material has to be put in, what is the quantity and unit of measurement, at what temperature does the material have to be, etc. Some of these questions (for example the quantity and measurement unit) are directly answerable from the linguistic input, others require mental simulation and some are obtained from the ontology. After each set of parsing steps we see a jump in available answers because mental simulation is carried out after each sentence. Also the discourse model gets updated and is used to answer some of the questions later on. The discourse model also keeps raising its own questions, namely about what to do with elements that have been introduced but not yet used in the cooking process.

The second experiment (see Figure 4) considers the complete almond cookie recipe and now scales values for questions and answers with respect to the total number. Values are scaled to become comparable to other cases of understanding. For the complete recipe there is a total of 337 questions (159 triggered by language, 37 by the discourse model and 141 by the ontology). There are 284 answers (77 from language, 25 from the discourse model, 80 from mental simulation and 102 from the ontology). All knowledge sources play an important role. Some questions remain at the end because the recipe contains no cleaning-up activity, so these questions are about what to do with the bowls that were used. Narrative closure is reached because the baking-tray contains the desired almond cookies.

We see in these examples that ontologies and mental simulation of cooking actions play important roles in addition to language. There are still other knowledge sources that have not been incorporated and are not explicitly mentioned in language but known from common sense. The most obvious one is to take the baking tray out of the oven, let the cookies cool off and put them in a bowl for later storage or immediate consumption.

5. Conclusions

We defined understanding as the construction of a rich model of a problem situation based on fragmented, incomplete, uncertain and underspecified sources. We explored a way to measure one central aspect of the understanding process, namely tracking the addition, reduction or answering of questions by different knowledge sources. More concretely, we focused on the use of ontologies, language, discourse models and mental simulation. This work is just one tiny step in building a quantitative infrastructure for tracking and evaluating understanding in AI systems. Having quantitative measures is useful to pin down precisely the contribution of a particular knowledge source or to provide feedback to the attention mechanism that guides what knowledge sources should preferentially be used or what areas of a narrative network should be the focus of attention. Quantitative measures also will play a role as feedback signal for improving the efficiency and efficacy of understanding.

Acknowledgments

This paper was funded by the EU-Pathfinder Project MUHAI and the authors thank the host laboratories for this work: the Venice International University (LS), the VUB AI lab (LV) and the Sony Computer Science Laboratories Paris (RvT). We thank Paul Van Eecke and Katrien Beuls from the VUB AI Laboratory for laying the basis for the almond cooking case study used here. The experiment is part of the EU Pathfinder project MUHAI on Meaning and Understanding in Human-Centric AI. Lara Verheyen is funded by the ‘Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen’ of the Flemish Government.
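As a rough illustration of the bookkeeping described in Sections 2 and 4, the following Python sketch treats questions as variables, answers as entity symbols and frames as slot inventories with multiple inheritance, and lets a monitor count questions and answers per knowledge source. It is not the authors' implementation, which builds on CLOS, IRL, BABEL and Fluid Construction Grammar. The class names (`Variable`, `Frame`, `NarrativeNetwork`), the event labels, and the assignment of the `self`, `contents` and `size` slots to particular frames are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Variable:
    """A question: a named variable that may later be bound to an entity."""
    name: str                          # e.g. "?self-bowl-75"
    raised_by: str                     # knowledge source that posed the question
    binding: Optional[str] = None      # entity symbol, e.g. "<bowl-75>"
    certainty: float = 0.0
    answered_by: Optional[str] = None  # knowledge source that supplied the answer


class Frame:
    """A frame with its own slots and any number of parent frames."""

    def __init__(self, name: str, slots: Sequence[str],
                 parents: Sequence["Frame"] = ()):
        self.name = name
        self.own_slots = list(slots)
        self.parents = list(parents)

    def all_slots(self) -> List[str]:
        # Own slots first, then inherited slots, skipping duplicates.
        slots = list(self.own_slots)
        for parent in self.parents:
            for slot in parent.all_slots():
                if slot not in slots:
                    slots.append(slot)
        return slots


Monitor = Callable[[str, str], None]  # called with (event, knowledge source)


class NarrativeNetwork:
    def __init__(self) -> None:
        self.variables: Dict[str, Variable] = {}
        self.monitors: List[Monitor] = []

    def _notify(self, event: str, source: str) -> None:
        for monitor in self.monitors:
            monitor(event, source)

    def instantiate(self, frame: Frame, index: int, source: str) -> None:
        # Instantiating a frame raises one question (variable) per slot.
        for slot in frame.all_slots():
            name = f"?{slot}-{frame.name}-{index}"
            self.variables[name] = Variable(name, raised_by=source)
            self._notify("question", source)

    def answer(self, name: str, entity: str, source: str,
               certainty: float = 1.0) -> None:
        variable = self.variables[name]
        variable.binding = entity
        variable.certainty = certainty
        variable.answered_by = source
        self._notify("answer", source)

    def open_questions(self) -> List[str]:
        return [v.name for v in self.variables.values() if v.binding is None]


# Frame hierarchy from Section 2; where "self", "contents" and "size"
# are declared is an assumption of this sketch.
kitchen_entity = Frame("kitchen-entity", ["self"])
container = Frame("container", ["contents"], [kitchen_entity])
coverable_container = Frame("coverable-container", ["cover"], [container])
reusable = Frame("reusable", ["used"])
bowl = Frame("bowl", ["size"], [coverable_container, reusable])

counts: Counter = Counter()
network = NarrativeNetwork()
network.monitors.append(lambda event, source: counts.update([(event, source)]))

network.instantiate(bowl, 75, source="ontology")
network.answer("?self-bowl-75", "<bowl-75>", source="mental-simulation")
```

In this sketch, instantiating [bowl] raises five questions attributed to the ontology (one per own or inherited slot), and binding ?self-bowl-75 to <bowl-75> records one answer attributed to mental simulation, leaving four open questions.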

[1] Nowak, Lukowicz, Horodecki, Assessing artificial intelligence for humanity: Will AI be the our biggest ever advance? or the biggest threat, IEEE Technology and Society Magazine 37 (4) (2018) 26-34.

[2] Steels, M. Hild (Eds.), Language grounding in robots, Springer Verlag, New York, 2012.

[3] Antoniou, Van Harmelen, A semantic Web Primer, The MIT Press, Cambridge Ma, 2008.

[4] Beetz, Jain, L. Mösenlechner, M. Tenorth, L. Kunze, N. Blodow, D. Pangercic, Cognition-enabled autonomous robot control for the realization of home chore task intelligence, Proceedings of the IEEE 100 (2012) 2454-2471.

[5] R. van Trijp, I. Blin, Narratives in historical sciences, in: L. Steels (Ed.), Foundations for Incorporating Meaning and Understanding in Human-centric AI, MUHAI consortium, 2022.

[6] Steels, Conceptual foundations for human-centric AI, in: M. Chetouani, V. Dignum, P. Lukowicz, C. Sierra (Eds.), Advanced course on Human-Centered AI. ACAI 2021, volume LNAI Tutorial Lecture Series, Springer Verlag, Berlin, 2022.

[7] H.-G. Gadamer, Hermeneutics and social science, Cultural Hermeneutics 2 (4) (1975) 207-316.

[8] Carroll, Narrative closure, Philosophical Studies 135 (2007) 1-15.

[9] Beuls, P. Van Eecke, Understanding and executing recipes expressed in natural language, 2022. Web demonstration at https://ehai.ai.vub.ac.be/demos/recipe-understanding/.

[10] Marin, Biswas, Ofli, Hynes, Salvador, Aytar, I. Weber, A. Torralba, Recipe1m+: A dataset for learning cross-modal embeddings for cooking recipes and food images, IEEE Transactions on pattern analysis and machine intelligence (2021).

[11] Steels, A framework for understanding, in preparation (2022).

[12] Minsky, A framework for representing knowledge, in: P. H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975, pp. 211-277.

[13] Palmer, Kingsbury, Gildea, The proposition bank: An annotated corpus of semantic roles, Computational Linguistics 31 (1) (2005) 71-106.

[14] Kiczales, J. des Rivieres, D. Bobrow, The Art of the Metaobject Protocol, The MIT Press, Cambridge Ma, 1991.

[15] Spranger, Pauw, Loetzsch, Steels, Open-ended procedural semantics, in: L. Steels, M. Hild (Eds.), Language Grounding in Robots, Springer-Verlag, New York, 2012, pp. 159-178.

[16] Steels, Loetzsch, Babel: A tool for running experiments on the evolution of language, in: S. Nolfi, M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents, Springer Verlag, New York, 2010, pp. 307-313.

[17] L. Steels (Ed.), Computational Issues in Fluid Construction Grammar, volume 7249 of Lecture Notes in Computer Science, Springer Verlag, Berlin, 2012.

[18] Steels, Basics of fluid construction grammar, Constructions and Frames 2 (2017) 178-225.

[19] B. Kuipers, Qualitative simulation, Artificial Intelligence 3 (1986) 289-338.

[20] K. von Heusinger, P. Schumacher, Discourse prominence: Definition and application, Journal of Pragmatics (2010) 117-127.
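The totals reported in Section 4 for the complete recipe can be scaled into per-source shares along the following lines. This is only one plausible reading of "scaling with respect to the total number"; the paper does not spell out the exact normalization, and the helper name `shares` is hypothetical.

```python
# Totals reported in Section 4 for the complete recipe.
questions = {"language": 159, "discourse model": 37, "ontology": 141}
answers = {"language": 77, "discourse model": 25,
           "mental simulation": 80, "ontology": 102}


def shares(counts):
    """Scale per-source counts to fractions of their total (assumed scaling)."""
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}


total_questions = sum(questions.values())  # 337, as reported
total_answers = sum(answers.values())      # 284, as reported

# Print each knowledge source's share of the questions and of the answers.
for label, counts in (("questions", questions), ("answers", answers)):
    for source, share in shares(counts).items():
        print(f"{label:9s} {source:18s} {share:.1%}")
```

Because the shares within each group sum to one, the contributions of the knowledge sources become comparable across texts of different length, which is the stated purpose of the scaling.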