An Experiment in Measuring Understanding

Luc Steels¹, Lara Verheyen² and Remi van Trijp³
¹ Barcelona Supercomputing Center, Barcelona, Spain
² Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Brussels, Belgium
³ Sony Computer Science Laboratories, 6, rue Amyot, 75005 Paris, France

Abstract
Human-centric AI requires not only data-driven pattern recognition methods but also reasoning. Reasoning requires rich models, and we call the process of coming up with these models understanding. Understanding is hard because in real-world problem situations the input for making a model is often fragmented, underspecified, ambiguous and uncertain, and many sources of knowledge are required, including vision and pattern recognition, language parsing, ontologies, knowledge graphs, discourse models, mental simulation, real-world action and episodic memory. This paper reports on a way to measure progress in understanding. We frame the problem of understanding in terms of a process of generating questions, reducing questions, and finding answers to questions. We show how meta-level monitors can collect information so that we can quantitatively track the advances in understanding. The paper is illustrated with an implemented system that combines knowledge from language, ontologies, mental simulation and discourse memory to understand a cooking recipe phrased in natural language (English).

IJCAI 2022: Workshop on semantic techniques for narrative-based understanding, July 24, 2022, Vienna, Austria
steels@arti.vub.ac.be (L. Steels); lara.verheyen@ai.vub.ac.be (L. Verheyen); remi.vantrijp@sony.com (R. van Trijp)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The current wave of data-driven AI almost exclusively employs reactive intelligence. Yet deliberative AI, which was the core of knowledge-based systems in the 1970s and 1980s, is needed to achieve some of the properties argued to be central to human-centric AI, such as (i) providing explanations comprehensible to humans, (ii) dealing with outliers, (iii) learning by being told, (iv) being verifiable, and (v) seamlessly cooperating with humans [1].

Using deliberative AI and integrating it with reactive AI is a realistic target today: reactive AI has advanced enough to be usable in real-world applications, and past decades of AI research have produced a large number of methods and technologies for deliberative AI. There has been significant research on grounding language and representations in sensory-motor data and behavior-based robotics [2], and technology for symbolic knowledge representation and logical inference is well established. Moreover, there has been considerable growth in computationally accessible knowledge, thanks to the crowdsourcing of encyclopedic knowledge and semantic web technology [3]. However, one key issue remains largely unsolved: how to construct the rich models on which deliberative intelligence relies. For example, how can we extract from a recipe a model that is detailed enough to cook the recipe, answer questions, or come up with alternatives if ingredients are not available?

A rich model describes the problem situation and possible paths to a solution from multiple perspectives, using categories that are both understandable to humans and a solid basis for reasoning. For example, when cooking a dish from a recipe, understanding means identifying the ingredients and the food manipulations in sufficient detail to effectively cook the recipe and possibly choose variations if ingredients are missing, if the cooking process does not quite go the way it is described in the recipe, or if the cook wants to be creative [4].
In the case of historical research, understanding an event such as the French Revolution means constructing a model that describes the key actors, their intentions and motivations, the salient events, the causal relations between these events, and the social and governmental changes they cause [5].

Understanding is the process of constructing rich models [6]. Understanding is hard because making sense of data inputs about real-world situations, whether obtained through sensing and measuring or through narrations (texts, images, movies) constructed by other agents to convey their account of events, poses non-trivial epistemological challenges. Typically the data or narrations are sparse, fragmented, underspecified, ambiguous, sometimes contradictory and almost always uncertain.

Figure 1: Understanding is the process of constructing a rich model for deliberative intelligence from diverse, fragmented, ambiguous, uncertain, and incomplete inputs, using a variety of knowledge sources.

Our human mind counteracts these difficulties by combining contributions from sensory processing and measurement, vision and pattern recognition, language processing, ontologies, semantic memory of facts, discourse memory, action execution, mental simulation and episodic memory (see Figure 1). But each of these knowledge sources is in turn incomplete, uncertain and not necessarily reliable, so results cannot be taken at face value. Moreover, there cannot be a linear progression where one algorithm feeds into another, as is common in the pipelines of data-driven AI, because of a paradox known as the hermeneutic circle: to understand the whole we need to understand the parts, but to understand the parts we need to understand the whole [7].

AI systems that understand need to use every possible bit of information and every possible knowledge source as quickly as possible in order to arrive at the most coherent model that integrates all data and constraints. Because of the hermeneutic circle paradox, understanding typically unfolds as a spiraling process. Starting from an initial examination of some input elements (with a lot of ambiguity, uncertainty and indeterminacy), first hypotheses of the whole are constructed. These provide top-down expectations that are tested by a more detailed examination of the same or additional input elements, leading to a clearer view of the whole, which in turn leads back to the examination of additional parts, and so on, until a satisfactory level of understanding, a state known as narrative closure [8], is reached.

This paper builds further on ongoing research into understanding. It does not discuss new technical advances to make understanding feasible for AI systems but focuses instead on developing measures for understanding. We want to define dynamically evolving quantities that increase (or decrease) as the understanding process unfolds, until it reaches narrative closure or exhausts all possible avenues. The paper is illustrated with a concrete example of understanding a recipe for preparing almond cookies, worked out by Katrien Beuls and Paul Van Eecke (for a web demo, see [9]). The example recipe goes as follows:

Recipe for almond cookies:
Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract. 4 grams almond extract. 340 grams flour. 112 grams almond flour. 29 grams powdered sugar.
Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.
3. Add the flour and the almond flour.
4. Mix thoroughly.
5. Take generous tablespoons of the dough, roll it into a small ball, about an inch in diameter, and then shape it into a crescent shape.
6. Place onto a parchment paper lined baking sheet.
7. Bake at 175 degrees Celsius for 15-20 minutes.
8. Dust with powdered sugar.

The experiment reported in this paper uses this recipe text as main input and applies language parsing, ontologies, mental simulation and discourse memory to develop a detailed model of the cooking steps. We do not elaborate the technical details of the example as developed by [9]. Neither do we consider the robotic sensori-motor system for actually performing the actions of the recipe (which would be possible along the lines of [4]), nor the visual processing of recipes, which is also an important source of information [10].
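The spiraling process described above can be sketched as a loop that alternates consultation of knowledge sources with a check for narrative closure. The sketch below is a deliberately simplified illustration under our own assumptions (the function names, the dictionary layout of the model, and the closure test are all hypothetical); it is not the authors' implementation.

```python
# Hypothetical sketch of the hermeneutic spiral: alternate examination of
# parts (each knowledge source sees the current model) with a whole-level
# check, until narrative closure (no open questions) or exhaustion.
def understand(inputs, knowledge_sources, max_rounds=10):
    model = {"questions": set(), "answers": {}}   # simplified narrative network
    for _ in range(max_rounds):
        progress = False
        for source in knowledge_sources:
            new_qs, new_as = source(inputs, model)   # source proposes questions/answers
            progress |= bool(new_qs - model["questions"]) or \
                        bool(new_as.keys() - model["answers"].keys())
            model["questions"] |= new_qs
            model["answers"].update(new_as)
        if model["questions"] <= set(model["answers"]):  # narrative closure
            break
        if not progress:                                 # exhaustion of all avenues
            break
    return model
```

Each pass over the sources corresponds to one turn of the spiral: answers found in one round let other sources pose or answer further questions in the next.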
2. Narrative networks

As elaborated in [11], we view understanding as a spiraling, dialogical process of generating questions and finding answers to them. Different inputs and processing achieve four things: (i) they introduce new questions, (ii) they introduce answers to questions, (iii) they introduce and exercise constraints on the answers to questions, and (iv) they shrink the set of questions by realizing that the answers to two different questions are in fact the same.

The main question posed and answered by the Almond Cookies recipe is how to prepare almond cookies. Narrative closure is reached when all the information needed to do so has been found. The main question raises a host of other questions: what utensils are needed (a baking tray, a bowl), where can things be found or put in the kitchen (freezer, pantry), what ingredients are necessary (116 grams of sugar, 4 grams almond extract), which objects need to be prepared (a mix of flour and almond flour, a small ball of dough), which actions need to be performed (add flour, bake), and what the properties of all these entities and actions are.

We operationalize this framework as follows:

1. Questions are operationalized as variables. A variable has a name, a domain of possible values (possibly with probabilities for each value), a value, also called a binding, with an associated degree of certainty, and bookkeeping information about how the value was derived. Following AI custom, the name of a variable is written as ?variable-name, where variable-name is a symbol chosen to be meaningful for us. Variable names typically have subscripts, as in ?bowl-1, ?bowl-2, ..., which are presumably to be bound to specific bowls in the kitchen while cooking a recipe.

2. Answers are operationalized in terms of entities. Entities are objects, events or (reified) concepts. They are also designated by a symbol, but now without a question mark and with angular brackets. They also have a subscript, as in <bowl-1> or <bowl-2>. Entities are grounded either in real-world observational data, for example a region in an image or a segment of instrumentation data, as entities that may or may not exist in reality, or as entities in a knowledge graph, in which case we use the URI (Universal Resource Identifier) as unique identifier. Entities may have different states; for example, butter could be solid or become fluid when melted. To represent this, an entity has a persistent id and different temporal existences, marked with additional subscripts. For example, <butter-1-1> with the persistent id <butter-1> might change after heating into <butter-1-2>, with the same persistent id but different properties.
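The two notions above can be made concrete in a few lines of code. The sketch below is an illustrative rendering of the stated conventions (field names such as `certainty` and `source` are our own labels for the "degree of certainty" and bookkeeping information), not the Common Lisp implementation used in the experiment.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Illustrative sketch: a question as a variable, an answer as an entity,
# following the conventions described in the text.
@dataclass
class Variable:                      # a question, e.g. ?bowl-1
    name: str                        # written with a leading '?'
    domain: list = field(default_factory=list)  # possible values, optionally with probabilities
    binding: Optional[Any] = None    # the answer, once found
    certainty: float = 0.0           # degree of certainty of the binding
    source: Optional[str] = None     # bookkeeping: how the value was derived

    def bind(self, entity, certainty, source):
        self.binding, self.certainty, self.source = entity, certainty, source

@dataclass(frozen=True)
class Entity:                        # an answer, e.g. <butter-1-1>
    persistent_id: str               # stable across state changes, e.g. "butter-1"
    state: int = 1                   # temporal existence; bumped e.g. after melting

q = Variable("?bowl-1")
q.bind(Entity("bowl-1"), certainty=0.9, source="ontology")
```

A state change such as melting would then be represented by binding a new `Entity` with the same `persistent_id` but an incremented `state`.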
3. Constraints are operationalized in terms of frames. In the tradition of frame-based knowledge representation originating in the mid-1970s [12], a frame is a data structure that describes the typical features of a class of objects or events in terms of a set of slots (also called roles) for entities. The slots introduce questions that should be asked about the entities belonging to the class covered by the frame. Following the common convention of object-oriented systems, one slot of a frame, called the self, designates the entity being described by the frame.

When a frame is used to describe a particular entity or set of entities, it is instantiated. Frames and instances of frames are designated by symbols with square brackets. Names of instances have indices. In the recipe example, there is for example a frame for [bowl] with slots for the bowl itself, the contents, the size, the cover, whether the bowl has been used, etc. A specific bowl entity, e.g. <bowl-1>, is described by a frame instance, e.g. [bowl-75].¹

Frames are organized in multiple inheritance hierarchies. For example, the [bowl] frame inherits from the [coverable-container] frame, which introduces a slot for the cover. This frame itself inherits from the [container] frame, which in turn inherits from the [kitchen-entity] frame. The [bowl] frame also inherits from the [reusable] frame, which introduces a slot indicating whether the entity has been used (see Figure 2).

Figure 2: Small fragment of a narrative network built up for the Almond Recipe. Frames have square brackets and inheritance links between frames are in red. Frame instances also have square brackets but their names and their slots are in black. Entities are in green and use angular brackets. Binding relationships between variables are in double-lined green, such as between ?self-bowl-75 and ?source-37, and grounding relations are in dashed green, such as between ?self-bowl-75 and the entity <bowl-1>.

A frame also contains default values for its slots and methods to determine a value from other values, stimulate the instantiation of other frames, or change the certainty or justification of a binding. The methods associated with frames are activated either by explicitly calling them using a name (call by name) or by checking which slots already have values and then triggering the appropriate method (pattern-directed invocation). Frames are symbolic data structures that are matched and merged using unification operators. They can be extracted from large frame inventories such as FrameNet, WordNet or PropBank [13], or they can be learned, either from examples using anti-unification and pro-unification operators or through hypothesize-and-test strategies. For the present example, all frames have been designed by hand.

Frame instances, variables, entities and links between them form a graph called a narrative network (see Figure 2). Narrative networks quickly get very large, having hundreds of nodes and links even for a short text. The experiment reported here uses a range of AI programming tools for the implementation of frames and narrative networks, based on the standard Common Lisp Object System (CLOS) [14]: the constraint propagation system IRL [15], the BABEL architecture for organizing the overall understanding process in terms of tasks [16], and Fluid Construction Grammar [17, 18] for linguistic processing.

¹ All these indices are automatically constructed by the understanding system itself.
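How slots, inheritance and instantiation interact can be sketched compactly. The frame and slot names below follow the [bowl] example from the text, but the class itself is a minimal assumption-laden illustration in Python, not the CLOS code used in the experiment; it shows only how instantiating a frame introduces one question (variable) per slot, including inherited slots.

```python
# Minimal frame sketch (illustrative, not the CLOS/IRL implementation):
# a frame's slots, including those inherited from parent frames,
# introduce one variable per slot when the frame is instantiated.
class Frame:
    def __init__(self, name, slots=(), parents=()):
        self.name = name              # e.g. "bowl"
        self.slots = list(slots)      # own slots, e.g. ["size"]
        self.parents = list(parents)  # multiple inheritance

    def all_slots(self):
        # Own slots plus those of all ancestors, without duplicates.
        seen, out = set(), []
        for slot in self.slots + [s for p in self.parents for s in p.all_slots()]:
            if slot not in seen:
                seen.add(slot)
                out.append(slot)
        return out

    def instantiate(self, index):
        # Each slot of the instance [bowl-75] becomes a variable like ?self-bowl-75.
        return {slot: f"?{slot}-{self.name}-{index}" for slot in self.all_slots()}

kitchen_entity = Frame("kitchen-entity", ["self"])
container = Frame("container", ["contents"], [kitchen_entity])
coverable = Frame("coverable-container", ["cover"], [container])
reusable = Frame("reusable", ["used"], [kitchen_entity])
bowl = Frame("bowl", ["size"], [coverable, reusable])

bowl_75 = bowl.instantiate(75)   # introduces ?self-bowl-75, ?cover-bowl-75, ...
```

Instantiating [bowl] thus automatically poses the questions contributed by [coverable-container], [container], [kitchen-entity] and [reusable] as well.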
3. Knowledge Sources

The understanding process must rely on a wide variety of knowledge sources in order to come up with questions and answers. In the experiment reported here, we focus only on contributions from ontologies, language (lexicon & grammar), discourse memory and mental simulation.

1. An ontology defines the inventory of available frames for describing objects, events, actions and their properties. These frames contribute to the construction of the narrative network by introducing questions for their slots. The slots often have initial or default values, in which case the questions they pose can also be (tentatively) answered. Because frames inherit from one or more other frames, all slots of these parent frames are added as well.

For instance, given the example sentence 'Beat the butter and the sugar together until light and fluffy' (sentence 1 in the instructions of the recipe), lexical processing of the verb 'beat' would find the beat-frame. Consultation of the ontology introduces questions (i.e. variables) from the slots of this frame, namely what tool should be used to beat (by default a whisk), the initial and final kitchen states respectively before and after beating, what container contains the material to be beaten, what the state of this container is after beating, when the beating should stop, and more.

2. After tokenization, lemmatization and part-of-speech tagging, lexical processing performs a mapping from lexical stems to frames, because stems act as frame-invoking elements. These frames are then instantiated and their various slots added as variables to the narrative network under construction. Grammatical processing can invoke additional frames, for example related to tense, aspect, mood and modality, but, more importantly, it can also link parts of the narrative network together, which means that the variables introduced by separate frame instances are made co-referential. For example, 'Beat the butter and the sugar together' is an example of a resultative construction where the goal of the action is to fuse two substances, butter and sugar, such that they become one. Thanks to this construction we know that the answer to the question 'what should be beaten' is equal to the answers to the questions 'what butter amount is to be used' and 'what sugar amount is to be used'.

3. Mental simulation imagines the sequence of actions over time and records what consequences their execution has on the various objects involved in the action. Mental simulation can take either the form of physical simulation, for example with realistic computer graphics engines, or of qualitative simulation [19]. In this experiment we only look at qualitative simulation, implemented through pattern-directed methods associated with frames. These methods become active when some variables have already been bound and compute the values of other variables. They also create additional objects and instantiate more frames that are linked into the network.

4. Discourse memory contains information about the way a narrative unfolds. For example, it is well known from the study of pragmatics in linguistics that languages contain various cues that bring entities into the attention span of the listener, so that they suggest referents for pronouns or underspecified descriptions [20]. The present experiment uses only a rudimentary example of discourse memory, namely one which marks entities that have been mentioned directly or indirectly as accessible entities, which can then be referred to by pronouns or general descriptions (such as 'the butter').

4. Measuring progress in understanding

There are many possible ways to measure the progress and quality of understanding. Here are a few examples: coverage - how much of the input is handled; closure - how many open questions are left; fragmentation - how many unconnected subgraphs remain; ambiguity - how many choice points could not be resolved; uncertainty - how much uncertainty is left globally; dissonance - how much of the outcome is incompatible with the frames in the ontology; anchorage - how many non-grounded entities are left.
In this paper we only focus on the increase and decrease in the number of questions that pop up during understanding and the increase in the number of answers that are found. Both the questions and the answers come from different knowledge sources, but we can measure their contributions separately.

To collect data during understanding we use a meta-level facility available in the BABEL architecture [16] which allows for the definition of monitors that become active when a triggering condition, for example the addition of a new node or link to the narrative network, is detected. The monitors then collect relevant information by observing the state of understanding at that point, including which knowledge source was responsible.

The first experiment considers only a subpart of the recipe, namely the first four ingredients and the first two instructions:

Ingredients: 226 grams butter, room temperature. 116 grams sugar. 4 grams vanilla extract.
Instructions:
1. Beat the butter and the sugar together until light and fluffy.
2. Add the vanilla and almond extracts and mix.

The graphs display absolute values both for the number of questions and the number of answers. The graph on the left of Figure 3 decomposes the contributions by the different knowledge sources with respect to questions, and the graph next to it decomposes them for answers. At the bottom of the graphs we see the names of the frames or linguistic constructions that made the contribution.

There is a total of 165 questions being posed for this first part of the recipe. Before parsing the first sentence, a complete kitchen state with a baking tray, bowls, ingredients stored in the refrigerator or pantry, etc. is instantiated. The ontology raises the first set of questions and the mental simulation starts to provide the first answers. Parsing of '226 grams butter, room temperature' and consultation of the ontology for the frames triggered by the words in this phrase start raising questions such as what bowl is to be used, what material has to be put in, what is the quantity and unit of measurement, at what temperature does the material have to be, etc. Some of these questions (for example the quantity and measurement unit) are directly answerable from the linguistic input, others require mental simulation, and some are obtained from the ontology. After each set of parsing steps we see a jump in available answers because mental simulation is carried out after each sentence. The discourse model also gets updated and is used to answer some of the questions later on. The discourse model keeps raising questions of its own as well, namely about what to do with elements that have been introduced but not yet used in the cooking process.

The second experiment (see Figure 4) considers the complete almond cookie recipe and now scales the values for questions and answers with respect to the total number. Values are scaled to become comparable to other cases of understanding. For the complete recipe there is a total of 337 questions (159 triggered by language, 37 by the discourse model and 141 by the ontology). There are 284 answers (77 from language, 25 from the discourse model, 80 from mental simulation and 102 from the ontology). All knowledge sources play an important role. There are remaining questions at the end because there is no clean-up activity for questions; the remaining questions are about what to do with the bowls that were used. Narrative closure is reached because the baking tray contains the desired almond cookies.
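The per-source tallies reported in these experiments can be gathered with a simple observer mechanism: something fires on every addition to the narrative network, records which knowledge source was responsible, and scales the counts by the totals for comparison across texts. The event interface below is an assumption made for illustration; BABEL's actual monitor facility differs.

```python
from collections import Counter

# Sketch of a meta-level monitor (illustrative, not BABEL's facility):
# triggered on every node added to the narrative network, it tallies
# questions and answers per knowledge source.
class Monitor:
    def __init__(self):
        self.questions = Counter()   # knowledge source -> questions introduced
        self.answers = Counter()     # knowledge source -> answers found

    def on_add(self, kind, source):
        # Triggering condition: a new question or answer node is detected.
        (self.questions if kind == "question" else self.answers)[source] += 1

    def scaled(self):
        # Scale counts by the totals so different texts become comparable.
        qt, at = sum(self.questions.values()), sum(self.answers.values())
        return ({s: n / qt for s, n in self.questions.items()} if qt else {},
                {s: n / at for s, n in self.answers.items()} if at else {})

# Replaying the question totals reported for the complete recipe:
mon = Monitor()
for src in ["language"] * 159 + ["discourse"] * 37 + ["ontology"] * 141:
    mon.on_add("question", src)
```

With this event stream the scaled question shares are exactly the proportions plotted in Figure 4.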
We see in these examples that ontologies and mental simulation of cooking actions play important roles in addition to language. There are still other knowledge sources that have not been incorporated and that are not explicitly mentioned in language but known from common sense. The most obvious one is to take the baking tray out of the oven, let the cookies cool off and put them in a bowl for later storage or immediate consumption.

Figure 3: Fine-grained unscaled results of the understanding process for part of the recipe. Left: total number of questions with decomposition of question contributions. Right: total number of answers with decomposition of answer contributions. The y-axis maps to specific processing events, namely the application of constructions or the interpretation of the meaning obtained by parsing a phrase. The bars on the y-axis show questions posed resp. answers obtained. They are decomposed into sections, with blue sections contributed by language processing, orange ones by mental simulation, green ones by consultations of the discourse model and red ones by the ontology.

Figure 4: Coarse-grained scaled results of the understanding process for the complete recipe with decomposition of answer contributions (left) and question contributions (right).

5. Conclusions

We defined understanding as the construction of a rich model of a problem situation based on fragmented, incomplete, uncertain and underspecified sources. We explored a way to measure one central aspect of the understanding process, namely tracking the addition, reduction or answering of questions by different knowledge sources. More concretely, we focused on the use of ontologies, language, discourse models and mental simulation. This work is just one tiny step in building a quantitative infrastructure for tracking and evaluating understanding in AI systems. Having quantitative measures is useful to pin down precisely the contribution of a particular knowledge source, or to provide feedback to the attention mechanism that guides which knowledge sources should preferentially be used or which areas of a narrative network should be the focus of attention. Quantitative measures will also play a role as a feedback signal for improving the efficiency and efficacy of understanding.

Acknowledgments

This paper was funded by the EU Pathfinder Project MUHAI and the authors thank the host laboratories for this work: the Venice International University (LS), the VUB AI Lab (LV) and the Sony Computer Science Laboratories Paris (RvT). We thank Paul Van Eecke and Katrien Beuls from the VUB AI Laboratory for laying the basis for the almond cookie case study used here. The experiment is part of the EU Pathfinder project MUHAI on Meaning and Understanding in Human-Centric AI. Lara Verheyen is funded by the 'Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen' of the Flemish Government.

References
[1] A. Nowak, P. Lukowicz, P. Horodecki, Assessing artificial intelligence for humanity: Will AI be our biggest ever advance? Or the biggest threat?, IEEE Technology and Society Magazine 37(4) (2018) 26–34.
[2] L. Steels, M. Hild (Eds.), Language Grounding in Robots, Springer Verlag, New York, 2012.
[3] G. Antoniou, F. Van Harmelen, A Semantic Web Primer, The MIT Press, Cambridge, MA, 2008.
[4] M. Beetz, D. Jain, L. Mösenlechner, M. Tenorth, L. Kunze, N. Blodow, D. Pangercic, Cognition-enabled autonomous robot control for the realization of home chore task intelligence, Proceedings of the IEEE 100 (2012) 2454–2471.
[5] R. van Trijp, I. Blin, Narratives in historical sciences, in: L. Steels (Ed.), Foundations for Incorporating Meaning and Understanding in Human-centric AI, MUHAI consortium, 2022.
[6] L. Steels, Conceptual foundations for human-centric AI, in: M. Chetouani, V. Dignum, P. Lukowicz, C. Sierra (Eds.), Advanced Course on Human-Centered AI, ACAI 2021, LNAI Tutorial Lecture Series, Springer Verlag, Berlin, 2022.
[7] H.-G. Gadamer, Hermeneutics and social science, Cultural Hermeneutics 2(4) (1975) 207–316.
[8] N. Carroll, Narrative closure, Philosophical Studies 135 (2007) 1–15.
[9] K. Beuls, P. Van Eecke, Understanding and executing recipes expressed in natural language, 2022. Web demonstration at https://ehai.ai.vub.ac.be/demos/recipe-understanding/.
[10] J. Marin, A. Biswas, F. Ofli, N. Hynes, A. Salvador, Y. Aytar, I. Weber, A. Torralba, Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[11] L. Steels, A framework for understanding, in preparation (2022).
[12] M. Minsky, A framework for representing knowledge, in: P. H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975, pp. 211–277.
[13] M. Palmer, P. Kingsbury, D. Gildea, The Proposition Bank: An annotated corpus of semantic roles, Computational Linguistics 31(1) (2005) 71–106.
[14] G. Kiczales, J. des Rivieres, D. Bobrow, The Art of the Metaobject Protocol, The MIT Press, Cambridge, MA, 1991.
[15] M. Spranger, S. Pauw, M. Loetzsch, L. Steels, Open-ended procedural semantics, in: L. Steels, M. Hild (Eds.), Language Grounding in Robots, Springer Verlag, New York, 2012, pp. 159–178.
[16] L. Steels, M. Loetzsch, Babel: A tool for running experiments on the evolution of language, in: S. Nolfi, M. Mirolli (Eds.), Evolution of Communication and Language in Embodied Agents, Springer Verlag, New York, 2010, pp. 307–313.
[17] L. Steels (Ed.), Computational Issues in Fluid Construction Grammar, volume 7249 of Lecture Notes in Computer Science, Springer Verlag, Berlin, 2012.
[18] L. Steels, Basics of Fluid Construction Grammar, Constructions and Frames 9(2) (2017) 178–225.
[19] B. Kuipers, Qualitative simulation, Artificial Intelligence 29 (1986) 289–338.
[20] K. von Heusinger, P. Schumacher, Discourse prominence: Definition and application, Journal of Pragmatics 154 (2019) 117–127.