=Paper=
{{Paper
|id=Vol-3906/paper1_BAILAR
|storemode=property
|title=What am I? – Complementing a robot’s task-solving capabilities with a mental model using a cognitive architecture (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3906/paper1_BAILAR.pdf
|volume=Vol-3906
|authors=Thomas Sievers,Nele Russwinkel,Ralf Möller
|dblpUrl=https://dblp.org/rec/conf/ro-man/SieversR024
}}
==What am I? – Complementing a robot’s task-solving capabilities with a mental model using a cognitive architecture (short paper)==
<pdf width="1500px">https://ceur-ws.org/Vol-3906/paper1_BAILAR.pdf</pdf>
<pre>
                         What am I? – Complementing a robot’s task-solving
                         capabilities with a mental model using a cognitive
                         architecture
                         Thomas Sievers1,* , Nele Russwinkel1 and Ralf Möller2
                         1
                             Institute of Information Systems, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
                         2
                             CHAI-Institut, Universität Hamburg, 20354 Hamburg, Germany


                                        Abstract
                                        One way to improve Human-Robot Interaction (HRI) and increase trust, acceptance and mutual understanding is
                                        to make the behavior of a social robot more comprehensible and understandable for humans. This is particularly
                                        important if humans and machines are to work together as partners. To be able to do this, both must have
                                        the same basic understanding of the task and the current situation. We created a model within a cognitive
                                        architecture connected to the robot. The cognitive model processed relevant conversational data during a dialog
                                        with a human to create a mental model of the situation. The dialog parts of the robot were generated with a Large
                                        Language Model (LLM) from OpenAI using suitable prompts. An ACT-R model evaluated the data received by the
                                        robot according to predefined criteria – in our example application, hierarchical relationships were established
                                        and remembered – and provided feedback to the LLM via the application for prompt augmentation with the
                                        purpose of adapting or fine-tuning the request. Initial tests indicated that this approach may have advantages
                                        for dialogic tasks and can compensate for weaknesses in terms of a deeper understanding or “blind spots” on the
                                        part of the LLM.

                                        Keywords
                                        cognitive architecture, human-robot interaction, ACT-R, mental model, ChatGPT


                         1. Introduction
                         AI technologies are finding their way into our daily lives ever more quickly and to an ever greater
                         extent. As a result, there is a greater need for AI systems that work more or less “at eye level” – on
                         equal terms – with humans. Robots that interact with humans and solve tasks with a human partner
                         must have some kind of model of the world, the situation, the task to be solved and the person they
                         are interacting with. These robots require cognitive capabilities that may need to go beyond Cartesian
                         mind-body dualism in order for the humans cooperating with these artificial cognitive agents to build a
                         growing level of personal trust and mutual accountability [1].
                           Robots today already have impressive abilities in terms of navigation, interaction with objects and
                         social interaction [2]. Kambhampati uses the term human-aware AI systems to describe methods that
                         pay more attention to the aspects of intelligence that enable successful collaboration between people,
                         including modeling the mental states of humans in the loop [3]. An inclusion of mental models also
                         makes sense in terms of human-centered AI (HCAI), which aims to create AI systems that enhance and
                         complement human capabilities rather than replace them [4]. The aim is to move from a mindset that
                         focuses on algorithms to a human-centered perspective that also improves trust in and acceptance of
                         social robots in Human-Robot Interaction (HRI).
                           In our opinion, cognitive architectures, with their ability to create mental models based on human
                         cognitive abilities, can be used to provide robotic applications with a “human component” [5]. A

                         ALTRUIST, BAILAR, SCRITA, WARN 2024: Workshop on sociAL roboTs for peRsonalized, continUous and adaptIve aSsisTance,
                         Workshop on Behavior Adaptation and Learning for Assistive Robotics, Workshop on Trust, Acceptance and Social Cues in Human-
                         Robot Interaction, and Workshop on Weighing the benefits of Autonomous Robot persoNalisation. August 26, 2024, Pasadena, USA
                         *
                           Corresponding author.
                         " t.sievers@uni-luebeck.de (T. Sievers); nele.russwinkel@uni-luebeck.de (N. Russwinkel); ralf.moeller@uni-hamburg.de
                         (R. Möller)
                          0000-0002-8675-0122 (T. Sievers); 0000-0003-2606-9690 (N. Russwinkel); 0000-0002-1174-3323 (R. Möller)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
combination of robot sensing and data processing with such an architecture offers the possibility to
use real-world data from the robot in mental models. Creating such models enables the robot to better
understand the mindset of a human partner or to act in a way that is perceived as more natural by
humans.
   Cognitive architectures refer both to a theory about the structure of the human mind and to a
computational realization of such a theory. Their formalized models can be used to flexibly react
to actions of the human collaboration partner and to develop situation understanding for adequate
reactions. Well-known and successfully used cognitive architectures are ACT-R (Adaptive Control of
Thought - Rational) and SOAR [6]. ACT-R has an instance-based learning mechanism that enables
intuitive decision-making and exploration of the cognitive processes and representations that underlie
human behavior [7]. Pipitone et. al., for example, used ACT-R to implement a method for a robotic
self-recognition by inner speech [8].
   We thought of on an idea for using an ACT-R model to control and manage the dialog parts of the
robot in a guessing game called “What am I?”. With the use of large language models (LLMs) from
OpenAI’s Generative Pretrained Transformer (GPT, commonly known as ChatGPT) [9] we generated
what the robot was saying via ChatGPT, created a mental model of essential parts of the dialog in
parallel to control the dialog by feedback of the results and outputs of the model, which flowed into the
generation of the ChatGPT prompt for the next robot utterance.
   In the following, we explain our ideas on this topic, provide an insight into the ongoing work and
summarize our initial findings.


2. Methodology
When using large language models, the challenge remains to generate texts that fulfill complex con-
straints. As an example to overcome this challenge, Zhang et al. proposed the application of lexical
constraints in such language models using tractable probabilistic models (TPMs) [10].
   Our approach to address the constraint problem with ChatGPT was to use a cognitive architecture to
control the LLM’s utterances. Since ChatGPT does not have a dynamic cognitive model of the human
conversation partner and therefore quickly reaches its limits in cognitive processes, we used an ACT-R
model that tracks the course of the conversation and intervenes if necessary.
   We have only just started working on this. However, since it is possible to integrate an ACT-R model
with bidirectional communication into a robot application as described by Sievers et. al. [11] and thus
also establish the connection to an LLM, we hypothesize the following:
   Hypothesis It is possible to control the output of an LLM via prompts that are influenced by a
cognitive model.
   This approach should result in an advantage in terms of a more human-like chain of thought and
thus a more human-like behavior – in our test scenario so far only in relation to the robot’s utterances –
followed by a better understanding and thus also greater acceptance by humans.

2.1. ACT-R
The basic mechanism of ACT-R consists of the main components modules, buffers and pattern matcher
[12]. There are two types of modules: Perceptual-motor modules forming the interface with the real
world (motor module and visual module), and the memory modules comprising declarative memory
consisting of facts and procedural memory consisting of productions. Productions represent knowledge
about how something should be done. Figure 1 gives an overview of the main components.
  ACT-R accesses its modules (with the exception of the procedural memory) via special buffers. The
buffers form the interface to this module. The buffer content represents the state of ACT-R over time.
The pattern matcher attempts to find a production that corresponds to the current state of the buffers.
Only one production can be executed at a time. Productions can modify the buffers during execution
and thus change the state of the system. Cognition is therefore represented in ACT-R as a sequence of
production firings.
  In our approach, we did not use the visual and motor modules to provide input to the system. Our
application wrote the relevant data from the LLM utterances directly into the declarative memory of
the ACT-R model. In this way, new memories were created that could be accessed later. The results of
the model’s considerations were fed back to the robot application via the goal buffer.


Figure 1: ACT-R modules, buffers and pattern matcher


2.2. Test scenario
The scenario for our test case was a guessing game called “What am I?”. A participant – in this case
the human – thought of a profession that the robot has to guess. The robot asked questions about the
field of activity, etc., and the human could only answer yes or no to these questions. We used GPT-4o
to create the conversational parts of a social robot. The LLM was instructed via system prompts to
understand the rules of the game and everything we needed it to output.
   For our studies we used the humanoid social robot Pepper [13], which is optimized for human
interaction. In a dialog with humans, the robot application we created forwarded the utterances of
the human dialog partner as input to the OpenAI API, which returned ChatGPT’s answer as response.
With each API call, the entire dialog was transferred to the GPT model. This allowed the model to
constantly ‘remember’ what was previously said and refer to it as the dialog progressed. The text
returned by the API was forwarded to the robot’s voice and tablet output. The OpenAI API provides
various hyperparameters that can be used to control the model behavior during an API call. To obtain
consistent responses and exclude any randomness as far as possible we set the value for temperature to
0.
   Figure 2 shows the setup for the bidirectional connection between ACT-R and the client application
on the robot. The dispatcher as a part of the ACT-R framework acts as a kind of server and is necessary
to establish a remote connection between the robot application as a client and ACT-R.
Figure 2: Connection between ACT-R / Dispatcher and the robot respectively the client application


  In order to determine possible differences in guessing behavior, we compared game runs in which
questions and responses were generated exclusively by ChatGPT itself with runs in which an ACT-R
model was integrated that picks up and processes job-specific terms from the ChatGPT utterances.
These job-specific terms must also be generated dynamically by ChatGPT.

2.3. Prompting the LLM
We used prompts for the system role to instruct GPT-4o to execute the tasks as a completion task.
Zero-shot prompting [14] was used for this tasks. The system prompt for instructing the LLM consisted
of the explanations on how to play the guessing game. In addition, the LLM was instructed to output
keywords that characterize the profession, separated by commas in square brackets, for example
[creative, computer, designer]. These job-specific keywords were transferred from the robot application
to the ACT-R model and stored as a chunk in the declarative memory as sketched in Figure 3.


Figure 3: Transfer of the last two keywords to the ACT-R model and possible feedback to the LLM


  When the robot made a chain of incorrect guesses, the system prompt was modified and provided
with information from the mental model of the guessing process stored in ACT-R’s declarative memory.
This type of prompt augmentation ensures that a system prompt appears with information such as "Your
guess is currently going in the wrong direction" and additionally a stored superordinate job-specific
keyword to help ChatGPT out of a possible dead end.
   Figure 4 shows a typical progression of a profession guessing with ChatGPT and the ACT-R component
in German, here in the emulator of the software development kit (SDK) for the Pepper robot. The LLM
had focused on the first two keywords “office” and “computer” and the guessing went in the wrong
direction. Then there was a rebound in which the second keyword was replaced by a new one, in this
case “finance”. This break or regression in the hierarchy of job-specific keywords is characterized by a
frame. The keyword “computer” was omitted and a new start was made from the generic term “office”.
This finally led to the correct guess of the profession searched for with the keywords [office, marketing,
creative, SEO, specialist]. Such a change would not have taken place without an intervention by the
mental model in ACT-R.


Figure 4: Dialog view of the robot emulator for testing the application with an intervention of the ACT-R model
marked by a frame


2.4. Mental Modeling with ACT-R
The ACT-R model we developed for initial tests was rather simple. It used the goal buffer to exchange
data with the robot application. There were no pre-created chunks in the declarative memory, all
memory contents were created as soon as they were mentioned as keywords by the ChatGPT utterances.
However, in principle, additional a-priori knowledge could be provided in the declarative memory.
   For each guess attempt, the last two keywords in the square brackets were passed to the model as
an object-category pair and stored. In addition, this chunk was specified as the current goal in order
to focus the model’s attention on it. During the course of the guessing game, the model was able to
establish a reference to higher-level keywords by linking these keyword pairs and thus maintained an
overview of the whole structure of the guessing process.
   The tracer output of the running ACT-R model in Figure 5 shows an example of how memory
chunks are created in declarative memory using job-specific keywords. In addition, the productions
that were triggered are displayed as well as older chunks retrieved from memory with previously stored
object-category pairs. The production called chain-category searches the stored memory chunks for
matching keywords in order to identify term concatenations and, if successful, to trace the chain back
to the source term and transfer it to the robot application. The application is then able to use this “chain
of thought” with the next ChatGPT prompt for helpful hints and influence the guessing process.


Figure 5: Tracer output of the running ACT-R model showing memory chunks and firing productions


  This knowledge of original and hierarchically superordinate job-specific keywords can be viewed as a
mental model of the course of the game and used to give ChatGPT hints via appropriately supplemented
system prompts if, for example, it has maneuvered itself into a dead end while guessing. We also used
the possibility of influencing the activation of individual memory chunks in a positive or negative
direction depending on the answers (yes or no) of the human player.


3. Findings
After initial tests, it seems promising to pursue this approach further. Our hypothesis about the
possibility of using a cognitive model to influence prompts to control the output of an LLM appears to
be valid. Initial trials with our guessing game variants have shown advantages of the variant with the
ACT-R model as support on some occasions. We tried out 16 different professions in the guessing game,
both with and without ACT-R support. We only tried each profession once. With the support of the
mental model, all professions were guessed. Without ACT-R support, 5 professions were not guessed,
the LLM always repeated the same questions at some point without reaching the goal. However, we
still need more test runs, more detailed investigations and experiments with variations of different
parameters in the LLM (e.g. temperature) and ACT-R model (e.g. activation) in order to make clear
comparative statements.


4. Conclusion
Since ChatGPT showed deviations in the questions and formulations during different rounds of the
guessing game, even with the most deterministic basic setting possible, the profession searched for
might be guessed more or less quickly, regardless of whether the ACT-R model is used. It is therefore
difficult to assess in individual cases whether our approach actually brings benefits – both in relation to
our test scenario and beyond. A larger number of test runs – later also in different application scenarios
– should provide more clarity here.
   The basic idea of creating and using a mental model with the help of a cognitive architecture is
easily transferable to other applications and can therefore be used quite universally. The mental model
that the robot builds of the current situation could be enriched with further real-world data, e.g. with
emotion recognition and timing, in which the robot incorporates the obvious mood of the human
counterpart into its considerations, or also with image data. In addition to language, the results of a
mental model could also control other aspects of a social robot’s behavior, such as movements and
gestures. Technically, it would be advantageous for real-world applications if the ACT-R model could
run on the robot’s hardware. This should be possible with a robot whose programming is based on
Python with a Python implementation of ACT-R.


References
 [1] G. Sandini, A. Sciutti, P. Morasso, Artificial cognition vs. artificial intelligence for next-generation
     autonomous robotic agents, Frontiers in Computational Neuroscience 18 (2024). URL: https://www.
     frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2024.1349408. doi:10.
     3389/fncom.2024.1349408.
 [2] A. Rossi, F. Garcia, A. Cruz Maya, K. Dautenhahn, K. Koay, M. Walters, A. K. Pandey, Investigating
     the Effects of Social Interactive Behaviours of a Robot on People’s Trust During a Navigation Task,
     2019, pp. 349–361. doi:10.1007/978-3-030-23807-0_29.
 [3] S. Kambhampati, Challenges of human-aware ai systems, 2019. URL: https://arxiv.org/abs/1910.
     07089. arXiv:1910.07089.
 [4] B. Shneiderman, Human-Centered AI, Oxford University Press, 2022.
 [5] A. Werk, S. Scholz, T. Sievers, N. Russwinkel, How to provide a dynamic cognitive person model
     of a human collaboration partner to a pepper robot, Society for Mathematical Psychology, 2024
     forthcoming.
 [6] J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, Y. Qin, An integrated theory of
     the mind 111 (2004) 1036–1060. doi:10.1037/0033-295X.111.4.1036.
 [7] R. Thomson, C. Lebiere, J. R. Anderson, J. Staszewski, A general instance-based learning framework
     for studying intuitive decision-making in a cognitive architecture, Journal of Applied Research in
     Memory and Cognition 4 (2015) 180–190. doi:https://doi.org/10.1016/j.jarmac.2014.
     06.002, modeling and Aiding Intuition in Organizational Decision Making.
 [8] A. Pipitone, A. Chella, Robot passes the mirror test by inner speech, Robotics and Autonomous
     Systems 144 (2021) 103838. doi:10.1016/j.robot.2021.103838.
 [9] OpenAI, Transforming work and creativity with AI, Technical Report, 2024. URL: https://openai.
     com/product.
[10] H. Zhang, M. Dang, N. Peng, G. Broeck, Tractable control for autoregressive language generation,
     2023. doi:10.48550/arXiv.2304.07438.
[11] T. Sievers, N. Russwinkel, How to use a cognitive architecture for a dynamic person model with a
     social robot in human collaboration (2024 forthcoming).
[12] R. Budiu, ACT-R / About, Technical Report, 2024. URL: http://act-r.psy.cmu.edu/about.
[13] Aldebaran, United Robotics Group and Softbank Robotics, Pepper, Technical Report, 2024. URL:
     https://www.aldebaran.com/en/pepper.
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
     D. Ziegler, J. Wu, C. Winter, D. Amodei, Language models are few-shot learners, 2020. doi:10.
     48550/arXiv.2005.14165.

</pre>