The Role of Gestures and Movement in Computational, Embodied Storytelling

Philipp Wicke¹,*
¹ Centre for Language and Information Processing (CIS), Ludwig-Maximilians-Universität München (LMU), Oettingenstraße 67, Munich, 80538, Germany
* Corresponding author: pwicke@cis.uni-muenchen.de, www.phil-wicke.com, ORCID 0000-0001-9891-5353

ICCC'22 Workshop: The Role of Embodiment in the Perception of Human & Artificial Creativity, June 27–28, 2022, Bozen, Italy. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
With advances in deep learning, robust and convincing models for text, image and movement generation are revolutionising many multi-modal forms of generative art. The domain of artificial story generation has seen this revolution through the development of generative pretrained transformer models. Storytelling robots can benefit from these deep learning systems as well, relying not only on high-quality textual artefacts, but also on the movement and voice synthesis they provide. Yet, the development from symbolic systems to neural systems has burdened the perception of artificial creativity with opaqueness. Where symbolic systems expressed the intentional use of a conceptual system, neural models bury these conceptual systems in their intractable statistical depths. This research proposes the use of symbolic gestures to augment artificial, robotic storytelling in order to explicate the hidden conceptual structures of artificial storytelling. Building on gesture research and image schemas, we show how the symbolic nature of schematic movement and spatial metaphor can aid the perception of artificial creativity.

Keywords
Computational Creativity, Gestures, Embodiment, Storytelling

1. Introduction

The current advances in Artificial Intelligence (AI) research are largely driven by developments in machine learning that enable ever more complex, ever deeper neural network architectures [1]. These architectures produce systems of unprecedented performance in a variety of domains [2, 3, 4]. Most notably, the effectiveness of these systems relies on great amounts of data, large quantities of computing power and neural networks optimised to learn the complex correlations hidden in the data.

One major breakthrough in language modelling has been the generative pretrained transformer (GPT) architecture [2]. Its attention mechanism and its ability to scale performance with an increasing number of statistical parameters make these systems multitask, few-shot learners [3]. This means that these language models are able to generate text, answer questions or translate without being trained specifically on those tasks. Consequently, large language models have been used to implement a variety of "creative" language systems [4, 5, 6]. The biggest improvement offered by these systems is their capability to maintain long-range dependencies and entity coherence within the generated texts whilst forming non-generic plotlines. Yet, these advances in large statistical models come at a cost. Not only are such systems computationally expensive in training and data (making them irreproducible for conventional research labs), they are also, as of now, incapable of explaining their reasoning or underlying conceptual structures.
Our research argues that this deficiency calls into question whether these systems can be evaluated in terms of artificial creativity, and proposes an embodied approach to robotic storytelling which relies on symbolic gestures in order to recover some power over the analysis of these opaque statistical systems.

2. Understanding Creative Processes

To define the role of embodiment in the perception of human and artificial creativity, this research adopts the theory of embodied cognition [7]. Under this premise, many features of cognition are embodied in that they are critically dependent upon characteristics of the body of an agent [8]. Investigating creativity, for both machines and humans, therefore requires an investigation of those bodily characteristics which can influence (or enable) creative cognition.

The field of Computational Creativity (CC) explores human and machine creativity by constructing autonomous systems capable of producing novel and useful artefacts [9]. Hence, the acknowledgment of embodiment has given rise to various research efforts in the field of CC (for an overview, see [10]). With respect to embodiment and artificial creativity, our current research investigates the role of gestures in the domain of robotic storytelling. The next section presents a short overview of related works bridging gesture studies and creative storytelling.

3. Related Works

Computational Storytelling
An embodied storyteller requires a story to be told and an instruction for how the story is supposed to be told. The former is the fabula ("what is told") and the latter is the discourse ("how it is told") [11]. Tasking machines with generating a fabula has been studied since the 1960s [12], and in the following decades symbolic systems have modelled the intricate deep structures of plot generation to achieve this task. In contrast, generative neural networks entered the field only in the past decade and, whilst abandoning a focus on deep structures, have brought forth high-quality textual outputs [2, 3]. Three selected systems will briefly be reviewed to explain our choice of the Scéalextric story generator [13] for our gesture studies.

The MEXICA story generator by Pérez y Pérez [14] has been in development for more than 20 years and models various mechanisms of plot development, e.g. rule systems that describe a story grammar, an implementation of the Engagement-Reflection model of creative writing [15], and long-range dependencies established through emotional links between the characters. These very rich deep structures, which incorporate mechanisms informed by human writers, provide insights into creative computation, yet the resulting plots are too reflective of the static deep structures and cannot give rise to rich and diverse stories on the surface level of text. In contrast, large language models such as GPT-3 [3] are argued to excel at various language tasks and to possess writing skills that can compete with those of basic human writers [16]. Nonetheless, neural models fail to make the underlying creative processes explicit or accessible. Moreover, the lack of symbolic hooks prevents performers from retrieving or inserting further instructions into the generated stories. In short, GPT models provide excellent surface forms, but no deep structures.

The Scéalextric story generator [13] can be placed conceptually between these systems. It is a knowledge-based system with declarative knowledge structures. The system has an easily extensible, symbolic knowledge base that can provide access points for performers to include mixed modalities such as emotion, movement and gestures. Its deep structure is based on causally connected plot points, derived from Cook's PLOTTO book [17], which contains a large set of plausible plot elements linked in a coherent fashion. These basic plot elements can be further augmented with idiomatic renderings retrieved from large language models. Providing easily accessible deep structures and an extensible surface structure, a system such as Scéalextric is a viable candidate to catalyse a creative fusion between computational (neural) story generation and coherent embodied performances with robots as actors [18].
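To make the notion of declarative knowledge structures concrete, the following minimal sketch shows how causally connected action triplets and their idiomatic renderings could be represented. The data format, class and function names are illustrative assumptions made for this paper, not the actual Scéalextric knowledge-base format.

```python
# Illustrative sketch of a Scealextric-style declarative plot structure:
# plot elements are A-action-B triplets, chained through a causal
# successor relation, with idiomatic surface renderings attached as an
# extensible symbolic layer. The format is hypothetical.
from dataclasses import dataclass, field

@dataclass
class PlotAction:
    verb: str                     # symbolic action label, e.g. "gave_kiss_to"
    rendering: str                # idiomatic surface template
    successors: list[str] = field(default_factory=list)  # causally plausible next actions

# A tiny hand-written knowledge base; the real system derives thousands
# of such causal links (e.g. from Cook's PLOTTO).
KB = {
    "gave_kiss_to": PlotAction("gave_kiss_to", "{A} gave {B} a romantic kiss on the lips.",
                               successors=["fell_in_love_with"]),
    "fell_in_love_with": PlotAction("fell_in_love_with", "{B} fell in love with {A}.",
                                    successors=["married"]),
    "married": PlotAction("married", "{A} and {B} were married."),
}

def render_plot(verbs: list[str], A: str, B: str) -> list[str]:
    """Render a causally chained verb sequence as surface text."""
    return [KB[v].rendering.format(A=A, B=B) for v in verbs]

# A 3-element plot: A-gave-kiss-to-B, B-fell-in-love-with-A, A-married-B
print("\n".join(render_plot(["gave_kiss_to", "fell_in_love_with", "married"], "A", "B")))
```

The symbolic verbs are exactly the access points mentioned above: a performer (or a gesture module) can key gestures, emotions or movements to them without touching the surface text.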
Gestures and Movement in Computational Storytelling
So far, we have identified a suitable system to generate the fabula, but storytelling requires the plot to be enacted. Guckelsberger et al. [10] review how various studies in CC have used robotic or virtual agents to enact creative performances such as storytelling, improv or dance. Many modalities can be chosen to enact a textual plot, e.g. movement of the body, sound, images, facial expressions or gestures. Among the non-verbal expressions, gestures and body movement have been studied by Rond et al. [19] through ImprovBot. Their robot uses simple schematic movements in a narrative context to enhance an improv performance. Likewise, the rapping robot Shimon [20] uses gestures as building blocks for musical creation and synchronisation with human collaborators. In robotic storytelling, Ham et al. [21] craft 21 gestures along with 8 gazing behaviours in collaboration with a professional stage actor. These non-verbal movements are assessed in a storytelling context, and the authors provide empirical evidence for a positive, synergistic effect of gesture and gaze. Similarly, studies by Sugimoto et al. [22] and Catala et al. [23] highlight the various positive effects of moving actors/robots in different storytelling contexts.

4. Arguments for Robotic Movement

Our research argues that gestures and spatial movement can not only enhance artificial storytelling performances, but that they can and should be used in any robotic performance that can afford to do so. Here, we outline three core arguments derived from our research:

1. Conceptual structures can be explicated, supporting the study of creative performance.
2. The interpretation of a performance can be controlled in more detail.
3. Coherence in gestures and movement can only enhance performances.

Conceptual Structures
Whether improv, dance or storytelling, a performing agent uses its means of verbal and bodily expression to enact concepts. Those concepts must undergo changes and abstractions in order to be translated onto the stage. In line with the performative motto "Show, don't tell", spatial movement and gestures can show in order to explicate, underline or contradict what is told. Spatial movement and gestures can be highly schematic, and their meaning is grounded in physical experience. Various studies have shown how humans are able to interpret and appreciate the meaning and use of simple spatial movements in robot performances (for drones [24]) [25, 26]. In order to explicate the conceptual structures of an underlying idea (e.g. a plot element), the image-schematic nature of spatial movement and gestures needs to be applied.
This relies on the assumption that schemas such as UP, DOWN, NEAR or FAR are fundamental semantic building blocks of human reasoning and concept formation [27]. Image schemas structure spatial movement, and evidence suggests they also structure gestures [28, 29]. McNeill [30] describes them as a fundamental asset for studying humans' conceptualising capacities. The related works in textual story generation show that most symbolic story generators produce rich conceptual structures but cannot compete with the high-quality surface-level texts of neural language models, which in turn lack access to low-level representations. Spatial movement and gestures can insert themselves between generated conceptual structures and surface-level text. We have shown this ability in various studies in the context of storytelling [26, 31, 32] and argue here that it can also translate to other performative domains.

Interpretation
Spatial movement and gestures allow the performing agent to connect with the human observer's knowledge of physical experience. As an example, the robotic agent relies on the human's experience of gravity, proximity, depth and locomotion. Intuitions about these experiences link the performer with the observer. The performer can show actions which would otherwise only be told; for example, "She will give him a warm welcome" can be depicted with a simulated hug or open arms for one actor, or an actual hug for two performing agents. The same message can be construed using spatial movements or gestures. Alternatively, the actor might repeatedly slam its fist into its open hand. This metaphorical gesture makes use of the FORCE/CONTACT schema, which construes the message to mean a violent welcome. Lastly, two actors can move further apart or closer together in order to indicate whether the message is meant to be threatening or affectionate. In that case, the spatial movement realises the metaphor PHYSICAL DISTANCE is EMOTIONAL DISTANCE. Further examples of how an interpretation can be construed using movement and gestures are described in the Scéalextric storytelling performance in the final section.

Coherence
Adding multiple modalities (e.g. gestures, movement) to a performance arguably distributes the audience's attention. Hence, it is important that any additional augmentation of the underlying script only enhances the expressivity on stage and does not diminish or dilute it. Moreover, multiple modalities should work in synchrony and coherence in order to support or modulate each other. In fact, gestures depend on and provide context for accompanying speech acts [33]. With these properties, gestures can improve the performance by creating meaningful connections between verbal and non-verbal acts. Likewise, if spatial movement is used in accordance with verbal acts, even in metaphorical ways (e.g. PHYSICAL DISTANCE is EMOTIONAL DISTANCE), our studies have shown that the audience appreciates the combination of coherent spatial movement and gestures over either movement type alone [26]. Ultimately, movements and gestures allow the actor to tap into the audience's embodied intuitions when expressing conceptual structures that hide behind a mere textual presentation of a plot.
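To illustrate how such schemas could be operationalised, the sketch below pairs story-level cues with schematic movement primitives and realises PHYSICAL DISTANCE is EMOTIONAL DISTANCE as a simple valence-to-distance function. The lexicon entries, names and parameter ranges are hypothetical assumptions, not tied to any particular robot platform.

```python
# Hedged sketch: mapping image schemas to schematic robot movements and
# realising the metaphor PHYSICAL DISTANCE is EMOTIONAL DISTANCE.
# All names and values are illustrative assumptions.
from enum import Enum

class Schema(Enum):
    UP = "up"
    DOWN = "down"
    NEAR = "near"
    FAR = "far"
    FORCE = "force"

# A small lexicon linking story-level cues to schematic movement primitives.
GESTURE_LEXICON = {
    "triumph":  [Schema.UP],                  # raise the arms
    "defeat":   [Schema.DOWN],                # lower head and torso
    "greeting": [Schema.NEAR],                # approach with open arms
    "threat":   [Schema.FORCE, Schema.NEAR],  # slam fist into open hand, step in
}

def actor_distance(valence: float, min_d: float = 0.3, max_d: float = 1.5) -> float:
    """Map emotional valence in [-1, 1] to a distance in metres:
    affection draws the actors together, hostility drives them apart."""
    hostility = (1.0 - valence) / 2.0   # 0.0 affectionate, 1.0 hostile
    return min_d + hostility * (max_d - min_d)

print(GESTURE_LEXICON["greeting"], f"{actor_distance(0.8):.2f} m apart")
```

The same cue can thus be construed differently: a "greeting" rendered with NEAR reads as affectionate, whereas the same line paired with FORCE/CONTACT or a growing inter-actor distance reads as threatening.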
Figure 1: The two robotic actors augment the generated plot with gestures and spatial movement in order to explicate conceptual structures and coherence and to facilitate metaphorical interpretation.

5. Evidence from Computational Storytelling Performances

The implementation of a system that produces robotic storytelling performances, and which can be scaled to include multiple agents with varying degrees of embodiment and movability, has been presented in our work on Scéalability [34]. This research exemplifies the proposed strategy of utilising spatial movement and gestures in order to express coherence, explicate conceptual structures and provide pathways for metaphorical interpretation for audience and actors.

The underlying story generator combines plot elements in a causally coherent manner and provides a chain of these elements as plot sequences. Each plot element is an action triplet A-action-B, with A and B as placeholders for the two characters in each plot. A resulting 3-element plot might be: A-gave-kiss-to-B, B-fell-in-love-with-A, A-married-B. This underlying conceptual structure of the plot is then augmented with various modalities. A and B become characters from the same fictional universe by consulting Veale's NOC knowledge base [35]. The actions (e.g. gave-kiss) are rendered as idiomatic sentences (e.g. "A gave B a romantic kiss on the lips"), augmented with appropriate movements (e.g. one actor gives the other a kiss), and the emotional valence of the action is registered (e.g. A forms a romantic bond with B).

Over the course of the performance, these semantic insertions are tracked and allow the actors to choose alternative modes of enactment. For example, a character in the story can be repeatedly rebuffed by the other character and might grow increasingly frustrated. This increasing emotional pressure should be reflected in the actors' motion and mode of enactment. In [35], we show how access to the emotional valence allows the performing actors to insert irony or metaphorical expressions into their movement. For example, a deteriorating relationship can be reflected by increasing the physical distance between the actors, or a "greeting" can be construed as hostile when accompanied by a repeated slamming of the fist into the open hand. A video of the performance is available at bit.ly/2Ud0GYx, and a snapshot can be seen in Figure 1, which shows the schematic, spatial movement. In the figure, the schemas within the gestures have been marked with arrows.
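A minimal sketch of this valence bookkeeping, under stated assumptions: each enacted action shifts the running valence between the two characters, and the accumulated value selects the mode of enactment. The valence scores, thresholds and mode labels are hypothetical placeholders rather than the actual Scéalability implementation.

```python
# Hypothetical valence tracking: repeated rebuffs push the running
# valence down until the staging switches to a hostile mode
# (e.g. widening the distance between the actors).
ACTION_VALENCE = {"gave_kiss_to": 0.5, "rebuffed": -0.4, "married": 0.8}

def enactment_mode(valence: float) -> str:
    """Choose how the next action is staged, given the running valence."""
    if valence < -0.5:
        return "hostile: move apart, slam fist into open hand"
    if valence > 0.5:
        return "affectionate: move closer, open arms"
    return "neutral: face partner, use plain gesture"

valence = 0.0
for action in ["gave_kiss_to", "rebuffed", "rebuffed", "rebuffed"]:
    valence += ACTION_VALENCE.get(action, 0.0)
    print(f"{action:>12} -> valence {valence:+.1f}: {enactment_mode(valence)}")
```

Running the loop shows the frustration building across the plot: the valence drifts from neutral into hostile territory, at which point the actors would begin to drift apart on stage.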
6. Conclusion

In this paper, we argued that the use of gestures and spatial movement in embodied performances can connect the discrete conceptual structures of plot generation with more abstract means of performance. The use of the moving body enables performing agents to explore alternative interpretations, manifestations of conceptual structures and overall coherence, which is known to enhance their expressivity. Novel approaches in AI are distorting the perception of creative text generation (i.e. high-quality generated text is increasingly perceived as creative), but a true understanding of the deeper semantic structures remains hidden within their computational intractability. Ultimately, this research suggests that embodiment (here: spatial movement and gestures) can play a vital role in explicating hidden structures of meaning. While the presented example is built upon a symbolic approach to storytelling, the principles of coherence, interpretation and conceptual structures are applicable to other forms of performance (e.g. dance, improv) and other types of generative systems (e.g. neural language models). The implementation of these suggestions is the subject of future work.

Acknowledgments

The robotic experiments referenced in this work and the initial research proposal were conducted at UCD (Dublin, IRL) under the supervision of Prof. Veale at the Creative Language Systems Group.

References

[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[4] T. Winters, P. Delobelle, Survival of the wittiest: Evolving satire with language models, in: ICCC, 2021, pp. 82–86.
[5] Y. Agafonova, A. Tikhonov, I. P. Yamshchikov, Paranoid transformer: Reading narrative of madness as computational approach to creativity, Future Internet 12 (2020) 182.
[6] A. Calderwood, V. Qiu, K. I. Gero, L. B. Chilton, How novelists use generative language models: An exploratory user study, in: HAI-GEN+ user2agent @ IUI, 2020.
[7] F. J. Varela, E. Thompson, E. Rosch, The embodied mind, revised edition: Cognitive science and human experience, MIT Press, 2017.
[8] L. Foglia, R. A. Wilson, Embodied cognition, Wiley Interdisciplinary Reviews: Cognitive Science 4 (2013) 319–325.
[9] T. Veale, F. A. Cardoso, Computational creativity: The philosophy and engineering of autonomously creative systems, Springer, 2019.
[10] C. Guckelsberger, A. Kantosalo, S. Negrete-Yankelevich, T. Takala, et al., Embodiment and computational creativity, in: International Conference on Computational Creativity, 2021.
[11] P. Gervás, Computational approaches to storytelling and creativity, AI Magazine 30 (2009) 49–49.
[12] J. E. Grimes, Linguistic and anthropological projects using the computer, The Use of Computers in Anthropology (1965) 515–516.
[13] T. Veale, Déjà vu all over again, in: Proceedings of the International Conference on Computational Creativity, 2017.
[14] R. Pérez y Pérez, MEXICA: a computer model of creativity in writing, Ph.D. thesis, University of Sussex, 1999.
[15] M. Sharples, How we write: Writing as creative design, Routledge, 2002.
[16] K. Elkins, J. Chun, Can GPT-3 pass a writer's Turing test?, Journal of Cultural Analytics 5 (2020) 17212.
[17] W. Cook, PLOTTO: the master book of all plots, Tin House Books, 2011.
[18] P. Wicke, Computational storytelling as an embodied robot performance with gesture and spatial metaphor, Ph.D. thesis, University College Dublin, School of Computer Science, 2021.
[19] J. Rond, A. Sanchez, J. Berger, H. Knight, Improv with robots: creativity, inspiration, co-performance, in: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), IEEE, 2019, pp. 1–8.
[20] R. Savery, L. Zahray, G. Weinberg, Shimon the rapper: A real-time system for human-robot interactive rap battles, arXiv preprint arXiv:2009.09234 (2020).
[21] J. Ham, R. Bokhorst, R. Cuijpers, D. v. d. Pol, J.-J. Cabibihan, Making robots persuasive: the influence of combining persuasive strategies (gazing and gestures) by a storytelling robot on its persuasive power, in: International Conference on Social Robotics, Springer, 2011, pp. 71–83.
[22] M. Sugimoto, T. Ito, T. N. Nguyen, S. Inagaki, Gentoro: a system for supporting children's storytelling using handheld projectors and a robot, in: Proceedings of the 8th International Conference on Interaction Design and Children, 2009, pp. 214–217.
[23] A. Catala, M. Theune, H. Gijlers, D. Heylen, Storytelling as a creative activity in the classroom, in: Proceedings of the 2017 ACM SIGCHI Conference on Creativity and Cognition, 2017, pp. 237–242.
[24] A. Bevins, B. A. Duncan, Aerial flight paths for communication: How participants perceive and intend to respond to drone movements, in: Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021, pp. 16–23.
[25] H. Nakanishi, Y. Murakami, D. Nogami, H. Ishiguro, Minimum movement matters: impact of robot-mounted cameras on social telepresence, in: Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, 2008, pp. 303–312.
[26] P. Wicke, T. Veale, The show must go on: On the use of embodiment, space and gesture in computational storytelling, New Generation Computing 38 (2020) 565–592.
[27] G. Lakoff, Women, fire, and dangerous things: What categories reveal about the mind, University of Chicago Press, 2008.
[28] A. Cienki, Image schemas and mimetic schemas in cognitive linguistics and gesture studies, Review of Cognitive Linguistics 11 (2013) 417–432.
[29] I. Mittelberg, Gestures as image schemas and force gestalts: A dynamic systems approach augmented with motion-capture data analyses, Cognitive Semiotics 11 (2018).
[30] D. McNeill, Hand and mind, Advances in Visual Semiotics (1992) 351.
[31] P. Wicke, T. Veale, Show, don't (just) tell: Embodiment and spatial metaphor in computational story-telling, in: ICCC, 2020, pp. 268–275.
[32] P. Wicke, T. Veale, Walk the line: Digital storytelling as embodied spatial performance, in: 7th Computational Creativity Symposium at AISB, 2020.
[33] S. D. Kelly, D. J. Barr, R. B. Church, K. Lynch, Offering a hand to pragmatic understanding: The role of speech and gesture in comprehension and memory, Journal of Memory and Language 40 (1999) 577–592.
[34] P. Wicke, T. Veale, Interview with the robot: Question-guided collaboration in a storytelling system, in: ICCC, 2018, pp. 56–63.
[35] T. Veale, P. Wicke, Metaphor, blending and irony in action: Creative performance as interpretation and emotionally-grounded choice, in: ICCC, 2021, pp. 319–326.