=Paper=
{{Paper
|id=Vol-2473/invited1
|storemode=property
|title=Word Guessing Game with a Social Robotic Head
|pdfUrl=https://ceur-ws.org/Vol-2473/invited1.pdf
|volume=Vol-2473
|authors=Štefan Beňuš,Róbert Sabo,Marián Trnka
|dblpUrl=https://dblp.org/rec/conf/itat/BenusST19
}}
==Word Guessing Game with a Social Robotic Head==
Štefan Beňuš1,2, Róbert Sabo2, Marián Trnka2
1 Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, 949 01 Nitra, Slovakia
2 Institute of Informatics of the Slovak Academy of Sciences, Dúbravská cesta 9, 841 04 Bratislava, Slovakia
sbenus@ukf.sk, robert.sabo@savba.sk, trnka@savba.sk

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In this paper we address three limitations of our previous implementations of human-machine spoken interaction in Slovak: low prosodic variability, limited naturalness, and deployment in real, acoustically challenging situations. We designed a word-guessing game in which subjects provide verbal cues for a target animal and the social robotic head Furhat guesses the animal. We then deployed it at an open-air science festival with over 60 subjects playing the game. We describe the implementation and initial observations from the design, deployment, and user evaluation.

1. Introduction

Spoken interactions between humans and machines (HMI) are becoming ever more common in everyday life. Predictions for the near future include the widespread deployment not only of personal assistants in smartphones but also of companions for the social well-being of senior citizens, applications that participate in diagnosing health issues, or teaming between humans and robots for various tasks. While most of the research in this area is done in English or other major languages, understanding both the cognitive and the engineering aspects of deployment in less studied languages is also warranted. In this paper we address limitations of our previous implementations of human-machine spoken interaction in Slovak and describe the implementation of, and initial observations from, a novel research tool for spoken interaction between humans and robots. Our long-term goal is to better understand the intricacies and challenges of HMI in Slovak and thus to inform designers developing real-world applications in this area.

In previous experiments with spoken HMI in our lab we used a simple card game, Go Fish, with a closed set of questions the user could ask the machine, or a one-person adventure game motivated by Harry Potter, with less constrained options for the user to respond to the machine's prompts. Our experience pointed to three areas of limitation relevant to the current study. First, when humans interacted with spoken dialogue systems in these task-oriented, game-like scenarios ([1], [2], [3]), they tended to use speech with limited prosodic variability and engagement. We hypothesized that the absence of emotional attachment and social contact prior to the target dialogues might negatively affect both the prosodic variability of the humans and the potential for speech entrainment between the humans and the machines ([2], [4]).

Second, the limited naturalness of the dialogues in our previous scenarios could also be attributed to the lack of backchanneling and conversational fillers, which are ubiquitous in human-human dialogues but were missing in our previous implementations.

Third, the experiments were conducted in a laboratory environment without testing deployment in real environments and situations and in adverse acoustic conditions. Transferring HMI applications from the laboratory to real environments and testing their usability is important if spoken HMI technology is to serve a wide range of users and situations.

All of these potential limitations are addressed in our new experimental setup, in which we designed a simple game, 'Guess my animal', and tested how people interact with the socially expressive robotic head Furhat with implemented backchanneling behavior in the acoustically challenging conditions of an open-air science festival with high babble noise and loudspeaker noise. Additionally, we tested the relevance of small talk prior to the target interaction, which was hypothesized to facilitate the establishment of positive social contact, the engagement of the users, and the perceived naturalness of the interaction.

2. Methods

We designed a simple game, 'Guess my animal', in which a user selects a card with the name of an animal, provides cues and information about the animal without revealing its identity, and the machine guesses the animal. We employed the robotic head Furhat equipped with Slovak TTS and ASR and deployed it at an open-air science festival with over 60 subjects playing the game. In this section we describe in turn the hardware (the robotic head and its setup), the software (ASR, TTS, and the implementation of the game), and the procedure for data collection.
2.1. Hardware

2.1.1. Furhat robotic head

The robotic head FURHAT is a 3D humanoid agent that employs the optical projection of an animated facial model ([5], furhatrobotics.com). The neck has two degrees of freedom, which enables simple gestures like nodding or head shaking and the movement of the head in any direction within a reasonable viewing angle. The flexibility of the facial animation model, allowing for gaze changes, eyebrow movement, blinking, syncing lip movements with speech, or various emotional gestures like disgust or happiness, together with the neck flexibility, enables FURHAT to participate in social spoken interactions with humans and to signal various intents and behaviors.

In terms of hardware, FURHAT is a computer with a mounted model of a human head; the face is back-projected onto the front mask (see Fig. 2). In our experiment we used the 1st generation of FURHAT [https://docs.furhat.io/gen1]. A camera or a Microsoft Kinect sensor with face recognition allows tracking the faces of users who enter or remain in Furhat's visual field, and the subsequent control of head movement and gaze for eye-contact functionality. We used the default artificial face 'Bertil', and the control of the visual modality was kept to a minimum in the current experiment.

FURHAT comes pre-equipped with various software features. Primarily, it uses commercially available speech recognition (ASR), e.g. by Google, and speech synthesis (TTS), e.g. by Amazon. Our long-term goal is to use Slovak ASR and TTS in experiments investigating how modifications of TTS and improvements in ASR affect user experience. For this reason, although Slovak ASR is available from Google, we equip FURHAT with our own Slovak TTS and ASR so that we have greater control over the characteristics and deployment of Slovak ASR and TTS.

The robot can be programmed using the FURHAT Legacy SDK [http://www.furhat.io], a statechart XML-based framework for developing multi-modal interactive systems that evolved from the original IrisTK system [6]. This SDK is designed to be a powerful framework for social robotics applications. In our initial effort, however, we chose to control FURHAT over TCP/IP using the Broker functionality [https://docs.furhat.io/gen1/tutorials/tutorial_broker_cshar]. This solution allows us to develop the game in any programming language and, for the time being, to circumvent technical problems in implementing our own Slovak ASR on FURHAT.
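To make this design choice concrete, the sketch below shows the kind of thin TCP client an external game process could use to send events to the robot. It is a minimal illustration only: the host, port, and the newline-delimited JSON message layout are our assumptions for the example, not the actual Gen1 Broker event format, which is defined in the tutorial linked above.

```python
# Minimal sketch of driving Furhat from an external game process over TCP/IP.
# The host, port, and JSON event layout below are illustrative placeholders,
# NOT the actual Gen1 Broker protocol (see the linked broker tutorial).
import json
import socket

BROKER_HOST = "192.168.1.10"   # hypothetical address of the Furhat computer
BROKER_PORT = 1932             # hypothetical broker port

def send_event(sock: socket.socket, event: dict) -> None:
    """Serialize one event as a newline-delimited JSON message."""
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))

def main() -> None:
    with socket.create_connection((BROKER_HOST, BROKER_PORT)) as sock:
        # Ask the robot to speak a Slovak prompt and then nod (placeholder event types).
        send_event(sock, {"type": "say", "text": "Povedz mi niečo viac."})
        send_event(sock, {"type": "gesture", "name": "nod"})

if __name__ == "__main__":
    main()
```

Because the game logic lives in this external process, only a small adapter like the one above depends on the robot interface, which is what allows the game to be written in any language.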
2.1.2. Hardware setup

The architecture of the hardware used in the current experiment is illustrated in Fig. 1. The human user sits in front of the robot at a distance of about one meter, and his/her speech is recorded with a head-mounted microphone connected to a laptop. The recorded sound is sent to the ASR server, which outputs a text transcript of the recognized utterances. The game itself and the dialogue manager (see Section 2.2.3) process the ASR output and select the robot's behavior, consisting of speech, a non-verbal audio expression such as a hesitation or a backchannel, a visual expression of the face, or a combination of the above.

Fig. 1. Schematized experimental setup.

2.2. Software

2.2.1. Slovak ASR

Slovak ASR was implemented with the Kaldi speech recognition toolkit [7]. We used a generic language model trained on a newspaper database collected from the internet, with a vocabulary of 550k items [8], and an acoustic model trained on TV news [9], [10]. The advantage of TV-news speech is that it contains spontaneous speech and background noise similar to our experiment. For this first experiment, the acoustic and language models were not adapted in any way to the game domain. To make communication with the robot as natural as possible and to minimize the time delay, the recognizer runs in online mode; hence, speech was recognized continually.

2.2.2. Slovak TTS

To make speech synthesis fast, we used a statistical text-to-speech system in which speech is synthesized directly from hidden Markov models [11], [12]. The statistical model for the Slovak female voice is trained on a newly created, phonetically balanced speech corpus consisting of 10k sentences spoken by a professional actor. The TTS was implemented in FURHAT using Microsoft Speech API 5.4, which supports syncing lip movements with audio through 21 visemes that were mapped to the Slovak ones [13].

2.2.3. Game design and implementation

The game is construed as a research tool for investigating various aspects of human-machine spoken interaction that can be deployed in, and adjusted to, various situations. For this reason, in this first implementation we aimed at a simple algorithm that tests the functionality of the basic building blocks rather than a sophisticated and complex dialogue manager.

The basis of the game is a table that lists the animals and their keywords. There were 12 animals in total in this version of the game. We grouped the keywords into seven attribute categories: {legs, food, area, species, characteristics, look, alternative}. We pre-tested the game with several naïve subjects and added the keywords used by these pilot subjects. The subset of the table relevant for 'tiger' is illustrated in Table 1.

Table 1. Attributes and keywords for the animal "tiger"
Legs: four, four-legged
Food: meat, carnivore
Area: Asia, Asian, India, Indian, Bangladesh, Russia
Species: mammal, vertebrate, felidae, beast, cat, feline
Characteristic: hunter, hunt, predator, forest, prairie, jungle, rainforest, taiga, teeth, endangered
Look: stripe, black, brown, orange, white, large
Alternative: leopard
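One possible in-memory encoding of the 'tiger' row from Table 1 is sketched below. The dictionary layout and the variable name are ours, chosen purely for illustration; the paper does not prescribe a particular data structure.

```python
# Illustrative in-memory encoding of the attribute/keyword table (cf. Table 1).
# Keywords are stored as lemmas so that they can later be matched against the
# lemmatized ASR output.
ANIMAL_KEYWORDS = {
    "tiger": {
        "legs": {"four", "four-legged"},
        "food": {"meat", "carnivore"},
        "area": {"asia", "asian", "india", "indian", "bangladesh", "russia"},
        "species": {"mammal", "vertebrate", "felidae", "beast", "cat", "feline"},
        "characteristic": {"hunter", "hunt", "predator", "forest", "prairie",
                           "jungle", "rainforest", "taiga", "teeth", "endangered"},
        "look": {"stripe", "black", "brown", "orange", "white", "large"},
        "alternative": {"leopard"},
    },
    # ... the remaining 11 animals follow the same layout
}
```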
The ASR output is fed to a lemmatization process for Slovak [14], since Slovak has a rich system of inflectional morphology. The algorithm then searches the lemmatized ASR output for keywords representing the attributes. If a relevant keyword is recognized, the total score of each animal listing that keyword is increased. Commonly, a keyword is applicable to several animals, e.g. 'big teeth' can describe both the tiger and the crocodile, but some attributes are assumed to be unique to a single animal, e.g. 'meow' should be used only with the cat. After pre-testing, we also included special keywords such as reptile, mammal, amphibian, bird, insect, carnivore, four-legged, two-legged, bark, meow, wing, fly, ride, etc. These form a subset of the keywords such that an animal listing one of them gains an extra point, while an animal not listing it has one point deducted from its score.
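The scoring step described above can be summarized in a few lines. The sketch below is a simplified reading of the algorithm that uses the illustrative table from the previous listing; the `lemmatize()` stub stands in for the Slovak lemmatizer [14], and the exact point values and bookkeeping are our simplification, not the actual implementation.

```python
# Simplified sketch of the keyword-scoring step described above.
# `lemmatize` is a stub standing in for the Slovak lemmatizer [14].
from collections import Counter

SPECIAL_KEYWORDS = {"reptile", "mammal", "amphibian", "bird", "insect", "carnivore",
                    "four-legged", "two-legged", "bark", "meow", "wing", "fly", "ride"}

def lemmatize(text: str) -> list[str]:
    """Placeholder: a real system would call the Slovak lemmatizer here."""
    return text.lower().split()

def update_scores(asr_output: str, scores: Counter, table: dict) -> None:
    """Increase the score of every animal whose keywords occur in the utterance.

    Special keywords additionally add an extra point to animals that list them
    and deduct one point from animals that do not.
    """
    lemmas = set(lemmatize(asr_output))
    for animal, attributes in table.items():
        keywords = set().union(*attributes.values())
        scores[animal] += len(lemmas & keywords)
        for special in lemmas & SPECIAL_KEYWORDS:
            scores[animal] += 1 if special in keywords else -1
```

FURHAT's candidate guess is then simply the animal with the highest running score, subject to the progression parameters described next.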
The progress from the start of the game to FURHAT's guess is controlled by several internal parameters. First, the player must use a minimal number of words. If this limit is not reached and FURHAT detects silence, it prompts the player with general cues such as 'Tell me something more' or 'What else?'. Second, a minimal number of questions from FURHAT has to be asked. In these questions, FURHAT takes the animal with the highest keyword score as the most probable guess, checks the attributes for which it has detected no keywords, and asks randomly about one of these attributes, for example, 'Tell me how many legs this animal has' or 'Where does this animal live?'. Third, FURHAT tries to achieve a minimal difference between the animal with the highest score and the next best one. The interaction of these three parameters keeps the game to a manageable length and guarantees dialogue-like turn exchanges and sufficient input from the player. FURHAT then proceeds to the guess, which is either unique, if a single animal reaches the highest score, or includes alternatives if multiple animals reach the top score. FURHAT checks with the player whether its guess was correct and, based on the ASR of the response, updates the count of correct responses.

In addition to the internal parameters, two external parameters were also implemented. First, to test the effect of positive social and emotional engagement between the player and the robot on the subsequent spoken interaction, we designed an introductory small talk. Prior to playing the game, the user can interact with FURHAT in a short mini-dialogue. It consists of (1) FURHAT introducing itself and prompting for the user's name, commenting on the beauty of the name and, if the name is correctly recognized, addressing the user with this name throughout the game, (2) FURHAT asking about the user's birthplace, using it in its response if correctly recognized and commenting positively about the place, and (3) FURHAT asking about the recent ice-hockey world championship and the user's preferred team, again expressing a positive comment about that team. We hypothesized that this small talk establishes positive emotion and social rapport of the user toward FURHAT, since it expresses positive comments about the user's name, birthplace, and favorite team. The interaction could thus be initiated either with this small talk or without it, proceeding directly to the game.

Second, to enable testing the effect of backchannel use on the naturalness of the interaction, we selected four instances of the backchannel 'mhm' from the corpus used to train the female TTS voice (Section 2.2.2). If a short silence was detected in the running ASR, FURHAT produced one of these backchannels at random and simultaneously nodded its head. We hypothesized that this behavior would naturally prompt more input from the user and would remain unobtrusive even if it resulted in overlapping speech of the user and FURHAT. In the current setup, this parameter was always active.
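As a rough illustration of this backchanneling behavior, the sketch below triggers a random pre-recorded 'mhm' together with a nod whenever the running ASR has reported a short silence. The silence threshold, the file names, and the `RobotStub` interface are assumptions made for the example, not the actual implementation.

```python
# Illustrative backchannel trigger: on a short ASR silence, play one of the
# pre-selected 'mhm' tokens and nod. Threshold, file names, and the robot
# interface are assumptions for this sketch, not the actual implementation.
import random
import time

BACKCHANNELS = ["mhm_1.wav", "mhm_2.wav", "mhm_3.wav", "mhm_4.wav"]  # tokens cut from the TTS corpus
SILENCE_THRESHOLD_S = 1.0  # hypothetical value; the paper does not report it

class RobotStub:
    """Stand-in for the robot control interface (hypothetical methods)."""
    def play_audio(self, filename: str) -> None:
        print(f"[robot] playing {filename}")
    def gesture(self, name: str) -> None:
        print(f"[robot] gesture {name}")

def maybe_backchannel(robot: RobotStub, last_asr_activity: float) -> None:
    """If the user has been briefly silent, play a random 'mhm' and nod."""
    if time.monotonic() - last_asr_activity > SILENCE_THRESHOLD_S:
        robot.play_audio(random.choice(BACKCHANNELS))
        robot.gesture("nod")
```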
2.3. Procedure

If a passer-by was interested in interacting with FURHAT and FURHAT was not engaged with another player, the experimenter seated the subject in front of FURHAT and fitted him/her with a head-mounted microphone. After briefly explaining the rationale of the game, the experimenter started the external recorder and prompted the subject to try the game with the experimenter. The goal was not only for the subject to get comfortable with giving cues about animals, but also to obtain a baseline of the user's speaking behavior. For later experiments, we plan to use this baseline to test how speaking behavior changes when interacting with an experimenter versus with FURHAT.

After this trial run, the experimenter explained that the subject would play the game three times with FURHAT and that the experimenter would not participate in the game, and prompted the subject to use full sentences and rich input for FURHAT. Importantly, a small gift was promised if FURHAT succeeded in guessing the animals.

Before each game, the subject selected a card with the target animal from a set of cards offered by the experimenter.

Fig. 2. Playing the game "Guess my animal" at the open-air science festival "Weekend with the Slovak Academy of Sciences".

After completing the three rounds of the game, the experimenter asked the subject to fill out a brief post-test questionnaire primarily assessing two domains on a Likert scale from 1 (positive) to 5 (negative): FURHAT's speaking behavior (natural – not natural) and FURHAT's abilities (smart – dumb).

3. Results and observations

We start with the only quantifiable result available at the moment, the effect of small talk on the interactions, and continue with informal observations regarding the functionality of the proposed setup.

Our logs show that 73 subjects completed the full three rounds of the game, and we have 61 filled-out post-test questionnaires. We calculated the total word and syllable counts in all the recognized speech of the subjects from the logs. We take these as a proxy measure of the users' engagement in the interaction with FURHAT. Additionally, we have the subjective evaluations from the post-test questionnaires.

Of the 73 complete logs, 45 subjects started with the small talk and 28 proceeded directly to the game. A Welch two-sample t-test showed that interactions with small talk included significantly more speech from the player than those without it: t[69.95] = -2.93, p = 0.0046 for syllables and t[69.37] = -2.65, p = 0.01 for words. This is shown in Fig. 3 and suggests that small talk has a positive effect on players' engagement with FURHAT in this implicit measure.

Of the 61 post-test questionnaires, 43 subjects played with small talk. We observed a tendency for them to rate FURHAT's speech as slightly less natural than the subjects without small talk did (t[30.4] = -1.77, p = 0.087). It might be that small talk affects the explicitly perceived naturalness of FURHAT's speech and the implicit measure of engagement differently. When people establish greater social rapport with FURHAT in small talk, their expectation of the naturalness of its speaking behavior increases, and this expectation is not always met in the current implementation. This speculation is supported by the results on the effect of small talk on the perceived abilities of FURHAT: subjects with small talk preceding the game rated FURHAT as significantly less capable and smart than those without small talk (t[43.9] = -2.63, p = 0.012).

Fig. 3. The effect of small talk on the number of words and syllables per subject.
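For readers who want to run the same kind of comparison on their own interaction logs, the snippet below shows a Welch two-sample t-test (unequal variances) over per-subject syllable counts with and without small talk, using SciPy. The arrays contain made-up placeholder values for demonstration only, not our measurements.

```python
# Welch two-sample t-test (unequal variances), the test used for the
# engagement comparison above. The arrays are placeholder values, not our data.
import numpy as np
from scipy import stats

syllables_without_smalltalk = np.array([150, 170, 140, 200, 160])  # placeholder values
syllables_with_smalltalk = np.array([210, 180, 250, 300, 220])     # placeholder values

t_stat, p_value = stats.ttest_ind(syllables_without_smalltalk,
                                  syllables_with_smalltalk,
                                  equal_var=False)  # Welch's correction
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```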
The experience with the 2-day deployment of FURHAT and the Guess my animal game at an open-air science festival yielded the following observations. First, we discuss the limitations that we wanted to address. The implementation of the game with FURHAT was perceived very positively and the interaction as quite natural. This was the case both in the questionnaires (mean perceived speech naturalness 1.87 and abilities 1.71 on a scale between 1 and 5) and in the informal reactions of the subjects. The implementation of small talk proved to affect the engagement of people and their evaluation of the interaction. The speaking behavior with backchanneling and conversational fillers was also perceived as very natural. Finally, the fact that the game was functional in such adverse noise conditions (speech from passers-by, loud music, loud presentations for a big audience) is encouraging for future deployments of FURHAT with the general public.

Additional observations relevant for future work include:
• The temporal management of turn-taking should be improved, as this was the only feature that subjects sometimes commented on in negative terms. The main reason for this deficiency was that FURHAT's own speech sometimes served as input to the ASR, which resulted in asynchrony between the subject and FURHAT. This stems from the absence of information about the end of the TTS utterance sent to FURHAT. We will implement voice activity detection and experiment with FURHAT's Broker functionality to address this issue.
• The audio output from the speakers was more natural than using headphones, but in high noise people sometimes did not hear FURHAT very well. A more powerful speaker system is advisable for open-air events.
• The baseline recorded while subjects interacted with the experimenter is not usable in the current setup due to the high environmental noise and the difference between the out-of-game and in-game speech recordings. Adjustments should be made to either use a directional microphone or include the pre-test baseline interaction in the game, so that the recording of the entire session is unified.
• The implementation of backchannels and conversational fillers was well received, but some deployments were not natural, particularly backchannels after questions from the subjects or after subject utterances such as 'I don't know'.
• In this initial setup, the three renditions of the game played in succession did not vary FURHAT's basic utterances, especially those initiating and concluding the game. For greater naturalness we will vary these utterances to limit the perception of FURHAT's speech as 'mechanical'.
• Several people also commented on the confusing gender of FURHAT. The TTS was a female voice and the facial features were gender-neutral, but in Slovak morphology the name 'FURHAT' is associated with the male gender, and in some of FURHAT's utterances the male gender was also used in self-reference. We will implement interactions with either a male or a female FURHAT persona that is consistent in voice quality, visual representation of the face, and speaking behavior.

Acknowledgment

This work was funded by the Slovak Scientific Grant Agency VEGA, grant No. 2/0161/18, "Automatic assessment of acute stress from speech".

References

1. R. Levitan, Š. Beňuš, R. H. Gálvez, A. Gravano, F. Savoretti, M. Trnka, A. Weise, and J. Hirschberg, "Implementing acoustic-prosodic entrainment in a conversational avatar," in Proc. of Interspeech 2016, pp. 1166–1170, 2016.
2. Š. Beňuš, M. Trnka, E. Kuric, L. Matrák, A. Gravano, and J. Hirschberg, "Prosodic entrainment and trust in human-computer interaction," in Proc. of the 9th International Conference on Speech Prosody, pp. 220–224, 2018.
3. Š. Beňuš, M. Patacchiola, M. Trnka, D. Zanatto, R. Sabo, and A. Cangelosi, "Do people trust robots whose prosody synchronizes with the user?" in Šašinka, Č., Strnadová, A., Šmideková, Z., Juřík, V. (eds.), Kognice a umělý život, sborník příspěvků [Cognition and Artificial Life, conference proceedings], pp. 9–10, Brno: Flow, 2018.
4. M. Lohani, C. Stokes, M. McCoy, C. A. Bailey, and S. E. Rivers, "Social interaction moderates human-robot trust-reliance relationship and improves stress coping," in Proc. of the 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 471–472, 2016.
5. S. Al Moubayed, J. Beskow, G. Skantze, and B. Granström, "Furhat: A back-projected human-like robot head for multiparty human-machine interaction," in Esposito, A., Esposito, A. M., Vinciarelli, A., Hoffmann, R., Müller, V. C. (eds.), Cognitive Behavioural Systems, Lecture Notes in Computer Science, vol. 7403, Springer, Berlin, Heidelberg, 2012.
6. G. Skantze and S. Al Moubayed, "IrisTK: a statechart-based toolkit for multi-party face-to-face interaction," in Proceedings of ICMI, Santa Monica, CA, 2012.
7. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in Proc. of ASRU, pp. 1–4, 2011.
8. M. Lojka, P. Viszlay, J. Staš, D. Hládek, and J. Juhár, "Slovak broadcast news speech recognition and transcription system," in Barolli, L., Kryvinska, N., Enokido, T., Takizawa, M. (eds.), Advances in Network-Based Information Systems (NBiS 2018), Lecture Notes on Data Engineering and Communications Technologies, vol. 22, Springer, Cham, 2019.
9. P. Viszlay, J. Staš, T. Koctúr, M. Lojka, and J. Juhár, "An extension of the Slovak broadcast news corpus based on semi-automatic annotation," in Proc. of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, pp. 4684–4687, 2016.
10. M. Pleva and J. Juhár, "TUKE-BNews-SK: Slovak broadcast news corpus construction and evaluation," in Proc. of LREC 2014, Reykjavik, Iceland, ELRA, pp. 1709–1713, 2014.
11. M. Rusko, M. Trnka, S. Darjaa, and J. Hamar, "The dramatic piece reader for the blind and visually impaired," in Proceedings of SLPAT, Grenoble, pp. 83–91, 2013.
12. M. Sulír, J. Juhár, and M. Rusko, "Development of the Slovak HMM-based TTS system and evaluation of voices in respect to the used vocoding techniques," Computing and Informatics, vol. 35, no. 6, pp. 1467–1490, 2016.
13. Microsoft Speech API, https://azure.microsoft.com/en-gb/resources/samples/cognitive-speech-tts/...
14. R. Garabík and M. Šimková, "Slovak morphosyntactic tagset," Journal of Language Modeling, Institute of Computer Science PAS, vol. 0, no. 1, pp. 41–63, 2012.