=Paper=
{{Paper
|id=Vol-2473/invited1
|storemode=property
|title=Word Guessing Game with a Social Robotic Head
|pdfUrl=https://ceur-ws.org/Vol-2473/invited1.pdf
|volume=Vol-2473
|authors=Štefan Beňuš,Róbert Sabo,Marián Trnka
|dblpUrl=https://dblp.org/rec/conf/itat/BenusST19
}}
==Word Guessing Game with a Social Robotic Head==
<pdf width="1500px">https://ceur-ws.org/Vol-2473/invited1.pdf</pdf>
<pre>
                               Word guessing game with a social robotic head
                                            Štefan Beňuš1,2, Róbert Sabo2, Marián Trnka2
                    1 Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, 94901 Nitra, Slovakia
         2Institute of Informatics of the Slovak Academy of Sciences, Dúbravská cesta 9, 841 04 Bratislava, Slovakia
                                     1sbenus@ukf.sk, 2robert.sabo@savba.sk, 2trnka@savba.sk


Abstract. In this paper we address three limitations of our           real environments and situations and in adverse acoustic
previous implementations in human-machine spoken                      conditions. Transferring HMI applications from the
interaction in Slovak: low prosodic variability, limited              laboratory to a real environment and testing their usability is
naturalness, and deployment in real, acoustically challenging
situations. We designed a word-guessing game in which                 important if spoken HMI technology is to be usable by a
subjects provide verbal cues for a target animal and the social       wide range of users and in a wide range of situations.
robotic head Furhat guesses the animal. We then deployed it at           All of these potential limitations are addressed in our
an open-air science festival with over 60 subjects playing the        new experimental setup in which we designed a simple
game. We describe the implementation and initial observations         game ‘Guess my animal’ and tested how people interact
from the design, deployment, and user evaluation.
                                                                      with the socially expressive robotic head Furhat with
1.    Introduction                                                    implemented backchanneling behavior in the real, acoustically
                                                                      challenging conditions of an open-air science festival with high
   Spoken interactions between humans and machines                    babble noise and loudspeaker noise. Additionally, we
(HMI) are becoming ever more common in everyday life.                 tested the relevance of small talk prior to the target
Predictions for the near future include common deployment of          interaction, which was hypothesized to facilitate the
not only personal assistants in smart phones but also                 establishment of positive social contact, engagement of the
companions for social well-being for senior citizens,                 users and perceived naturalness of the interaction.
applications participating in diagnosing health issues, or
teaming between humans and robots for various tasks.                  2.   Methods
While most of the research in this area is done in English or
                                                                        We designed a simple game ‘Guess my animal’ in which a
other major languages, understanding both the cognitive and
                                                                      user selects a card with the name of an animal, provides cues
the engineering aspects of deployment in less studied
                                                                      and information about the animal without revealing its
languages is also warranted. In this paper we address
                                                                      identity, and a machine guesses the animal. We employed
limitations of our previous implementations in human-
                                                                      the robotic head Furhat equipped with Slovak TTS and
machine spoken interactions in Slovak and describe the
                                                                      ASR and deployed it at an open-air science festival with over
implementation and initial observations from a novel
                                                                      60 subjects playing the game. In this section we describe in
research tool for spoken interaction between humans and
                                                                      turn the hardware, the robotic head and its setup, the
robots. Our long-term goal is to understand better the
                                                                      software in terms of ASR, TTS and the implementation of
intricacies and challenges of HMI in Slovak and thus
                                                                      the game, and the procedure for data collection.
inform designers developing real-world applications in this
area.                                                                 2.1. Hardware
   In previous experiments with spoken HMI in our lab we
used a simple card game GoFish, with a closed set of                  2.1.1. Furhat robotic head
questions the user could ask the machine, or a one-person                The robotic head FURHAT is a 3D humanoid agent that
adventure game inspired by Harry Potter, with less                    employs the optical projection of an animated facial model
constrained options for the user to respond to the prompts from       ([5], furhatrobotics.com). The neck uses two degrees of
the machine. Our experience pointed to three areas of                 freedom, which enables simple gestures like nodding or
limitations relevant to the current study. First, when                shaking and the movement of the head in any direction
humans interacted with spoken dialogue systems in these               within a reasonable viewing angle. The flexibility of the
task-oriented game-like scenarios ([1], [2], [3]) they tended         facial animation model, allowing for gaze changes, eyebrow
to use speech with limited prosodic variability and                   movement, blinking, syncing lip movements with
engagement. We hypothesized that the absence of                       speech, or various emotional gestures like disgust or
emotional attachment and social contact prior to the target           happiness, together with the neck flexibility enables FURHAT
dialogues might negatively affect both prosodic variability           to participate in social spoken interactions with humans and
of humans and potential for speech entrainment between                signal various intents and behaviors.
the humans and the machines ([2], [4]).                                  In terms of hardware, FURHAT is a computer with a
   Second, the limited naturalness of the dialogues in our            mounted model of a human head; the face is back-projected
previous scenarios could also be attributed to the lack of            onto the front mask; see Fig. 2. In our experiment we used the
backchanneling and conversational fillers that are so                 1st generation of FURHAT [https://docs.furhat.io/gen1].
ubiquitous in human-human dialogues but were missing in               A camera or a Microsoft Kinect sensor using face recognition
our previous implementations.                                         allows tracking the face of the user(s) who enter or remain
   Third, the experiments were conducted in a laboratory              in FURHAT’s visual field and subsequent control of
environment without testing the deployment capability in              the head movement and gaze for eye contact functionality.

Copyright ©2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
We used the default artificial face ‘Bertil’ and the control of   trained on a newspaper database collected from the internet with
the visual modality was kept at a minimum in the current          550k vocabulary items [8] and an acoustic model trained
experiment.                                                       on TV news [9], [10]. The advantage of speech in TV news
   FURHAT comes pre-equipped with various software                is that it contains spontaneous speech and background
features. Primarily, it uses commercially available speech        noise similar to our experiment. For this first experiment
recognition (ASR), e.g. by Google, and speech synthesis           the acoustic and language models were not adapted in any
(TTS), e.g. by Amazon. Our long-term goal is to use               way to the game domain. To make communication with the
Slovak ASR and TTS in experiments investigating how               robot as natural as possible and to minimize the time delay,
modifications of TTS and improvements in ASR affect user          the recognizer ran in online mode, so speech was
experience. For this reason, although Slovak ASR is               recognized continuously.
available with Google, we equipped FURHAT with our own            2.2.2. Slovak TTS
Slovak TTS and ASR so that we have greater control over
the characteristics and deployment of Slovak ASR and                 To make speech synthesis fast we used a statistical text-
TTS.                                                              to-speech system wherein speech is synthesized directly
   The robot can be programmed using the FURHAT Legacy            from hidden Markov models [11], [12]. The statistical
SDK [http://www.furhat.io], a statechart XML-based                model for the Slovak female voice was trained on a newly
framework for developing multi-modal interactive systems          created phonetically balanced speech corpus consisting of
developed from the original IrisTK system [6]. This SDK is        10k sentences spoken by a professional actor. The TTS was
especially designed to be a powerful framework for social         implemented in FURHAT using Microsoft Speech API 5.4
robotics applications. In our initial effort, however, we         that supports syncing lip movements with audio through 21
chose to control FURHAT over TCP/IP using the Broker              visemes that were mapped to the Slovak ones [13].
functionality [https://docs.furhat.io/gen1/tutorials/             2.2.3. Game design and implementation
tutorial_broker_cshar]. This solution allows us to develop
the game in any programming language and for the time                The game is construed as a research tool for
being to circumvent technical problems in implementing            investigating various aspects of human-machine spoken
our own Slovak ASR onto FURHAT.                                   interactions that can be deployed and adjusted to various
                                                                  situations. For this reason, in this first implementation we
2.1.2. Hardware setup                                             aimed at a simple algorithm to test the functionality of the
   In the current experiment, the architecture of the             basic building blocks rather than a sophisticated and
hardware is illustrated in Fig. 1. The human user is sitting      complex dialogue manager.
in front of the robot at a distance of about one meter and           The basis of the game is the table that lists the animals
his/her speech is recorded with a head-mounted                    and describes their keywords. There were 12 animals in
microphone that is connected to a laptop. The recorded            total in this version of the game. We have grouped the
sound is sent to the ASR server that outputs the text             keywords into seven attribute categories: {legs, food, area,
transcript of the recognized utterances. The game itself and      species, characteristics, look, alternative}. We have pre-
the dialogue manager (see section 2.2.3) process the ASR          tested the game with several naïve subjects and added the keywords
output and select the robot’s behavior consisting of              used by the pilot subjects. A subset of the table relevant for
speech, non-verbal audio expression like hesitation or a          ‘tiger’ is illustrated in Table 1.
backchannel, visual expression of the face, or a
combination of the above.                                              Table 1. Attributes and keywords for the animal “Tiger”

                                                                         Attribute        Keywords

                                                                         Legs             four, four-legged
                                                                         Food             meat, carnivore
                                                                         Area             Asia, Asian, India, Indian, Bangladesh,
                                                                                          Russia
                                                                         Species          mammal, vertebrate, felidae
                                                                         Characteristic   beast, cat, feline, hunter, hunt, predator,
                                                                                          forest, prairie, jungle, rainforest, taiga,
                                                                                          teeth, endangered
                                                                         Look             stripe, black, brown, orange, white, large
                                                                         Alternative      leopard

                                                                     The ASR output is fed to the lemmatization process for
            Fig. 1. Schematized experimental setup                Slovak [14] since Slovak has a rich system of inflectional
                                                                  morphology. The algorithm then searches for keywords
2.2. Software                                                     representing the attributes in the lemmatized ASR output. If
                                                                  a relevant keyword is recognized, the total score for each
2.2.1. Slovak ASR                                                 animal with such a keyword is increased. Commonly, a
  Slovak ASR was implemented based on the Kaldi toolkit for       keyword is applicable to several animals, e.g. ‘big teeth’
speech recognition [7]. We used a generic language model          can describe both the tiger and the crocodile, but some
attributes are assumed to be unique to a single animal, e.g.     2.3. Procedure
‘meow’ should be used only with the cat. After pre-testing
                                                                    If a passer-by was interested in interacting with FURHAT
we have also included special keywords such as reptile,
                                                                 and FURHAT was not engaged with another player, the
mammal, amphibian, bird, insect, carnivore, four-legged,
                                                                 experimenter seated the subject in front of FURHAT and
two-legged, bark, meow, wing, fly, ride, etc. These form
                                                                 fitted him/her with a head-mounted microphone. After
a subset of the keywords such that the animals including
                                                                 briefly explaining the rationale of the game, the
these keywords gain an extra point, whereas the animals
                                                                 experimenter started the external recorder and prompted
without these keywords will have one point deducted from
                                                                 the subject to test the game with the experimenter. The goal
their score.
                                                                 was not only for the subject to try out and get comfortable
   The progress from the start to FURHAT’s guess is
                                                                 with giving cues about animals, but also to obtain the
controlled with several internal parameters. First, the
                                                                 baseline of the user’s speaking behavior. For later
player must use a minimal number of words. If this limit is
                                                                 experiments, we plan to use this baseline in testing how the
not reached and FURHAT detects silence, it prompts the
                                                                 speaking behavior changes when interacting with an
player with general cues like ‘Tell me something more’ or
                                                                 experimenter and with FURHAT.
‘What else?’. Second, FURHAT has to ask a minimal number
                                                                    After this trial run, the experimenter explained that the
of questions. For these questions, FURHAT
                                                                 subject will play the game three times with FURHAT, that
determines the animal with the highest score of the
                                                                 the experimenter will not participate in the game, and
keywords as the most probable guess, checks the attributes
                                                                 prompted the subject to use full sentences and rich input for
for which it detected no keywords, and asks randomly
                                                                 FURHAT. Importantly, a small gift was offered as
about one of these attributes. For example, ‘Tell me how
                                                                 motivation if FURHAT was successful at guessing the animals.
many legs this animal has’ or ‘Where does this animal
live?’. Third, FURHAT tries to achieve a minimal difference
between the top animal with the highest score and the next
best one. The interaction of these three parameters keeps
the game to a manageable length, and guarantees dialogue-
like turn-exchanges and sufficient input from the player.
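
   The keyword scoring and the three internal parameters described
above can be sketched as follows. This is only an illustration under
stated assumptions, not the implementation used in the game: the
miniature keyword table, the set of special keywords, the threshold
values, the function names, and the plain conjunction of the three
conditions are all hypothetical, and the deduction rule follows one
possible reading of the text.

```python
# Hypothetical miniature keyword table; the real game used 12 animals
# and seven attribute categories (Table 1 shows the entry for 'tiger').
ANIMALS = {
    "tiger": {"legs": {"four"}, "food": {"meat", "carnivore"},
              "species": {"mammal"}, "look": {"stripe", "orange"}},
    "cat": {"legs": {"four"}, "food": {"meat", "mouse"},
            "species": {"mammal"}, "characteristics": {"meow"}},
    "crocodile": {"legs": {"four"}, "food": {"meat", "carnivore"},
                  "species": {"reptile"}, "look": {"green"}},
}
# Assumed subset of special keywords: animals lacking them lose a point.
SPECIAL = {"mammal", "reptile", "meow", "carnivore"}


def score(lemmas):
    """Return per-animal scores for a list of lemmatized ASR tokens."""
    scores = dict.fromkeys(ANIMALS, 0)
    for lemma in lemmas:
        for animal, attributes in ANIMALS.items():
            if any(lemma in keywords for keywords in attributes.values()):
                scores[animal] += 1      # keyword applies to this animal
            elif lemma in SPECIAL:
                scores[animal] -= 1      # special keyword does not apply
    return scores


def open_attributes(lemmas):
    """Attribute categories for which no keyword has been detected yet."""
    heard = set(lemmas)
    categories = {attr for attrs in ANIMALS.values() for attr in attrs}
    covered = {attr for attrs in ANIMALS.values()
               for attr, keywords in attrs.items() if keywords & heard}
    return sorted(categories - covered)


def ready_to_guess(scores, words_heard, questions_asked,
                   min_words=10, min_questions=2, min_margin=2):
    """Check the three internal parameters (threshold values are assumed)."""
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    return (words_heard >= min_words
            and questions_asked >= min_questions
            and margin >= min_margin)
```

With the lemmas four, meat, stripe, mammal, this sketch scores the tiger
4, the cat 3 and the crocodile 1. A follow-up question would then be
drawn at random from the categories returned by open_attributes.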
   FURHAT then proceeds to the guess, which is either
unique, if a single animal reaches the highest score, or
includes alternatives if multiple animals reach the top
score. FURHAT checks with the player if its guess was
correct and, based on the ASR of the response, updates the
count of the correct responses.
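
   The guessing step can be illustrated with a short sketch. The
function names, the English wording of the guess, and the list of
confirmation words are assumptions made for illustration; the game
itself ran in Slovak and its actual prompts are not given here.

```python
import re


def make_guess(scores):
    """Return every animal sharing the top score (one item if unique)."""
    top = max(scores.values())
    return sorted(animal for animal, s in scores.items() if s == top)


def phrase_guess(candidates):
    """Build a guess that lists alternatives when the top score is tied."""
    return "Is it the " + " or the ".join(candidates) + "?"


def update_correct_count(count, recognized_reply, yes_words=("áno", "yes")):
    """Increment the count if the recognized reply contains a confirmation.

    The confirmation words are hypothetical placeholders.
    """
    tokens = re.findall(r"\w+", recognized_reply.lower())
    return count + 1 if any(token in yes_words for token in tokens) else count
```

A tie between two animals thus yields a single question naming both
alternatives, and the count changes only when a confirmation word is
found in the recognized reply.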
   In addition to the internal parameters, two external
parameters were also implemented. First, to test the effect
of positive social and emotional engagement between the
player and the robot on the subsequent spoken interaction,           Fig. 2. Playing the game "Guess my animal" at the open-air
we designed introductory small-talk. Prior to playing the              science festival “Weekend with the Slovak Academy of
game, the user can interact with FURHAT in a short mini-                                     Sciences”.
dialogue. It consists of (1) FURHAT’s introduction and the
prompt for the user’s name, commenting on the beauty of             Before each game the subject selected a card with the
the name, and if the name is correctly recognized,               target animal from a list of cards offered by the
addressing the user with this name throughout the game,          experimenter.
(2) FURHAT’s asking about the user’s birthplace, using it in        After completing the three rounds of the game, the
its response if correctly recognized, and commenting             experimenter asked the subject to fill out a brief post-test
positively about the place, and (3) asking about the recent ice- questionnaire primarily assessing two domains on a Likert
hockey world championship, the preferred team, and again         scale from 1 (positive) to 5 (negative): FURHAT’s speaking
expressing a positive comment regarding that team. We            behavior (natural – not natural) and FURHAT’s abilities
hypothesized that this small talk establishes positive           (smart – dumb).
emotion and social rapport of the user toward FURHAT since
it expressed positive comments about the user’s name,            3.      Results and observations
birthplace, and favorite team. The interaction could thus be        We start with the only quantifiable result available at the
initiated with this small-talk or without it directly            moment regarding the effect of small talk on the
proceeding to the game.                                          interactions and continue with informal observations
   Second, to enable testing the effect of backchannel use       regarding the functionality of the proposed set-up.
on the naturalness of the interaction, we selected four             Our logs show that 73 subjects completed
instances of backchannel ‘mhm’ from the corpus used to           the full three rounds of the game and we have 61 completed
train the female TTS voice (section 2.2.2). If a short silence   post-test questionnaires. We calculated the total word and
was detected in the running ASR, FURHAT produced one of          syllable count in all the recognized speech of the subjects
these backchannels randomly and simultaneously nodded            from the logs. We take this as a proxy measure for users’
its head. We hypothesized that this behavior would naturally     engagement in the interaction with FURHAT. Additionally,
prompt for more input by the user and would be                   we have the subjective evaluations from the post-test
unobtrusive even if resulting in simultaneous speech of the      questionnaire.
user and FURHAT. In the current setup, this parameter                Of the 73 complete logs, 45 subjects started with the
was always set to be active.                                     small-talk and 28 proceeded directly to the game. A Welch
two-sample t-test showed that interactions with small-talk       •    The temporal management of the turn-taking should be
included significantly more speech from the player than               improved as this was the only feature commented on
those without it; t[69.95] = -2.93; p = 0.0046 for syllables          by the subjects sometimes in negative terms. The main
and t[69.37] = -2.65; p = 0.01 for words. This is shown in            reason for this deficiency was that sometimes
Fig. 3 and suggests that small-talk has a positive effect on          FURHAT’s speech served as the input to the ASR, which
players’ engagement with FURHAT in this implicit measure.             resulted in asynchrony between the subject and
   Of the 61 post-test questionnaires, 43 subjects played             FURHAT. This stems from the absence of information
with small-talk. We observed a tendency for them to rate              regarding the end of the TTS utterance sent to FURHAT.
FURHAT’s speech as slightly less natural                              We will implement voice activity detection and
than the subjects without small-talk (t[30.4] = -1.77, p =            experiment with FURHAT’s broker functionality to
0.087). It might be that small-talk has different effects on the      address this issue.
explicit perceived naturalness of FURHAT’s speech and the        •    The audio output from the speakers was more natural
implicit measure of engagement. When people establish                 than using headphones but in high noise people
greater social rapport with FURHAT in small-talk, it                  sometimes did not hear FURHAT very well. The use of
increases their expectation of the naturalness of its                 a more powerful speaker system for open-air events is
speaking behavior, which is not always met in the current             advisable.
implementation. This speculation is supported by the             •    The recorded baseline when subjects interacted with
effect of small-talk on the perceived                                 the experimenter is not usable in the current setup due
abilities of FURHAT. Subjects with small-talk preceding the           to the high noise level of the environment and the difference
game rated FURHAT as significantly less capable and smart             between the game-external and game-internal speech
than those without small-talk (t[43.9] = -2.63, p = 0.012).           recording. Adjustments should be made either to use
                                                                      a directional microphone or to include the pre-test baseline
                                                                      interaction in the game to unify the recording of
                                                                      the entire session.
                                                                 •    The implementation of backchannels and
                                                                      conversational fillers was well received but some
                                                                      deployments were not natural, particularly
                                                                      backchannels after questions from the subjects or after
                                                                      subjects’ utterances such as ‘I don’t know’.
                                                                 •    In this initial setup, the three renditions of the game
                                                                      played in succession did not vary basic utterances of
                                                                      FURHAT, especially those initiating and concluding the
                                                                      game. For greater naturalness we will vary these
                                                                      utterances to limit the perception of ‘mechanical’
                                                                      speech by FURHAT.
                                                                 •    Several people also commented on the confusing
                                                                      biological sex of FURHAT. TTS was a female voice, the
                                                                      facial features were gender-neutral, but in Slovak
                                                                      morphology the name ‘FURHAT’ is associated with the
                                                                      male gender and in some of FURHAT’s utterances male
                                                                      gender in self-address was also used. We will
                                                                      implement interactions with a consistent male or
                                                                      female persona for FURHAT in voice quality, visual
      Fig. 3. The effect of small-talk on the number of words         representation of the face, and speaking behavior.
                     and syllables per subject
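
   For readers unfamiliar with the test, the bracketed values in the
reported statistics (e.g. t[69.95]) are the fractional degrees of
freedom that Welch's test produces when the two groups may have unequal
variances. The sketch below shows how the statistic and the degrees of
freedom are computed. The function name is hypothetical and the two
samples in the usage note are toy numbers, not data from this study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two group means (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df
```

For the toy samples [1, 2, 3, 4, 5] and [2, 4, 6, 8, 10, 12] this gives
t of about -2.38 with about 6.97 degrees of freedom. The p-value
additionally requires the cumulative distribution of Student's t, which
a statistics package provides.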

   The experience with a 2-day deployment of FURHAT with         Acknowledgment
the ‘Guess my animal’ game at an open-air science festival
yielded the following observations. First, we discuss the          This work was funded by the Slovak Scientific Grant
limitations that we wanted to address. The implementation        Agency VEGA “Automatic assessment of acute stress from
of the game with FURHAT was perceived very positively            speech”, grant No. 2/0161/18.
and the interaction as quite natural. This was the case both     References
in the questionnaires (mean perceived speech naturalness
1.87 and abilities 1.71 on a scale between 1 and 5), and         1.   R. Levitan, Š. Beňuš, R. H. Gálvez, A. Gravano, F.
informal reactions of the subjects. The implementation of             Savoretti, M. Trnka, A. Weise, and J. Hirschberg,
small-talk proved to affect engagement of people and their            “Implementing acoustic-prosodic entrainment in a
evaluation of the interaction. The speaking behavior with             conversational avatar,” in Proc. of Interspeech 2016,
                                                                      pp. 1166–1170.
backchanneling and conversational fillers was also perceived     2.   Š. Beňuš, M. Trnka, E. Kuric, L. Matrák, A. Gravano
as very natural. Finally, the fact that the game was                  and J. Hirschberg, “Prosodic entrainment and trust in
functional in such adverse noise conditions (speech from              human-computer interaction,” in Proc. of 9th
passers-by, loud music, loud presentations for large                  International Conference on Speech Prosody, pp.
audiences) is encouraging for future implementations of               220-224, 2018.
FURHAT with the general public.                                  3.   Š. Beňuš, M. Patacchiola, M. Trnka, D. Zanatto, R.
   Additional observations relevant for future work include:          Sabo, A. Cangelosi, “Do people trust robots whose
                                                                      prosody synchronizes with the user?” in Šašinka, Č.,
    Strnadová, A., Šmideková, Z., Juřík, V. (eds.). Kognice
    a umělý život, sborník příspěvků [Cognition and
    Artificial Life, conference proceedings], pp.
    9-10, Brno: Flow, 2018.
4. M. Lohani, C. Stokes, M. McCoy, C. A. Bailey, and S.
    E. Rivers, “Social interaction moderates human-robot
    trust-reliance relationship and improves stress
    coping,” 11th ACM/IEEE International
    Conference on Human-Robot Interaction (HRI), pp.
    471-472, 2016.
5. S. Al Moubayed, J. Beskow, G. Skantze, B. Granström,
    “Furhat: A Back-Projected Human-Like Robot Head
    for Multiparty Human-Machine Interaction,” in
    Esposito A., Esposito A.M., Vinciarelli A., Hoffmann
    R., Müller V.C. (eds) Cognitive Behavioural Systems.
    Lecture Notes in Computer Science, vol 7403.
    Springer, Berlin, Heidelberg, 2012.
6. G. Skantze and S. Al Moubayed, “IrisTK: a statechart-
    based toolkit for multi-party face-to-face interaction,”
    in Proceedings of ICMI. Santa Monica, CA, 2012.
7. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O.
    Glembek, N. Goel, M. Hannemann, P. Motlicek, Y.
    Qian, P. Schwarz, et al. “The kaldi speech recognition
    toolkit” in Proc. ASRU, pp. 1–4, 2011.
8. M. Lojka, P. Viszlay, J. Staš, D. Hládek, J. Juhár,
    “Slovak Broadcast News Speech Recognition and
    Transcription System,” in: Barolli L., Kryvinska N.,
    Enokido T., Takizawa M. (eds) Advances in Network-
    Based Information Systems. NBiS 2018. Lecture Notes
    on Data Engineering and Communications
    Technologies, vol 22. Springer, Cham, 2019.
9. P. Viszlay, J. Staš, T. Koctúr, M. Lojka, J. Juhár, “An
    extension of the Slovak broadcast news corpus based
    on semi-automatic annotation,” in Proc. of the 10th
    edition of the Language Resources and Evaluation
    Conference (LREC), Portorož, Slovenia, pp.
    4684-4687, 2016.
10. M. Pleva and J. Juhar, “TUKE-BNews-SK: Slovak
    Broadcast News Corpus Construction and Evaluation,”
    in: Proc. of LREC2014, Reykjavik, Iceland, ELRA,
    2014, pp. 1709–1713.
11. M. Rusko, M. Trnka, S. Darjaa and J. Hamar, “The
    dramatic piece reader for the blind and visually
    impaired,” in Proceedings of SLPAT, pp. 83-91
    Grenoble, 2013.
12. M. Sulír, J. Juhár and M. Rusko, “Development of the
    Slovak HMM-based TTS system and evaluation of
    voices in respect to the used vocoding techniques,” in
    Computing and Informatics, vol. 35, no. 6, pp.
    1467-1490, 2016.
13. SPEECH API https://azure.microsoft.com/en-gb/
    resources/samples /cognitive-speech-tts/...
14. R. Garabík and M. Šimková, “Slovak Morphosyntactic
    Tagset,” in Journal of Language Modeling. Institute of
    Computer Science PAS, vol. 0, no. 1, pp. 41-63, 2012.

</pre>