A facial imitation framework for the simultaneous face control of a virtual avatar and a humanoid robot

Mattia Bruscia1, Graziano A. Manduzio1, Lorenzo Cominelli1 and Enzo Pasquale Scilingo1
1 University of Pisa, Pisa, Italy
m.bruscia@studenti.unipi.it (M. Bruscia); grazianoalfredo.manduzio@phd.unipi.it (G. A. Manduzio); lorenzo.cominelli@unipi.it (L. Cominelli); enzo.scilingo@unipi.it (E. P. Scilingo)

Abstract
Facial expression imitation (FEI) for humanoid robots is an active research field in the context of human-robot interaction (HRI). Virtual avatars can enhance and simplify the experimental HRI setup in terms of cost and performance, avoiding the long-term mechanical degradation of the physical robot in use. Moreover, the presented framework makes it possible to conduct comparison studies aimed at investigating the role of embodiment in the interaction with a robot versus its digital twin, a factor that is critical to establishing a successful social bond with the robot, as in the case of numerous clinical applications.

Keywords
Human-robot interaction, facial expression imitation, virtual avatar, Facial Action Coding System (FACS)

1. Introduction
In recent years, the advent of anthropomorphic social robots, increasingly similar in physical features to human beings, has led researchers and engineers to endow these robots with ever more advanced human-like abilities. Among these is the ability to recognize and mimic, in real time, the facial expressions of another human being. Automated facial expression recognition (FER) has been a research field in human-robot interaction for several years (see Li and Deng (2020) [1] and Canedo and Neves (2019) [2] for a survey). Research on facial expressions builds on the studies of Paul Ekman on action units (AUs), a set of anatomically based facial movements from which all other expressions can be composed, described in the FACS [3]. However, facial expression imitation (FEI) for humanoid robots is a younger field of research. For example, Breazeal et al. built a robot capable of learning how to imitate facial expressions from simple imitative games played with a human [4]. Wu et al. developed a system to make the robot face "Einstein" learn facial expression patterns by encoding a map from detected action units (AUs) to servo commands, solving an inverse kinematics problem [5]. Boucenna et al. developed a neural network model able to control a robot head and learn online to recognize the facial expressions of the human partner [6]. A similar approach was used by Meghdari et al. [7] and by Kobayashi and Hara [8]. Teaching a robot these skills is challenging: the learning methodologies employed frequently require substantial exertion from the robot's joints, potentially leading to a swift degradation in performance or even the breakage of servos. In this context, the use of virtual avatars can speed up the task learning process without imposing excessive mechanical strain on the robot's mechanisms. Furthermore, a virtual avatar is easier and cheaper to use than a physical robotic counterpart, opening up a range of possible application scenarios, not only in development but also in clinical contexts. Li et al. animated a virtual face using a 3D facial video of the user captured with a Kinect [9]. Rawal et al. introduced ExGenNet, a novel deep generative approach for facial expressions on humanoid robots, training the system on a simulator of the robot Alfie [10].
In this manuscript, we introduce a methodology for concurrently controlling the facial expressions of a virtual avatar [11, 12] and a sophisticated expressive robot [13], utilizing a state-of-the-art Action Units (AUs) detector with high detection accuracy [14]. The simultaneous control of a digital avatar and a highly expressive humanoid robot is a fundamental aspect of the study, as it provides the opportunity to assess the value of embodiment in the context of emotional communication between a human being and an artificial interlocutor.

2. Proposed work
As shown in Fig. 1, the proposed framework is composed of three main systems: the real-time acquisition of images from a camera, the analysis of the acquired images to obtain the Action Units (AUs) of the detected subject, and the transmission and execution of the corresponding facial movements on the avatar and on the physical robot.

In the acquisition phase, the Frame Grabber (FG, Algorithm 1), when executed, asks the user to input the desired frame acquisition rate r (r = 5 fps if not specified) and the name of the folder where the frames will be stored. The setupFolder() function is then called, creating both a main folder and a temporary one for storing the frames; if the folders already exist, the program informs the user. Next, the setupCamera() function is called, initializing the webcam and configuring its resolution (width w = 640 pixels, height h = 480 pixels) and frame rate. Finally, the getFrames() function starts capturing frames from the webcam. For each captured frame, a unique name is generated, including the frame number and a timestamp. The frame is then saved as an image in the temporary folder and copied to the main folder. This process continues until the user interrupts the program from the keyboard; when this occurs, the program disconnects the webcam, closes all OpenCV windows, and removes both the main and temporary folders. If an error occurs during frame capture (e.g., if the webcam is unavailable), the program notifies the user that it cannot initiate a new webcam recording.
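The listing below is a minimal Python sketch of this acquisition step, assuming OpenCV (cv2) for camera access; the function names, the default folder name, and the file-naming scheme mirror Algorithm 1 in illustrative form and are not the authors' exact implementation.

import os
import shutil
import time

import cv2  # OpenCV for webcam access


def setup_folder(folder_name="frame_folder"):
    """Create the main and temporary frame folders if they do not already exist."""
    tmp_folder = os.path.join(folder_name, "tmp")
    for path in (folder_name, tmp_folder):
        if os.path.isdir(path):
            print(f"Folder '{path}' already exists.")
        else:
            os.makedirs(path)
    return folder_name, tmp_folder


def setup_camera(r=5, camera_port=0, w=640, h=480):
    """Open the webcam, set its resolution, and return it with the inter-frame wait (ms)."""
    camera = cv2.VideoCapture(camera_port)
    camera.set(cv2.CAP_PROP_FRAME_WIDTH, w)
    camera.set(cv2.CAP_PROP_FRAME_HEIGHT, h)
    return camera, 1000.0 / r


def get_frames(camera, wait_fps, folder_path, tmp_folder):
    """Capture frames until a keyboard interrupt, saving each one with an index and timestamp."""
    i = 1
    try:
        while camera.isOpened():
            ok, frame = camera.read()
            if not ok:
                print("Cannot initiate a new webcam recording.")
                break
            name = f"frame_{i}_{int(time.time() * 1000)}.png"
            tmp_path = os.path.join(tmp_folder, name)
            cv2.imwrite(tmp_path, frame)           # save in the temporary folder
            shutil.copy(tmp_path, folder_path)     # copy to the main (monitored) folder
            i += 1
            time.sleep(wait_fps / 1000.0)          # enforce the requested frame rate
    except KeyboardInterrupt:
        pass
    finally:
        camera.release()
        cv2.destroyAllWindows()
        shutil.rmtree(folder_path, ignore_errors=True)  # remove main and temporary folders


if __name__ == "__main__":
    main_folder, tmp = setup_folder()
    cam, wait_ms = setup_camera(r=5)
    get_frames(cam, wait_ms, main_folder, tmp)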
The second program, the Event Handler (EH, Algorithm 2), is a file monitoring system that responds to the creation of new files by sending them to a server for processing and subsequently forwarding the results to another server. It uses the watchdog module to monitor a specified directory for new files. The program starts by defining the directory to monitor and waits in a loop until the specified directory exists. Once this condition is met, it begins monitoring the directory through the OnMyWatch class, which relies on the Observer class from the watchdog module. The OnMyWatch class has a run method that initiates the observation and waits for file system events. When a file system event is detected, the on_any_event() method of the Handler class is called; this method checks whether the event corresponds to the creation of a new file. If a new file is detected, its path is passed to the emaution() function, which sends the file in a binary representation I to a local server for processing and returns an XML response. The XML response is then forwarded to another server using the sendToAbel() and sendToAvatar() functions. This process continues until the user interrupts the program or an error occurs. A minimal sketch of this monitoring-and-forwarding step is given below.

Two Flask web applications [15], ReceiverAvatar (RAv, Algorithm 3) and ReceiverAbel (RAb, Algorithm 4), are structured as web services that receive the XML data, extract the AU values, and send them in the appropriate format to the avatar and to the robot using the sendAUsAvatar() and sendAUsAbel() functions. Using Flask applications gives the system high versatility and scalability: when a receiver gets a POST request on its write endpoint, it calls the write function, which extracts the data from the request, performs some formatting steps, sends the data to the avatar or to Abel through the corresponding sendAUs() function, and returns a response to the original request. Another Flask application, using the Emotiva API, is responsible for predicting the facial AUs and the estimated emotions from a single image I. Emotiva is a Facial Expression Recognition (FER) software able to analyze human attentive and affective states [14]. A POST call is made each time the event handler detects the capture of a frame in the specified folder.

The virtual avatar used in this framework is based on the OpenFACS project, an open-source FACS-based 3D face animation system [11, 12]. OpenFACS enables the simulation of realistic facial expressions by manipulating specific AUs as defined in the FACS, and it includes an API suitable for generating real-time dynamic facial expressions for a three-dimensional character. It can be easily integrated into existing systems without requiring prior experience in computer graphics.
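As a reference for the monitoring step, the following Python sketch collapses the OnMyWatch/Handler structure of the Event Handler into a single script, assuming the watchdog and requests packages; the endpoint URLs and the analysis route are illustrative placeholders, not the actual services used in the framework.

import os
import time

import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = "frame_folder"                      # directory written by the Frame Grabber
ANALYSIS_URL = "http://localhost:5000/analyze"  # placeholder for the local AU-detection service
RECEIVER_URLS = [
    "http://localhost:5001/write",  # placeholder avatar receiver endpoint
    "http://localhost:5002/write",  # placeholder Abel receiver endpoint
]


class Handler(FileSystemEventHandler):
    def on_any_event(self, event):
        # React only to the creation of a new (non-directory) file.
        if event.is_directory or event.event_type != "created":
            return
        self.emaution(event.src_path)

    def emaution(self, path):
        """Send the new frame to the analysis service and forward the XML response."""
        with open(path, "rb") as f:
            xml_response = requests.post(ANALYSIS_URL, files={"image": f}).text
        for url in RECEIVER_URLS:  # sendToAvatar() / sendToAbel() collapsed into one loop
            requests.post(url, data=xml_response,
                          headers={"Content-Type": "application/xml"})


if __name__ == "__main__":
    while not os.path.isdir(WATCH_DIR):  # wait until the monitored directory exists
        time.sleep(1)
    observer = Observer()
    observer.schedule(Handler(), WATCH_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()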
Algorithm 1 Frame Grabber
function setupFolder(folder_name)
    if folder_name is not specified then
        folder_name ← '/correct/path/to/frame_folder'
    if folder_path does not exist then
        create folder_path
    return folder_path

function setupCamera(r)
    if r is not specified then
        r ← 5
    camera_port ← 0
    w ← 640
    h ← 480
    wait_FPS ← 1000 / r
    camera ← initialize camera at camera_port
    open camera
    set frame dimensions of camera to w and h
    return camera, wait_FPS

function getFrames(camera, wait_FPS, folder_path)
    i ← 1
    try (until interruption from the user):
        while camera ready to record do
            I ← capture frame from camera
            frame_path ← folder_path + 'frame_' + i
            save I in frame_path
            i ← i + 1
            do nothing for wait_FPS / 1000
    except KeyboardInterrupt:
        close camera
        delete folder_path

Algorithm 2 Image sender
function sendAUs(s)
    v ← content of the POST request to '/AUs_write_port'
    return response

function detectNewImage(event)
    if new image in the folder then
        send v to the Flask receiver server

Algorithm 3 Avatar's receiver
initialize Flask instance
function write('/write_port', method='POST')
    s ← content of the POST request to '/AUs_extraction_port'
    v ← extract list from s
    send v and movement speed to the avatar
    return 'Data received'

Algorithm 4 Abel's receiver
initialize Flask instance
function write('/write_port', method='POST')
    s ← content of the POST request to '/AUs_extraction_port'
    v ← extract list from s
    send v and movement speed to Abel
    return 'Data received'

Figure 1: Schematic of the proposed framework (blocks: Frame Grabber / Image Sender, frame_folder, Watchdog Event Handler, Emotiva API, Avatar Receiver and Avatar, Abel Receiver and Abel).
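The following is a minimal Flask rendering of the receivers described in Algorithms 3 and 4, assuming the AU values arrive as a comma-separated list inside the forwarded XML; the route, port, XML schema, default movement speed, and the send_aus() stub are illustrative, not the exact interface of the avatar or of Abel.

import xml.etree.ElementTree as ET

from flask import Flask, request

app = Flask(__name__)

MOVEMENT_SPEED = 0.5  # illustrative default speed for the facial movement


def send_aus(aus, speed):
    """Stub standing in for sendAUsAvatar()/sendAUsAbel(): forward the AU list to the agent."""
    print(f"Sending AUs {aus} with speed {speed}")


@app.route("/write", methods=["POST"])
def write():
    # Extract the AU values from the forwarded XML payload, e.g.
    # <response><AUs>1.0,0.2,...</AUs></response> (illustrative schema).
    root = ET.fromstring(request.data)
    aus = [float(x) for x in root.findtext("AUs", default="").split(",") if x]
    send_aus(aus, MOVEMENT_SPEED)
    return "Data received"


if __name__ == "__main__":
    app.run(port=5001)  # the Abel receiver would run the same app on a different port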
3. Experimental results
The proposed framework was developed in Python on a PC platform. We ran the application with a frame rate r = 5 fps using the integrated webcam, acquiring images of size 640 × 480 pixels. The AUs used for the experiment are listed in Table 1. In our tests, the facial expressions assumed by the digital avatar as well as by the robot successfully followed the AUs extracted by the Emotiva API and, although the data were noisy and acquired with non-specialized equipment, it was possible to effectively control the movement of both the virtual and physical agents simply by changing the user's facial expression. This is shown in Fig. 2c and 2d for the avatar, and in Fig. 2e and 2f for the robot. The detected landmarks are represented as yellow dots superimposed on the two images of the subjects (Fig. 2a and 2b), and a rectangle identifies the faces present in the field of view. The values of the AUs in both cases are displayed in Figures 2g and 2h. To improve the quality of control, we plan to replace the PC camera with a higher-resolution Kinect camera directly connected to Abel for image acquisition, and to increase the number of AUs involved in the agent control. To set up the experiment accurately it is necessary to be in a very bright environment, preferably under direct light, to increase the contrast of the acquired image. It is also advisable to choose a sufficiently high frame rate to achieve real-time control of the robot and the avatar: selecting values that are too low can lead to latency issues in the avatar system.

Additionally, during the experiment, the subject's face should not exit the camera's field of view or rotate more than about 30° from the central position, since processing a partial face is not supported. If this happens, the results are deemed unreliable and the user receives an alert message. If multiple subjects are detected in the field of view, processing is performed for each visible face. In the context of the presented framework, this could generate conflicts in the control of the avatar and the robot, since no decision-making algorithm is included. This problem is solved by integrating the proposed framework with the high-level cognitive processing (i.e., the Plan block of Abel's control architecture [13]) that enables the artificial agents to focus their attention on a specific subject according to specific attention rules [16].

Figure 2: Examples of analyzed expressions of different users and the related control of the facial mimicry of the digital avatar and the Abel robot. Panels: (a) happy expression; (b) sad expression; (c) happy expression, avatar face; (d) sad expression, avatar face; (e) happy expression, Abel face; (f) sad expression, Abel face; (g) happy expression, AU and emotion values; (h) sad expression, AU and emotion values.

Table 1: AUs used in the proposed framework.
AU index   AU description
1          Inner brow raiser
2          Outer brow raiser
4          Brow lowerer
5          Upper lid raiser
6          Cheek raiser
9          Nose wrinkler
10         Upper lip raiser
12         Lip corner puller
15         Lip corner depressor
17         Chin raiser
18         Lip puckerer
20         Lip stretcher
24         Lip pressor
25         Lips part
26         Jaw drop
28         Lip suck
43         Eyes closed

4. Conclusions and future developments
The use of a digital avatar introduces several advantages, such as simplifying certain phases of development and testing that would normally involve the corresponding physical robot, helping to prevent the inevitable wear and tear of the robot's electronic and mechanical components, and offering high scalability and affordability. On the other hand, we are aware of the importance and the influence of a social robot's corporeality in HRI, especially in several clinical applications (e.g., [17, 18, 19, 20]). The presented architecture allows exploratory interaction studies in which it will be possible to compare two systems whose perception and information processing remain identical, the only differing variable being the representation and embodiment of the artificial agent. These studies will lead to a methodological evaluation using standard scales (e.g., the Godspeed questionnaires [21]) comparing the cases of Abel and the digital avatar. Moreover, the degrees of freedom of the avatar are currently limited: it cannot make asymmetric expressions, and it cannot move any part of the body other than the face. To address these limitations, the next steps of the project will also be to modify the code of the avatar at a lower level, separating the right and left parts of its face, and to build the graphical components and the control of other expressive body parts, such as the neck, arms and hands.

Acknowledgments
Thanks to the developers of Emotiva https://emotiva.it/ and openFACS https://github.com/phuselab/openFACS. Research partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI", funded by the European Commission under the NextGeneration EU programme.

References
[1] S. Li, W. Deng, Deep facial expression recognition: A survey, IEEE Transactions on Affective Computing 13 (2020) 1195–1215.
[2] D. Canedo, A. J. Neves, Facial expression recognition using computer vision: A systematic review, Applied Sciences 9 (2019) 4678.
[3] P. Ekman, W. V. Friesen, Facial action coding system, Environmental Psychology & Nonverbal Behavior (1978).
[4] C. Breazeal, D. Buchsbaum, J. Gray, D. Gatenby, B. Blumberg, Learning from and about others: Towards using imitation to bootstrap the social understanding of others by robots, Artificial Life 11 (2005) 31–62.
[5] T. Wu, N. J. Butko, P. Ruvulo, M. S. Bartlett, J. R. Movellan, Learning to make facial expressions, in: 2009 IEEE 8th International Conference on Development and Learning, 2009, pp. 1–6. doi:10.1109/DEVLRN.2009.5175536.
[6] S. Boucenna, P. Gaussier, P. Andry, L. Hafemeister, A robot learns the facial expressions recognition and face/non-face discrimination through an imitation game, International Journal of Social Robotics 6 (2014) 633–652.
[7] A. Meghdari, S. B. Shouraki, A. Siamy, A. Shariati, The real-time facial imitation by a social humanoid robot, in: 2016 4th International Conference on Robotics and Mechatronics (ICROM), IEEE, 2016, pp. 524–529.
[8] H. Kobayashi, F. Hara, Facial interaction between animated 3D face robot and human beings, in: 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, volume 4, 1997, pp. 3732–3737. doi:10.1109/ICSMC.1997.633250.
[9] D. Li, C. Sun, F. Hu, D. Zang, L. Wang, M. Zhang, Real-time performance-driven facial animation with 3ds Max and Kinect, in: 2013 3rd International Conference on Consumer Electronics, Communications and Networks, 2013, pp. 473–476. doi:10.1109/CECNet.2013.6703372.
[10] N. Rawal, D. Koert, C. Turan, K. Kersting, J. Peters, R. Stock-Homburg, ExGenNet: Learning to generate robotic facial expression using facial expression recognition, Frontiers in Robotics and AI 8 (2022) 730317.
[11] V. Cuculo, A. D'Amelio, OpenFACS: An open source FACS-based 3D face animation system, in: Y. Zhao, N. Barnes, B. Chen, R. Westermann, X. Kong, C. Lin (Eds.), Image and Graphics, Springer International Publishing, Cham, 2019, pp. 232–242.
[12] openFACS, 2023. URL: https://github.com/phuselab/openFACS.
[13] L. Cominelli, G. Hoegen, D. De Rossi, Abel: Integrating humanoid body, emotions, and time perception to investigate social interaction and human cognition, Applied Sciences 11 (2021) 1070.
[14] Emotiva, 2023. URL: https://emotiva.it/.
[15] Flask, 2023. URL: https://flask.palletsprojects.com/en/2.3.x/.
[16] L. Cominelli, D. Mazzei, D. E. De Rossi, SEAI: Social emotional artificial intelligence based on Damasio's theory of mind, Frontiers in Robotics and AI 5 (2018) 6.
[17] S. Shamsuddin, H. Yussof, L. Ismail, F. A. Hanapiah, S. Mohamed, H. A. Piah, N. I. Zahari, Initial response of autistic children in human-robot interaction therapy with humanoid robot NAO, in: 2012 IEEE 8th International Colloquium on Signal Processing and its Applications, 2012, pp. 188–193. doi:10.1109/CSPA.2012.6194716.
[18] S. Shamsuddin, H. Yussof, L. I. Ismail, S. Mohamed, F. A. Hanapiah, N. I. Zahari, Initial response in HRI - a case study on evaluation of child with autism spectrum disorders interacting with a humanoid robot NAO, Procedia Engineering 41 (2012) 1448–1455.
[19] A. Tapus, A. Peca, A. Aly, C. Pop, L. Jisa, S. Pintea, A. S. Rusu, D. O. David, Children with autism social engagement in interaction with NAO, an imitative robot: A series of single case experiments, Interaction Studies 13 (2012) 315–347.
[20] L. J. Wood, A. Zaraki, B. Robins, K. Dautenhahn, Developing Kaspar: A humanoid robot for children with autism, International Journal of Social Robotics 13 (2021) 491–508.
[21] C. Bartneck, E. Croft, D. Kulic, Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots, International Journal of Social Robotics 1 (2009) 71–81. doi:10.1007/s12369-008-0001-3.