=Paper= {{Paper |id=Vol-1315/paper12 |storemode=property |title=Romeo2 Project: Humanoid Robot Assistant and Companion for Everyday Life: I. Situation Assessment for Social Intelligence |pdfUrl=https://ceur-ws.org/Vol-1315/paper12.pdf |volume=Vol-1315 |dblpUrl=https://dblp.org/rec/conf/aic/PandeyGAVBMCDTFGMKLM14 }} ==Romeo2 Project: Humanoid Robot Assistant and Companion for Everyday Life: I. Situation Assessment for Social Intelligence== https://ceur-ws.org/Vol-1315/paper12.pdf
Romeo2 Project: Humanoid Robot Assistant and
          Companion for Everyday Life:
 I. Situation Assessment for Social Intelligence 1

 Amit Kumar Pandey1 , Rodolphe Gelin1 , Rachid Alami2 , Renaud Viry2 , Axel
Buendia3 , Roland Meertens3 , Mohamed Chetouani4 , Laurence Devillers5 , Marie
  Tahon5 , David Filliat6 , Yves Grenier7 , Mounira Maazaoui7 , Abderrahmane
            Kheddar8 , Frédéric Lerasle2 , and Laurent Fitte Duval2
    1
        Aldebaran, A-Lab, France, akpandey@aldebaran.com; rgelin@aldebaran.com
            2
           CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France;
   rachid.alami@laas.fr; frederic.lerasle@laas.fr; renaud.viry@laas.fr; lfittedu@laas.fr
3
  Spirops/CNAM (CEDRIC), Paris; axel.buendia@cnam.fr; rolandmeertens@gmail.com
                   4
                     ISIR, UPMC, France; mohamed.chetouani@upmc.fr
    5
      LIMSI-CNRS University Paris-Sorbonne; devil@limsi.fr; marie.tahon@limsi.fr
        6
          ENSTA ParisTech - INRIA FLOWERS; david.filliat@ensta-paristech.fr
                7
                  Inst. Mines-Télécom; Télécom ParisTech; CNRS LTCI;
            yves.grenier@telecom-paristech.fr; maazaoui@telecom-paristech.fr
                      8
                        CNRS-UM2 LIRMM IDH; kheddar@gmail.com

          Abstract. For a socially intelligent robot, different levels of situation as-
          sessment are required, ranging from basic processing of sensor input to
          high-level analysis of semantics and intention. However, the attempt to
          combine them all prompts new research challenges and the need of a co-
          herent framework and architecture.
          This paper presents the situation assessment aspect of Romeo2, a unique
          project aiming to bring multi-modal and multi-layered perception on a
          single system and targeting for a unified theoretical and functional frame-
          work for a robot companion for everyday life. It also discusses some of the
          innovation potentials, which the combination of these various perception
          abilities adds into the robot’s socio-cognitive capabilities.
          Keywords: Situation Assessment, Socially Intelligent Robot, Human Robot
          Interaction, Robot Companion

1        Introduction
As robots started to co-exist in a human-centered environment, the human aware-
ness capabilities must be considered. With safety being a basic requirement, such
robots should be able to behave in a socially accepted and expected manner. This
requires robots to reason about the situation, not only from the perspective of
physical locations of objects, but also from that of ‘mental’ and ‘physical’ states
of the human partner. Further, such reasoning should build knowledge with the
human understandable attributes, to facilitate natural human-robot interaction.
     The Romeo2 project (website 1 ), the focus of this paper, is unique in that it
brings together different perception components in a unified framework for real-
life personal assistant and companion robot in an everyday scenario. This paper
outlines our perception architecture, the categorization of basic requirements, the
key elements to perceive, and the innovation advantages such a system provides.
1
    This work is funded by Romeo2 project, (http://www.projetromeo.com/), BPIFrance
    in the framework of the Structuring Projects of Competitiveness Clusters (PSPC)




                                    Page 140 of 171
Fig. 1 shows the Romeo robot and its sensors. It
is a 40kg and 1.4m tall humanoid robot with 41
degrees-of-freedom, vertebral column, exoskele-
ton on legs, partially soft torso and mobile eyes.
1.1   An Example Scenario
Mr. Smith lives alone (with his Romeo robot
companion). He is elderly and visually im-
paired. Romeo understands his speech, emotion
and gestures, assists him in his daily life. It pro-
vides physical support by bringing the ‘desired’
items, and cognitive support by reminding about
medicine, items to add in to-buy list, playing
memory games, etc. It monitors Mr. Smith’s Fig. 1. Romeo robot and sensors.
activities and calls for assistance if abnormalities are detected in his behaviors. As
a social inhabitant, it plays with Mr. Smith’s grandchildren visiting him.
    This outlined partial tar-
get scenario of Romeo2 project
(also illustrated in fig. 2), de-
picts that being aware about
human, his/her activities, the
environment and the situation
are the key aspects towards
practical achievement of the
                                   Fig. 2. Romeo2 Project scenario: A Humanoid Robot
project’s objective.
                                  Assistant and Companion for Everyday Life.
1.2 Related Works and the main Contributions
Situation awareness is the ability to perceive and abstract information from
the environment [2]. It is an important aspect of day-to-day interaction, decision-
making, and planning, so as important is the domain-based identification of the
elements and attributes, constituting the state of the environment. In this paper, we
will identify and present such elements from companion robot domain perspective,
sec. 2.2. Further, three levels of it have been identified (Endsley et al. [9]): Level
1 situation awareness: To perceive the state of the elements composing the
surrounding environment. Level 2 situation awareness: To build a goal oriented
understanding of the situation. Experience and comprehension of the meaning are
important. Level 3 situation awareness: To project on the future. Sec. 2.1 will
present our sense-interact perception loop and map these levels.
     Further, there have been efforts to develop integrated architecture to utilize
multiple components of situation assessment. However, most of them are spe-
cific for a particular task like navigating [21], intention detection [16], robot’s
self-perception [5], spatial and temporal situation assessment for robot passing
through a narrow passage [1], laser data based human-robot-location situation as-
sessment, e.g. human entering, coming closer, etc. [12]. Therefore, they are either
limited by the variety of perception attributes, sensors or restricted to a particular
perception-action scenario loop. On the other hand, various projects on Human
Robot Interaction try to overcome perception limitations by different means and
focus on high-level semantic and decision-making. Such as, the detection of objects
is simplified by putting tags/markers on the objects, in the detection of people no
audio information is used, [6], [14], etc. In [10], different layers of perception have




                               Page 141 of 171
Fig. 3. A generalized perception system for sense-interact in Romeo2 project, with five
layers functioning in a closed loop.

been analyzed to build representations of the 3D space, but focused on eye-hand
coordination for active perception and not on high-level semantics and perception
of the human.
    In the Romeo2 project, we are making effort to bring a range of multi-sensor
perception components within a unified framework (Naoqi, [18]), at the same time
making the entire multi-modal perception system independent from a very specific
scenario or task, and explicitly incorporating reasoning about human, towards
realizing effective and more natural multi-modal human robot interaction. In this
regard, to the best of our knowledge, Romeo2 project is the first effort of its kind
for a real world companion robot. In this paper, we do not provide the details of
each component. Instead, we give an overview of the entire situation assessment
system in Romeo2 project (sec. 2.1). Interested readers can find the details in
documentation of the system [18] and in dedicated publications for individual
components, such as [4], [11], [19], [15], [3], [24], [17], [23], etc. (see the complete
list of publications 1 ). Further, the combined effort to bring different components
together helps us to identify some of the innovation potentials and to develop
them, as discussed in section 3.

2    Perceiving Situation in Romeo2 Project
2.1 A Generalized Sense-Interact Perception Architecture for HRI
We have adapted a simple yet meaningful, sensing-interaction oriented perception
architecture, by carefully identifying various requirements and their interdepen-
dencies, as shown in fig. 3. The roles of the five identified layers are:
    (i) Sense: To receive signals/data from various sensors. Depending upon the
sensors and their fusion. This layer can build 3D point cloud world; sense stimuli
like touch, sound; know about the robot’s internal states such as joint, heat; record
speech signals; etc. Therefore, it belongs to level 1 of situation assessment.
    (ii) Cognize: Corresponds to the ’meaningful’ (human-understandable level)
and relevant information extraction, e.g. learning shapes of objects; learning to
extract the semantics from 3D point cloud, the meaningful words from speech, the
meaningful parameters in demonstration, etc. In most of the perception-action
systems, this cognize part is provided a priori to the system. However, in Romeo2
projects we are taking steps to make cognize layer more visible by bringing together
different learning modules, such as to learn objects, learn faces, learn the meaning
of instructions, learn to categorize emotions, etc. This layer lies across level 1 and
level 2 of situation assessment, as it is building knowledge in terms of attributes
and their values and also extracting some meaning for future use and interaction.
    (iii) Recognize: Dedicated to recognizing what has been ’cognized’ earlier by
the system, e.g. a place, face, word, meaning, emotion, etc. This mostly belongs
to level 2 of situation assessment, as it is more on utilizing the knowledge either
learned or provided a priori, hence ’experience’ becomes the dominating factor.




                                Page 142 of 171
 Table 1. Identification and Classification of the key situation assessment components




    (iv) Track: This layer corresponds to the requirement to track something
(sound, object, person, etc.) during the course of interaction. From this layer, level
3 of situation assessment begins, as tracking allows to update in time the state of
the beforehand entity (person, object, etc.), hence involves a kind of ’projection’.
    (v) Interact: This corresponds to the high-level perception requirements for
interaction with the human and the environment. E.g. activity, action and inten-
tion prediction, perspective taking, social signal and gaze analyses, semantic and
affordance prediction (e.g. pushable objects, sitable objects, etc.). It mainly be-
longs to level 3 of situation assessment, as involves ’predicting’ side of perception.
    Sometimes, practically there are some intermediate loops and bridges among
these layers, for example a kind of loop between tracking and recognition. Those
are not shown for the sake of making main idea of the architecture better visible.
    Note the closed loop aspect of the architecture from interaction to sense. As
shown in some preliminary examples in section 3, such as Ex1, we are able to practi-
cally achieve this, which is important to facilitate natural human-robot interaction
process, which can be viewed as:       Sense æ Build knowledge for interaction æ
Interact æ Decide what to sense æ Sense æ...
2.2   Basic Requirements, Key Attributes and Developments
In Romeo2 project, we have identified the key attributes and elements of situation
assessment, to be perceived from companion robotics domain perspective, and
categorized along five basic requirements as summarized in table 1. In this section,
we describe some of those modules. See Naoqi [18] for details of all the modules.
I. Perception of Human
People presence: Perceives presence of people, assign unique ID to each detected
person. Face characteristics: To predict age, gender and degree of smile on a
detected face. Posture characterization (human): To find position and ori-
entation of different body parts of the human, shoulder, hand, etc. Perspective
taking: To perceive reachable and visible places and objects from the human’s
perspective, with the level of effort required to see and reach. Emotion recogni-
tion: For basic emotions of anxiety, anger, sadness, joy, etc. based on multi-modal
audio-video signal analysis. Speaker localization: Localizes spatially the speak-
ing person. Speech rhythm analysis: Analyzing the characterization of speech
rhythm by using acoustic or prosodic anchoring, to extract social signals such as




                               Page 143 of 171
engagement, etc. User profile: To generate emotional and interactional profile
of the interacting user. Used to dynamically interpret the emotional behavior as
well as to build behavioral model of the individual over a longer period of time.
Intention analysis: To interpret the intention and desire of the user through con-
versation in order to provide context, and switch among different topics to talk.
The context also helps other perception components about what to perceive and
where to focus. Thus, facilitates closing the interaction-sense loop of fig. 3.
II. Perception of Robot Itself
Fall detection: To detect if the robot is falling and to take some human user and
self-protection measures with its arms before touching the ground.
Other modules in this category are self-descriptive. However it is worth to mention
that, such modules also provide symbolic level information, such as battery nearly
empty, getting charged, foot touching ground, symbolic posture sitting, standing,
standing in init pose, etc. All these help in achieving one of the aims of Romeo2
project: sensing for natural interaction with human.
III. Perception of Object
Object Tracker: It consists of different aspects of tracking, such as moving to
track, tracking a moving object and tracking while the robot is moving. Semantic
perception (object): Extracts high-level meaningful information, such as object
type (chair, table, etc.), categories and affordances (sitable, pushable, etc.)
IV. Perception of Environment
Darkness detection: Estimates based on the lighting conditions of the envi-
ronment around the robot. Semantic perception (place): Extracts meaningful
information from the environment about places and landmarks (a kitchen, corridor,
etc.), and builds topological maps.
V. Perception of Stimuli
Contact observer: To be aware of desired or non-desired contacts when they
occur, by interpreting information from various embedded sensors, such as ac-
celerometers, gyro, inclinometers, joints, IMU and motor torques’.
3   Results and Discussion on Innovation Potentials
We will not go in detail of the individual modules and the results, as those can be
found online [18]. Instead, we will discuss some of the advantages and innovation
potentials, which such modules functioning on a unified platform could bring.
    Ex1: The capa-
bility of multi-modal
perception, combining
input from the inter-
acting user, the events
triggered by other per-
ception components,
and the centralized
memorization mecha-
nism of robot, help
to achieve the goal of Fig. 4. Subset of interaction topics (right), and their dynamic
                         activation levels based on multi-modal perception and events.
closing the interact-
sense loop and dynamically shaping the interaction.




                               Page 144 of 171
                    (a)                             (b)                    (c)
Fig. 5. High-level situation assessment. (a) The semantics map of the environment. (b)
Effort and Perspective taking based situation assessment. (c) Combining (a) and (b), the
robot will be able to make the object accessible to the human.

    To demonstrate, we programmed an extensive dialogue with 26 topics that
shows the capabilities of the Romeo robot. During this dialogue the user often
interrupts Romeo to quickly ask a question, this leads to several ’conflicting’ topics
in the dialogue manager. The activation of different topics during an interaction
over a period is shown in fig. 4. The plot shows that around 136th second the user
has to take his medicine, but the situation assessment based memory indicates that
the user has ignored and not yet taken the medicine. Eventually, the system results
the robot urging the user to take his medication (pointed by blue arrow), making
it more important than the activity indicated by the user during the conversation
(to engage in reading a book, pointed by dotted arrow in dark green). Hence, a
close loop between the perception and interaction is getting achieved in a real
time, dynamic and interactive manner.
    Ex2: Fig. 5(a) shows situation assessment of the environment and objects at
the level of semantics and affordances, such as there is a ’table’ recognized at
position X, and this belongs to an affordance category on which something can
be put. Fig. 5(b) shows situation assessment by perspective taking, in terms of
abilities and effort of the human. This enables the robot to infer that the sitting
human (as shown in fig. 5(c)) will be required to stand up and lean forward to see
and take the object behind the box. Thanks to the combined reasoning of (a) and
(b), the robot will be able to make the object accessible to the human by placing it
on the table (knowing that something can be put on it), at a place reachable and
visible by the human with least effort (through the perspective taking mechanism),
as shown in fig. 5(c).
    In Romeo2 we also aim to use this combined reasoning about abilities and
efforts of agents, and affordances of the environment, for autonomous human-
level understanding of task semantics through interactive demonstration, for the
development of robot’s proactive behaviors, etc. as suggested the feasibility and
advantages in some of our complementary studies in those directions, [19], [20].
    Ex3: Analyzing verbal and non-verbal behav-
iors such as head direction (e.g. on-view or off-
view detection) [15], speech rhythm (e.g. on-talk
or self-talk) [22], laugh detection [8], emotion de-
tection [24], attention detection [23], and their dy-
namics (e.g. synchrony [7]), combined with acoustic
analysis (e.g. spectrum) and prosodic analysis alto-
gether greatly allows to improve social engagement
                                                           Fig. 6. Self-talk detection
characterization of the human during interaction.




                                Page 145 of 171
     To demonstrate, we collected a database
of human-robot interaction during sessions
of cognitive stimulation. The preliminary
result with 14 users shows that on a 7 level
evaluation scheme, the average scores for
questions, ”Did robot show any empathy?”,
”Was it nice to you?” and ”Was it polite?”
were 6.3, 6.2 and 6.4 respectively. In ad-
dition, the multi-modality combination of Fig. 7. Face, shoulder and face orienta-
the rhythmic, energy and pitch character- tion detection of two interacting people.
istics seems to be elevating the detection of
self-talk (known to reflect the cognitive load of the user, especially for elderly) as
shown in table of fig. 6.
     Ex4: Inferring face gaze (as illustrated in fig.
7), combined with sound localization and object
detection, altogether provides enhanced knowl-
edge about who might be speaking in a multi-
people human-robot interaction, and further fa-
cilitates analyzing the attention and intention.
To demonstrate this, we conducted an experi-
ment with two speakers, initially speaking at the
different sides of the robot and then slowly mov-
ing towards each other and eventually separate
away. Fig. 8 shows the preliminary result for the
                                                      Fig. 8. Sound source separation,
sound source separation by the system based on
                                                      only audio based (BF-SS) and
beamforming. The left part (BF-SS) shows when audio-video based (AVBF-SS).
only the audio signal is used. When the system
uses the visual information combined with the audio signals, the performance is
better (AVBF-SS) in all the three types of analyses: signal-to-interference ratio
(SIR), signal-to-distortion ratio(SDR) and signal-to-artifact (SAR) ratio.
     Ex5: The fusion of rich information about visual clues, audio speech rhythm,
lexical content and the user profile is also opening doors for automated context
extraction, helping for better interaction and emotion grounding and making the
interaction interesting, like doing humor [13].
4   Conclusion and Future Work
In this paper, we have provided an overview of the rich multi-modal perception
and situation assessment system within the scope of Romeo2 project. We have
presented our sensing-interaction perception architecture and identified the key
perception components requirements for companion robot. The main novelty lies
in the provision for rich reasoning about the human and practically closing the
sensing-interaction loop. We have pointed towards some of the work in progress
innovation potentials, achievable when different situation assessment components
are working on a unified theoretical and functional framework. It would be inter-
esting to see how it could serve as guideline in different context than companion
robot, such as robot co-worker.
References
 1. Beck, A., Risager, C., Andersen, N., Ravn, O.: Spacio-temporal situation assessment
    for mobile robots. In: Int. Conf. on Information Fusion (FUSION) (2011)




                               Page 146 of 171
 2. Bolstad, C.A.: Situation awareness: Does it change with age. vol. 45, pp. 272–276.
    Human Factors and Ergonomics Society (2001)
 3. Buendia, A., Devillers, L.: From informative cooperative dialogues to long-term social
    relation with a robot. In: Natural Interaction with Robots, Knowbots and Smart-
    phones, pp. 135–151 (2014)
 4. Caron, L.C., Song, Y., Filliat, D., Gepperth, A.: Neural network based 2d/3d fusion
    for robotic object recognition. In: Proc. European Symposium on Artificial Neural
    Networks, Computational Intelligence and Machine Learning (ESANN) (2014)
 5. Chella, A.: A robot architecture based on higher order perception loop. In: Brain
    Inspired Cognitive Systems 2008, pp. 267–283. Springer (2010)
 6. CHRIS-Project:         Cooperative        human      robot     interaction     systems.
    http://www.chrisfp7.eu/
 7. Delaherche, E., Chetouani, M., Mahdhaoui, A., Saint-Georges, C., Viaux, S., Co-
    hen, D.: Interpersonal synchrony: A survey of evaluation methods across disciplines.
    Affective Computing, IEEE Transactions on 3(3), 349–365 (July 2012)
 8. Devillers, L.Y., Soury, M.: A social interaction system for studying humor with the
    robot nao. In: ICMI. pp. 313–314 (2013)
 9. Endsley, M.R.: Toward a theory of situation awareness in dynamic systems. Human
    Factors: Journal of the Human Factors and Ergonomics Society 37(1), 32–64 (1995)
10. EYESHOTS-Project: Heterogeneous 3-d perception across visual fragments.
    http://www.eyeshots.it/
11. Filliat, D., Battesti, E., Bazeille, S., Duceux, G., Gepperth, A., Harrath, L., Jebari,
    I., Pereira, R., Tapus, A., Meyer, C., Ieng, S., Benosman, R., Cizeron, E., Mamanna,
    J.C., Pothier, B.: Rgbd object recognition and visual texture classification for indoor
    semantic mapping. In: Technologies for Practical Robot Applications (2012)
12. Jensen, B., Philippsen, R., Siegwart, R.: Narrative situation assessment for human-
    robot interaction. In: IEEE ICRA. vol. 1, pp. 1503–1508 vol.1 (Sept 2003)
13. JOKER-Project: Joke and empathy of a robot/eca: Towards social and affective
    relations with a robot. http://www.chistera.eu/projects/joker
14. Lallee, S., Lemaignan, S., Lenz, A., Melhuish, C., Natale, L., Skachek, S., van
    Der Zant, T., Warneken, F., Dominey, P.F.: Towards a platform-independent co-
    operative human-robot interaction system: I. perception. In: IEEE/RSJ IROS. pp.
    4444–4451 (Oct 2010)
15. Le Maitre, J., Chetouani, M.: Self-talk discrimination in human-robot interaction
    situations for supporting social awareness. J. of Social Robotics 5(2), 277–289 (2013)
16. Lee, S., Baek, S.M., Lee, J.: Cognitive robotic engine: Behavioral perception archi-
    tecture for human-robot interaction. In: Human Robot Interaction (2007)
17. Mekonnen, A.A., Lerasle, F., Herbulot, A., Briand, C.: People detection with hetero-
    geneous features and explicit optimization on computation time. In: ICPR (2014)
18. NAOqi-Documentation:                 https://community.aldebaran-robotics.com/doc/2-
    00/naoqi/index.html/
19. Pandey, A.K., Alami, R.: Towards human-level semantics understanding of human-
    centered object manipulation tasks for hri: Reasoning about effect, ability, effort and
    perspective taking. Int. J. of Social Robotics pp. 1–28 (2014)
20. Pandey, A.K., Ali, M., Alami, R.: Towards a task-aware proactive sociable robot
    based on multi-state perspective-taking. J. of Social Robotics 5(2), 215–236 (2013)
21. Pomerleau, D.A.: Neural network perception for mobile robot guidance. Tech. rep.,
    DTIC Document (1992)
22. Ringeval, F., Chetouani, M., Schuller, B.: Novel metrics of speech rhythm for the
    assessment of emotion. Interspeech pp. 2763–2766 (2012)
23. Sehili, M., Yang, F., Devillers, L.: Attention detection in elderly people-robot spoken
    interaction. In: ICMI WS on Multimodal Multiparty real-world HRI (2014)
24. Tahon, M., Delaborde, A., Devillers, L.: Real-life emotion detection from speech in
    human-robot interaction: Experiments across diverse corpora with child and adult
    voices. In: INTERSPEECH. pp. 3121–3124 (2011)




                                 Page 147 of 171