<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshops at the Third International Conference on Hybrid Human-Artificial Intelligence (HHAI),
June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>RoboTrio2: Annotated Interactions of a Teleoperated Robot and Human Dyads for Data-Driven Behavioral Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frédéric Elisei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Léa Haefflinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gérard Bailly</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Atos</institution>
          ,
          <addr-line>Échirolles</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>0</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>We present here RoboTrio2, an annotated multimodal corpus of interactions between an autonomous-looking social robot and two humans, and the original way we recorded it: an immersive teleoperation of the robot, which makes it behave naturally and efficiently, and captures many signals (gaze including vergence, head and neck movements, the exact subjective stereo views that motivate the decisions in the interaction, binaural audio...). With this high level of embodiment, the pilot provides the robot with demonstrations of the conversational skills needed to conduct a natural interaction with humans and successfully perform the intended task (social interactions in a gaming scenario, with gaze and speech turnovers). The behaviors of its two human partners are also recorded through static HD cameras and headset microphones to ease annotation. Training autonomous behavioral models for our social robot is the main goal of this 8-hour corpus, but the study of elicited human behaviors is also possible with the corpus and annotations we released.</p>
      </abstract>
      <kwd-group>
        <kwd>human-robot interaction</kwd>
        <kwd>social robotics</kwd>
        <kwd>multi-party</kwd>
        <kwd>cooperative game</kwd>
        <kwd>head and gaze orientation</kwd>
        <kwd>immersive teleoperation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays, machine learning needs adequate training data. What do social robots need in order to learn
social interaction with naive humans? Would it be successful to imitate human signals, just
like children do? The generation of verbal and non-verbal behaviors for robots is frequently
based on human-human interaction datasets [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. But robots – even humanoid ones – have
different bodies and capabilities, will not grow up in a human body, and must be prepared to be
alternately considered by users as agents or objects, or even ignored when their behavior is no
longer appropriate...
      </p>
      <p>
        Some studies have already highlighted the differences between Human-Human Interaction
(HHI) and Human-Robot Interaction (HRI): children appear to be more expressive when playing
with another child than with a robot [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the position of a human’s head during turn-taking varies
depending on whether the change occurs between two humans or between a human and
a robot [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], gaze fixations on a robot face last longer than on a human face [6],
and humans modify their prosody when addressing a machine [7]. To tackle this problem and study
human-robot interactions, the Wizard of Oz method is often used [8], where the robot is remotely
controlled by a human using buttons and predefined actions [9]. A second possible method is a
robot controlled by rules [10, 11]. However, both options restrict the possible actions to those that
are predefined. Both introduce a lack of naturalness and fluidity in the interaction and also modify
the transitions. In addition, they often limit the experiment to the study of a single aspect of the
interaction.
      </p>
      <p>If they are to learn by imitation, robots should learn from other robots that have the same
sensors/actuators (body) and that already engage successfully in fluid, natural, ecological interactions
with humans... while humans interact with this yet-to-be-autonomous robot!</p>
      <p>We describe here an original method for collecting such corpora of fluid human/robot
interactions: immersive teleoperation of a humanoid robot can collect multimodal signals
intrinsically adapted to the specific robot sensors/actuators and its specific reaction times, while
bringing the social know-how, language understanding, and decision-making of a human tutor into
the sensory-motor coupling.</p>
      <p>The 8-hour RoboTrio2 corpus [12] delivered here is such an example, with many annotations.
It consists of a task-oriented interaction of a robot with 23 human pairs, collected in French by a
single pilot. This dataset was successfully used to train a machine learning model for robot gaze
control [13]. It could also be used to study conversational modes, gaze and head behavior, or to
compare with behavior from human-human data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Immersive teleoperation</title>
      <p>The robot is a modified iCub [14] that has an articulated mouth (hiding a speaker), mobile
eyes (each embedding a VGA camera) that move like human ones (with 3 degrees of freedom,
including vergence), and a microphone in each ear (with human-shaped ear pinnae). The robot
neck allows human-like orientations of the head (with 3 degrees of freedom). The pilot generates
his interactive behaviors (where to look, what to say, head movements...) from what he hears and
what he sees through the robot sensors, captured by the robot ears and by the pair of cameras
embedded in its mobile eyes. There is no joystick or button press for the pilot: he simply directs
his own face and gaze, which the VR headset and its dual eye-tracker track. His chin and lip
corners carry motion-capture markers: the pilot’s jaw and lip movements, tracked by a Qualisys
motion capture system, drive the robot face in real time to shape its mouth, which relays the
pilot’s speech through the speaker. Logged streams: with our original immersive teleoperation
system [15], all these real-time streams are synchronized, recorded, and used to control the iCub
robot in real time. The pilot’s gaze and head movements are 60 Hz streams. We also record what
the robot does from these motor instructions to drive the equivalent 6 degrees of freedom (3 for
the neck, 3 for the eyes, including vergence).</p>
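      <p>As a rough illustration of this sensory-motor coupling, the sketch below shows how a 60 Hz
pilot sample (head pose plus left/right gaze angles) could be mapped onto the 6 controlled degrees
of freedom and logged. It is a minimal Python sketch with hypothetical names, not the actual
platform API.</p>
      <preformat>
# Minimal sketch of the 60 Hz teleoperation loop (hypothetical names, not the
# actual platform API): pilot head and gaze directly drive the 6 robot DoF.
from dataclasses import dataclass

@dataclass
class PilotSample:           # one 60 Hz sample from the VR headset + dual eye-tracker
    head_rpy: tuple          # pilot head roll/pitch/yaw (degrees)
    gaze_left: tuple         # left-eye azimuth/elevation (degrees)
    gaze_right: tuple        # right-eye azimuth/elevation (degrees)

def to_robot_command(s):
    """Map a pilot sample to the robot's 3 neck + 3 eye DoF (incl. vergence)."""
    neck = s.head_rpy                                      # neck mirrors the pilot head
    elevation = 0.5 * (s.gaze_left[1] + s.gaze_right[1])   # common eye elevation
    version = 0.5 * (s.gaze_left[0] + s.gaze_right[0])     # conjugate eye rotation
    vergence = s.gaze_left[0] - s.gaze_right[0]            # disconjugate rotation
    return {"neck": neck, "eyes": (elevation, version, vergence)}

def log_and_send(sample, robot, logfile):
    """Send one command to the robot and log the paired pilot/robot streams."""
    cmd = to_robot_command(sample)
    robot.send(cmd)                        # hypothetical robot interface
    logfile.write(f"{sample}\t{cmd}\n")    # synchronized logging of both streams
      </preformat>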
    </sec>
    <sec id="sec-3">
      <title>3. The RoboTrio2 corpus</title>
      <p>The corpus involves a cooperative game played by the two humans. They sit in front of a
social robot that acts as a game animator and referee. This robot is teleoperated as previously
described, resulting in a high level of embodiment. What is demonstrated by the pilot is a viable
solution with the specific robot sensors/actuators to conduct a natural real-time interaction
with humans (decoding and generating meaningful gaze and aversion, speech turnovers...) to
successfully perform the intended social interaction needed by the gaming scenario. What is
experienced by the human players is an autonomous-looking robot that utilizes natural
language and ecological head and gaze patterns to perform the joint task.</p>
      <p>This 8-hour corpus logs data streams and events linking perception and action, making it ideal
for building autonomous behavior models for our social robot, Nina, a modified iCub. But with
all the provided annotations, it can also be used to study all the humans that interacted with this
robot: we recorded 23 interactions with different human pairs (either male or female), while the
robot was always teleoperated by the same human pilot (to help build a coherent one-to-many robot
behavior model).</p>
      <p>The game: It is played by a team of two humans, trying to find the words most commonly
associated with a given theme (previously played online by other human players). E.g. for the
“eat” theme, the words that would score the most are “drink”, “food”, “lunch”, “dinner”, “swallow”
and “feed”. The same 9 theme words are played in all the games, and 5 answers are collected
per theme. During the game, our players collaborate to find the best answers and look at or question
the robot at will. This scenario generates a lot of interaction and social cues: thinking about the
theme, brainstorming and debating potential answers, etc. The robot guides them as its human
pilot would, and frequently takes part in the conversation. The corpus is both complex and rich
in verbal and non-verbal content for the players and the robot (mutual gaze, gaze aversion,
speech overlap, backchannels...). Videos, annotations and extra details can be seen online [16].</p>
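      <p>As a purely illustrative aid (the actual answer lists and point values are documented with
the released corpus, not reproduced here), the game logic can be pictured as a lookup into the
answers previously collected online; all words and scores in the sketch below are invented.</p>
      <preformat>
# Purely illustrative sketch of the game logic: the words and point values are
# invented here; the real answer lists come from games previously played online.
TOP_ANSWERS = {
    "eat": {"drink": 50, "food": 42, "lunch": 31, "dinner": 24, "swallow": 18, "feed": 12},
}

def score_proposal(theme, word):
    """Points earned by one validated proposal (0 if it is not a top answer)."""
    return TOP_ANSWERS.get(theme, {}).get(word.lower(), 0)

def play_round(theme, proposals, max_answers=5):
    """Score the (up to 5) proposals collected for one theme."""
    return sum(score_proposal(theme, w) for w in proposals[:max_answers])
      </preformat>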
      <p>To ease the post-recording annotations, the two human players are also recorded by two fixed
HD cameras (synchronized with the Qualisys capture system). These are used neither by the robot
nor by the pilot, but were helpful in annotating the interaction signals and meaningful events (gaze
directed to the robot/other player, prephonatory gestures, thinking attitudes...), whereas the
first-person cameras are low resolution and exhibit motion blur.</p>
      <p>We also ran OpenFace [17] to extract head rotations and eye movements as well as FACS
Action Units for every player (seen by the HD cameras), giving access to higher-level events (e.g.
lip opening for prephonatory gestures).</p>
      <sec id="sec-3-1">
        <title>3.1. What has been annotated</title>
        <p>We use Elan from MPI [18] to concentrate all the multi-channel audio and video streams in
parallel tracks, plus the annotations of the robot streams/motion capture that form the corpus.
Figure 2 shows some of the hierarchical annotations of the verbal content.</p>
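        <p>As a hint of how the released ELAN files can be exploited programmatically, the minimal
sketch below reads one annotation tier with the pympi library; the file and tier names are
hypothetical and should be checked against the actual .eaf files.</p>
        <preformat>
# Minimal sketch: reading one ELAN tier with pympi (pip install pympi-ling).
# File and tier names are hypothetical; check the released .eaf files.
from pympi.Elan import Eaf

eaf = Eaf("robotrio2_session01.eaf")
print(eaf.get_tier_names())

# (start_ms, end_ms, value) triplets of one tier, e.g. the pilot speech acts
pilot_acts = eaf.get_annotation_data_for_tier("pilot_speech_acts")
total_s = sum(end - start for start, end, _ in pilot_acts) / 1000.0
print(len(pilot_acts), "annotations,", round(total_s, 1), "s annotated")
        </preformat>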
        <p>Speech transcriptions: Speech has been transcribed for the pilot as well as the human players.
In our specific scenario, some spot-words play specific roles and have been annotated specifically:
the themes that the referee gives to the players (known beforehand), and all the words that the
players may give as a proposition, or discuss together before making a formal proposition.</p>
        <p>Speech acts of the robot/pilot: We listed 23 different classes for the pilot speech intents,
including: ask for a proposition (or a validation), repeat a proposition or the theme again, give the
score/the theme/an explanation or feedback, wait for players after each round.</p>
        <p>Gaze of the robot/pilot: The pilot gaze focal point is computed from the 60 Hz recordings
of the pilot’s head and eye movements. After detecting ocular saccades, these points were
classified using Gaussian Mixture Models (GMM) into 4 different targets: LeftUser (leftmost
player), RightUser (rightmost player), Tablet (live game info), and Elsewhere.</p>
        <p>Gaze of the users: By combining the players’ head and eye positions provided by OpenFace,
their gaze was classified using GMM (after detection of ocular saccades) according to three
targets: Robot, other User, Elsewhere.</p>
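        <p>The target classification described above can be reproduced in spirit with a standard
Gaussian Mixture Model, e.g. from scikit-learn; the sketch below is a simplified version under
stated assumptions (2-D gaze angles from fixations only, one component per target, and a
component-to-target mapping decided by inspecting the fitted means).</p>
        <preformat>
# Simplified sketch of the GMM-based gaze-target labelling with scikit-learn.
# The real processing first removes ocular saccades and runs on the 60 Hz data;
# mapping anonymous components to semantic targets is decided by hand here.
import numpy as np
from sklearn.mixture import GaussianMixture

def label_gaze_targets(fixations, target_names, seed=0):
    """fixations: (N, 2) array of gaze azimuth/elevation during fixations.
    target_names: one semantic label per component, ordered by azimuth,
    e.g. ["LeftUser", "Tablet", "RightUser", "Elsewhere"] for the pilot."""
    gmm = GaussianMixture(n_components=len(target_names), random_state=seed)
    components = gmm.fit_predict(np.asarray(fixations))
    # Components come out in arbitrary order: rank them by mean azimuth so that
    # the caller-provided names line up with left-to-right targets.
    order = np.argsort(gmm.means_[:, 0])
    mapping = {int(comp): target_names[rank] for rank, comp in enumerate(order)}
    return [mapping[int(c)] for c in components]
        </preformat>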
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Statistics on the corpus</title>
      <p>Of the 23 recorded sequences, 11 (nearly 4 hours) are fully annotated, both verbally and
non-verbally. To illustrate the richness and interest of this corpus, this section presents some
statistics on the behavior of the pilot and the users, as well as some findings.</p>
      <p>Verbal statistics: As the roles in this corpus are asymmetrical, the verbal behaviors of the
pilot/robot and of the players differ. As seen in Table 1, the number of utterances is equivalent between
the participants, but the average duration of these utterances, and therefore the total speaking time,
differ significantly. Indeed, users produce a lot of backchannels (a few hundred) and positive/negative
feedback (almost 2,000) to share their reactions to the proposals or scores given, resulting
in very short utterances, unlike the pilot, who animates the game and may have to use longer
sentences when announcing the theme or scores.</p>
      <p>Concerning the pilot’s intentions, 10 of the 23 classes occurred in at least 90 utterances each.
The 3 most frequent are “ask for the validation of a proposal”, “give score” and “ask for a proposal”,
with 574, 367 and 363 utterances respectively.</p>
      <p>Furthermore, as this corpus is multi-party, the pilot can address one player or both at the same
time. This allows the study of the different participant roles in a conversation (speaker, listener,
side participant...) and of their regulation, called “Footing” in [19]. To obtain an indication of
who the pilot’s addressee(s) are, we detected the use of French pronouns in his sentences: “Tu”
(you) for one addressee vs. “Vous” (you) for both. As a result, 172 utterances contain the “Tu”
pronoun and 632 the “Vous” pronoun. Players’ first names were also used in 139 utterances. This
corpus has already been used successfully to compare the different behaviors of the teleoperated
robot’s head depending on whether the pilot was addressing one or both players [20].</p>
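      <p>This addressee heuristic amounts to simple word spotting in the pilot transcriptions; a
minimal sketch is given below (the regular expressions are simplified and the first names are
hypothetical placeholders).</p>
      <preformat>
# Minimal sketch of the addressee heuristic on the pilot transcriptions:
# spot the French pronouns "tu" (one addressee) vs "vous" (both players).
import re

TU = re.compile(r"\btu\b", re.IGNORECASE)       # one addressee
VOUS = re.compile(r"\bvous\b", re.IGNORECASE)   # both players

def addressee_cues(utterance, first_names=("Alice", "Bob")):  # hypothetical names
    """Return the set of addressee cues spotted in one pilot utterance."""
    cues = set()
    if TU.search(utterance):
        cues.add("tu")
    if VOUS.search(utterance):
        cues.add("vous")
    if any(re.search(r"\b" + re.escape(name) + r"\b", utterance, re.IGNORECASE)
           for name in first_names):
        cues.add("first_name")
    return cues
      </preformat>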
      <p>Gaze statistics: As gaze is one of the most important non-verbal cues in human and HRI
conversations [21], this section presents some general statistics on it. First, Figure 3 shows the
gaze distributions of the pilot/robot and the users for the 11 sequences. Both users are looked at
equally by the pilot (there may be some variation within sequences, but not globally). We can
also see the significant use of the tablet, which is looked at by the robot/pilot to retrieve needed
information. The players looked at the robot a lot, indicating that they did not put the robot
aside and included it in the conversation. The Elsewhere class is also well represented, due
to the many moments when players need to think.</p>
      <p>As the pilot’s behavior can be affected by his role in the conversation, speaker or listener,
Table 2 shows in more detail the distribution of his gaze according to his activity. When he is
speaking, he looks more at the tablet that displays information about the game in progress. When
one of the players is speaking, the pilot will often look at him/her (green cells); however, the
proportion of his gaze directed at the other player is not low (yellow cells). The role of
game facilitator requires the pilot to observe the reactions of the players and motivate their
collaboration, making his behavior a complex one (best generated with machine learning and ad
hoc data).</p>
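      <p>Distributions such as the one in Table 2 can be recomputed from the released annotations by
intersecting the speech-activity intervals with the gaze-target intervals; the sketch below assumes
both tiers have already been exported as (start, end, label) triplets in seconds.</p>
      <preformat>
# Sketch: share of the pilot's gaze targets conditioned on who is speaking.
# Both tiers are assumed to be lists of (start_s, end_s, label) triplets.
from collections import defaultdict

def overlap(a, b):
    """Overlap duration (in seconds) between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def gaze_given_activity(gaze_intervals, activity_intervals):
    """For each activity label (e.g. 'pilot_speaks', 'left_user_speaks'),
    return the time share spent on each gaze target during that activity."""
    totals = defaultdict(lambda: defaultdict(float))
    for a_start, a_end, activity in activity_intervals:
        for g_start, g_end, target in gaze_intervals:
            totals[activity][target] += overlap((a_start, a_end), (g_start, g_end))
    shares = {}
    for activity, per_target in totals.items():
        duration = sum(per_target.values()) or 1.0
        shares[activity] = {t: d / duration for t, d in per_target.items()}
    return shares
      </preformat>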
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We have shown here how the immersive teleoperation of a robot can produce a valuable corpus,
especially for training social robotics models. In contrast to human-human corpora [22, 23], our
data may capture the changes of expectations and behavior that humans exhibit in front of robotic
bodies. In addition, this setup provides data from a fluid interaction in HRI, where the robot’s
behaviors are less constrained than with the usual Wizard of Oz or rule-based methods [24, 25, 26].
The recorded behaviors are also suitable for studying conversational modes and gaze and head
behavior in natural HRI. As an example, we provide RoboTrio2, an 8-hour annotated multimodal
corpus of a multi-party game (available online [12]), and outline here both some of its contents
and first results from its analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This data collection was funded by a CNRS S2IH PEPS project, involving GIPSA-lab, LPL
and INT. We are grateful to our subjects and to the people who contributed to the immersive
teleoperation platform (M. Sauze, R. Cambuzat, and C. Plasson) and to the RoboTrio2
recording/annotation (N. Loudjani, O. Granier, and J. Rengot). Part of this work is funded by ANR
19-P3IA-0003 MIAI and a PhD granted by ANRT (2021/0836).
</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[6] C. Yu, P. Schermerhorn, M. Scheutz, Adaptive eye gaze patterns in interactions with human and artificial agents, ACM Transactions on Interactive Intelligent Systems (TiiS) 1 (2012) 1–25.</p>
      <p>[7] E. Shriberg, A. Stolcke, D. Hakkani-Tür, L. P. Heck, Learning when to listen: Detecting system-addressed speech in human-human-computer dialog, in: INTERSPEECH, 2012, pp. 334–337.</p>
      <p>[8] L. D. Riek, Wizard of Oz studies in HRI: A systematic review and new reporting guidelines, J. Hum.-Robot Interact. 1 (2012) 119–136.</p>
      <p>[9] N. Dahlbäck, A. Jönsson, L. Ahrenberg, Wizard of Oz studies: Why and how, in: Proceedings of the 1st International Conference on Intelligent User Interfaces, IUI ’93, Association for Computing Machinery, New York, NY, USA, 1993, pp. 193–200.</p>
      <p>[10] G. Skantze, M. Johansson, J. Beskow, Exploring turn-taking cues in multi-party human-robot discussions about objects, in: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 67–74.</p>
      <p>[11] M. Moujahid, H. Hastie, O. Lemon, Multi-party interaction with a robot receptionist, in: Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’22, IEEE Press, 2022, pp. 927–931.</p>
      <p>[12] F. Elisei, L. Haefflinger, L. Prévot, G. Bailly, The RoboTrio2 corpus, 2023. URL: https://hdl.handle.net/11403/robotrio/v2, ORTOLANG (Open Resources and TOols for LANGuage) – www.ortolang.fr.</p>
      <p>[13] L. Haefflinger, F. Elisei, B. Bouchot, B. Varini, G. Bailly, Data-driven generation of eyes and head movements of a social robot in multiparty conversation, in: International Conference on Social Robotics, Springer, 2023, pp. 191–203.</p>
      <p>[14] A. Parmiggiani, M. Randazzo, M. Maggiali, F. Elisei, G. Bailly, G. Metta, An articulated talking face for the iCub, in: 2014 IEEE-RAS International Conference on Humanoid Robots, 2014, pp. 1–6. doi:10.1109/HUMANOIDS.2014.7041309.</p>
      <p>[15] R. Cambuzat, F. Elisei, G. Bailly, O. Simonin, A. Spalanzani, Immersive teleoperation of the eye gaze of social robots: Assessing gaze-contingent control of vergence, yaw and pitch of robotic eyes, in: ISR 2018 – 50th International Symposium on Robotics, VDE, Munich, Germany, 2018, pp. 232–239.</p>
      <p>[16] F. Elisei, Presentation of the RoboTrio corpus, https://www.gipsa-lab.grenoble-inp.fr/~frederic.elisei/RoboTrio, 2024. [Online; accessed 1-April-2024].</p>
      <p>[17] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: A general-purpose face recognition library with mobile applications, Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.</p>
      <p>[18] H. Brugman, A. Russel, X. Nijmegen, Annotating multi-media/multi-modal resources with ELAN, in: LREC, 2004, pp. 2065–2068.</p>
      <p>[19] E. Goffman, Footing, Semiotica 25 (1979) 1–30.</p>
      <p>[20] L. Haefflinger, F. Elisei, S. Gerber, B. Bouchot, J.-P. Vigne, G. Bailly, On the benefit of independent control of head and eye movements of a social robot for multiparty human-robot interaction, in: M. Kurosu, A. Hashizume (Eds.), Human-Computer Interaction, Springer Nature Switzerland, Cham, 2023, pp. 450–466.</p>
      <p>[21] H. Admoni, B. Scassellati, Social eye gaze in human-robot interaction: A review, J. Hum.-Robot Interact. 6 (2017) 25–63.</p>
      <p>[22] A. D. Marshall, P. L. Rosin, J. Vandeventer, A. Aubrey, 4D Cardiff Conversation Database (4D CCDb): A 4D database of natural, dyadic conversations, Auditory-Visual Speech Processing (AVSP) 2015 (2015) 157–162.</p>
      <p>[23] O. Celiktutan, E. Skordos, H. Gunes, Multimodal human-human-robot interactions (MHHRI) dataset for studying personality and engagement, IEEE Transactions on Affective Computing 10 (2017) 484–497.</p>
      <p>[24] D. B. Jayagopi, S. Sheikhi, D. Klotz, J. Wienke, J.-M. Odobez, S. Wrede, V. Khalidov, L. Nguyen, B. Wrede, D. Gatica-Perez, The Vernissage corpus: A conversational human-robot-interaction dataset, in: 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, 2013, pp. 149–150.</p>
      <p>[25] A. Ben-Youssef, C. Clavel, S. Essid, M. Bilac, M. Chamoux, A. Lim, UE-HRI: A new dataset for the study of user engagement in spontaneous human-robot interactions, in: Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI ’17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 464–472.</p>
      <p>[26] E. Kesim, T. Numanoglu, O. Bayramoglu, B. B. Turker, N. Hussain, M. Sezgin, Y. Yemez, E. Erzin, The eHRI database: A multimodal database of engagement in human–robot interactions, Language Resources and Evaluation (2023) 1–25.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shintani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Ishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ishiguro</surname>
          </string-name>
          ,
          <article-title>Analysis of role-based gaze behaviors and gaze aversions, and implementation of robot's gaze control for multi-party dialogue</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Human-Agent Interaction, HAI '21</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>332</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Oertel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jonell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontogiorgos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. F.</given-names>
            <surname>Mora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Odobez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          ,
          <article-title>Towards an engagement-aware attentive artificial listener for multi-party interactions</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>555913</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Domingo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gómez-García-Bermejo</surname>
          </string-name>
          , E. Zalama,
          <article-title>Optimization and improvement of a robotics gaze control system using lstm networks</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>81</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shahid</surname>
          </string-name>
          , E. Krahmer,
          <string-name>
            <given-names>M.</given-names>
            <surname>Swerts</surname>
          </string-name>
          ,
          <article-title>Child-robot interaction across cultures: How does playing a game with a social robot compare to playing a game alone or with a friend?</article-title>
          ,
          <source>Computers in Human Behavior</source>
          <volume>40</volume>
          (
          <year>2014</year>
          )
          <fpage>86</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Johansson</surname>
          </string-name>
          , G. Skantze,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          ,
          <article-title>Head pose patterns in multiparty human-robot team-building interactions</article-title>
          ,
          <source>in: Social Robotics: 5th International Conference, ICSR 2013, Bristol, UK, October 27-29, 2013, Proceedings 5</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>351</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>