<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Robust Multimodal Command Interpretation for Human-Multirobot Interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan Cacace</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Finzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Lippiello</string-name>
          <email>lippiellog@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Napoli Federico II</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we propose a multimodal interaction framework for robust human-multirobot communication in outdoor environments. In these scenarios, several human or environmental factors can cause errors, noise, and wrong interpretations of the commands. The main goal of this work is to improve the robustness of human-robot interaction systems in such situations. In particular, we propose a multimodal fusion method based on the following steps: for each communication channel, unimodal classifiers are first deployed in order to generate unimodal interpretations of the human inputs; the unimodal outcomes are then grouped into different multimodal recognition lines, each representing a possible interpretation of a sequence of multimodal inputs; these lines are finally assessed to recognize the human commands. We discuss the system at work in a real-world case study in the SHERPA domain.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this work, we tackle the problem of robust multimodal communication
between a human operator and a team of robots during the execution of a shared
task in outdoor environments. In these scenarios, the robots should be able to
timely respond to the operator's commands, minimizing chances of
misunderstanding due to noise or user errors. This crucial problem is well illustrated by
the domain of the SHERPA project [
        <xref ref-type="bibr" rid="ref10 ref3">10, 3</xref>
        ], whose goal is to develop a mixed
ground and aerial robotic platform supporting search and rescue (SAR)
activities in an alpine scenario. One of the peculiar aspects of the SHERPA domain is
the presence of a special rescue operator, called the busy genius, that cooperates
with a team of aerial vehicles in order to accomplish search and rescue missions.
In this context, the human operator is not fully dedicated to the control of the
robots, but is also involved in the rescue operations. On the other hand, he/she
can exploit light wearable devices to orchestrate the robotic team operations in
a multimodal manner, using voice- and gesture-based commands, enabling
fast and natural interaction with the robots. This scenario challenges the
command recognition system, since the environment is unstructured and noisy,
the human is under pressure, and the commands are issued in a fast and sparse
manner. To support the operator in such scenarios, a robust and
reliable multimodal recognition system is a crucial component. In multimodal
interaction frameworks [
        <xref ref-type="bibr" rid="ref12 ref2 ref4 ref7 ref8">8, 2, 4, 7, 12</xref>
        ], multimodal fusion is a key issue, and different strategies have been proposed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to combine the data provided by multiple
input channels (gestures, speech, gaze, body postures, etc.). Analogously to [
        <xref ref-type="bibr" rid="ref11 ref9">11,
9</xref>
        ], in order to make the system interaction robust, extensible, and natural, we
adopt a late fusion approach where the multimodal inputs provided by the
human are first processed by dedicated unimodal classifiers (gesture recognition,
speech recognition, etc.) and then recognized by combining these outcomes. In
this setting, multimodal data are usually first synchronized and then interpreted
according to rules or other classification methods. In contrast with these
solutions, in this work we propose a novel multimodal fusion approach that avoids
explicit synchronization among incoming multimodal data and is robust with
respect to several sources of error, from human mistakes (e.g. delays in
utterances or gestures, wrong or incomplete sequencing, etc.) and environmental
disturbances (e.g. wind, external noises) to unimodal classification failures. The
main idea behind the approach is to continuously assess multiple ways to
combine the incoming multimodal inputs in order to obtain the subset of
events that best represents a human multimodal command. In particular,
command recognition is performed in two decision steps. In the first one, we generate
multiple hypotheses on multimodal data association given a Bayesian model of
the user's way of invoking commands. For this purpose, we estimate the
probability that new samples are related to others already received. Then, in a second
step, a Naive Bayes classifier is deployed to select the most plausible command
given the possible data associations provided by the previous step.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Multimodal Human-Robot Interaction Architecture</title>
      <p>
        In Figure 1(a) we illustrate the human-multirobot architecture. The human
operator interacts with the robotic platform through different communication
channels (i.e. voice, arm gestures, touch gestures, and hand poses) by means of
his/her wearable devices. In particular, the operator exploits a headset to issue
vocal commands, a motion and gesture control bracelet (Myo Thalmic Armband 1),
and a mobile device (tablet) with a touch-based user interface. The multimodal
interaction system (MHRI) should then interpret these commands, passing them
to the Distributed Multi-Robot Task Allocation (DMRTA) module (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for details).
In this work, we focus on the MHRI, describing the multimodal command
recognition system illustrated in Figure 1(b). Raw device data are directly sent to and
simultaneously elaborated by the unimodal classifiers C0, ..., Cn in order to
generate the unimodal samples si. These samples are then received by the Multimodal
Fusion module, which generates different recognition lines {L0, ..., Lm} exploiting the
Bayesian Network and the Training Set. Each recognition line is
successively interpreted as a user command by the Command Classification module.
1 https://www.myo.com/
(a) Human-Robot Interaction architecture. (b) Multimodal Recognition System.
      </p>
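      <p>As a minimal sketch of the data flowing through this pipeline (the class and field names below are ours, for illustration only; the paper does not publish an implementation), the unimodal samples and recognition lines can be modelled as:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Sample:
    """A unimodal sample s_i = (w_i, ch_i, t_i)."""
    word: str      # label w_i produced by a unimodal classifier
    channel: str   # input channel ch_i, e.g. "voice", "arm", "touch", "hand"
    time: float    # arrival time t_i in seconds

@dataclass
class RecognitionLine:
    """One hypothesised grouping of samples that may form a single command."""
    samples: List[Sample] = field(default_factory=list)

    def add(self, s: Sample) -> None:
        self.samples.append(s)

    def words(self) -> List[str]:
        return [s.word for s in self.samples]

# A voice sample and an arm-gesture sample grouped on one line:
line = RecognitionLine()
line.add(Sample("take_off", "voice", 0.4))
line.add(Sample("point_up", "arm", 0.9))
print(line.words())  # ['take_off', 'point_up']
```

The Multimodal Fusion module maintains many such lines at once, one per hypothesised grouping of the incoming samples.</p>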
    </sec>
    <sec id="sec-3">
      <title>Command Recognition</title>
      <p>Multimodal command recognition relies on a late fusion approach in which
heterogeneous inputs provided by the user through different channels are first
classified by unimodal recognizers and then fused together in order to be interpreted as
human commands. More specifically, given a sequence of inputs S generated by
the unimodal classifiers, the command recognition problem consists in finding the
command c that maximizes the probability value P(c|S). This problem is here
formulated as follows. We assume a set C = {c0, c1, ..., ck} of possible commands
invokable by the operator. Each command is issued in a multimodal manner,
hence it is associated with a sequence of unimodal inputs S = {s0, ..., sn}, each
represented by the triple si = (wi, chi, ti), where wi ∈ W is the label provided
by the unimodal classifier associated with the channel chi ∈ I, and ti ∈ R+
is its time of arrival. In our approach, the user commands are interpreted in
two decision steps: first, the outputs of the unimodal classifiers are fused together
(Multimodal Fusion) in order to be assessed and recognized as user commands in
the second step (Command Recognition).</p>
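      <p>The two decision steps can be summarized with the following control-flow skeleton (a sketch; the function names and toy stand-ins are ours, not the authors' implementation):

```python
def interpret(samples, fuse, classify):
    """Two-step command interpretation (sketch).

    samples:  (w_i, ch_i, t_i) triples in arrival order.
    fuse:     step 1 -- updates the recognition set with a new sample.
    classify: step 2 -- maps one recognition line to (command, score).
    """
    recognition_set = []
    for s in samples:
        recognition_set = fuse(recognition_set, s)
    scored = [classify(line) for line in recognition_set]
    return max(scored, key=lambda cs: cs[1])  # best-scoring interpretation

# Toy stand-ins that only exercise the control flow:
fuse = lambda rset, s: rset + [[s]]              # every sample opens a new line
classify = lambda line: (line[0][0], len(line))  # score = line length
cmd, score = interpret([("land", "voice", 0.1), ("go_down", "arm", 0.5)],
                       fuse, classify)
print(cmd)  # 'land'
```

In the actual system, `fuse` also tries to append the sample to existing lines via the Bayesian Network, and `classify` is the Naive Bayes classifier described next.</p>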
      <p>Multimodal Fusion. The multimodal fusion step allows the system to select and
group together unsynchronized inputs provided by the unimodal classifiers and
associated with the same current command. For this purpose, in correspondence
to the input sequence S of unimodal classified data, we generate different
possible subsets of elements, called Recognition Lines, each representing a possible
way to associate these inputs with the invoked command. Therefore, during the
command interpretation process, different Recognition Lines are generated and
collected into a Recognition Set in order to be interpreted in the second step.
These multiple ways of grouping the inputs allow the proposed framework to
fuse unsynchronized unimodal inputs in a robust fashion, coping with
disturbances like environmental noise, command invocation errors, or failures of the
single unimodal recognition subsystems. The Recognition Line generation
process works as follows. First of all, for each new input, a new Recognition Line
containing only this datum is generated; the incoming datum is then also assessed
in order to be included into the other Recognition Lines already available in the
Recognition Set. In order to assign an input sample to a Recognition Line we
rely on a Bayesian Network (BN) approach, suitably trained to infer the
probability that a new incoming unimodal sample sn belongs to a Recognition
Line given the others s0, ..., si already associated with the same line. Specifically,
the BN proposed in this work consists of three different nodes (see Figure 1(c)):
the Word node, which contains the list of input data in the recognition line; the
Channel node, which stands for the input channels; and the Line node, which represents
the probability that new incoming samples belong to the considered line.
In this setting, a received input datum is associated with a recognition line if the
probability of belonging to that line is greater than a suitable threshold θ1 and the
temporal distance of the received sample (sr) with respect to the previous one
(sp) on the same line is within a specific interval (|t_sr − t_sp| &lt; δ).
Command Recognition. In the command recognition phase, the previously
generated Recognition Lines are interpreted as user commands. Our approach
exploits a Naive Bayes classifier to associate each element of the Recognition
Set with a label and a score representing, respectively, the recognized command
class and its classification probability. More specifically, given a sequence of
samples S = {s0, ..., sn}, the list of semantic labels Sw = {w0, ..., wn} is extracted.
Given the list of possible commands c0, ..., ck, the class ĉ and its score are assessed
through the formula: ĉ = arg max_{c ∈ C} P(c) ∏_{i=1}^{|Sw|} p(c|wi). Once all the
Recognition Lines have been classified, the line with the maximum score is selected as
the recognized user command (see Figure 1(d)). Also in this case, a command is
properly recognized only if the probability returned by the Naive Bayes classifier
is higher than a trained threshold θ2.</p>
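      <p>The line-membership gating test and the Naive Bayes scoring can be sketched as follows (the threshold values and probability tables are illustrative assumptions of ours; the real ones are learned during training):

```python
from math import prod

# Illustrative values; the paper learns these in a dedicated training phase.
THETA_1 = 0.6   # minimum BN probability for a sample to join a line
THETA_2 = 0.5   # minimum normalised classifier probability to accept
DELTA   = 1.5   # maximum temporal distance between consecutive samples (s)

def belongs(p_line: float, t_new: float, t_prev: float) -> bool:
    # Gating test: a sample joins a recognition line only if both the
    # membership probability and the temporal-distance checks pass.
    return p_line > THETA_1 and abs(t_new - t_prev) < DELTA

def nb_score(command, words, prior, p_cond):
    # Naive Bayes score P(c) * prod_i p(c | w_i) for one recognition line.
    return prior[command] * prod(p_cond[w][command] for w in words)

prior  = {"take_off": 0.5, "land": 0.5}
p_cond = {"you": {"take_off": 0.5, "land": 0.5},
          "up":  {"take_off": 0.9, "land": 0.1}}

words  = ["you", "up"]                       # labels on one recognition line
scores = {c: nb_score(c, words, prior, p_cond) for c in prior}
best   = max(scores, key=scores.get)
accepted = scores[best] / sum(scores.values()) > THETA_2
print(best, accepted)  # take_off True
```

A sample that passes `belongs` extends an existing line; a line whose best normalised score stays below the acceptance threshold is discarded rather than executed.</p>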
      <p>(c) Bayesian Network for
multimodal command fusion.</p>
      <p>(d) Recognition lines and scores.</p>
      <p>System Training. The multimodal system is trained exploiting a Training Set
that collects, for each sample: the requested command coupled with the
generated samples, the associated channel, and the elapsed time between the samples.
In this way, the Bayesian Network for multimodal fusion is trained with the list
of pairs (wi, chi) for each command invocation in the dataset. The command
recognition system is trained with the lists of labels wi of the samples used to
interpret the user commands. Moreover, once the multimodal fusion system has been
trained, a final training session is needed to adapt the thresholds (θ1, θ2, δ).
This is obtained by asking the users to validate both the generated Recognition
Lines and the associated classification results.</p>
    </sec>
    <sec id="sec-4">
      <title>SHERPA Case Study</title>
      <p>
        The proposed system has been demonstrated and tested in a real alpine
environment. In order to communicate with the robotic platforms, the operator
is equipped with wearable devices: a standard headset and a mobile device (tablet),
along with a gesture/motion control bracelet. Speech recognition is based on
the PocketSphinx 2 software, adopting a bag-of-words model instead of the more
commonly used context-free grammars. Grammar-based models exploit the word
ordering in the sentence, which is not reliable in our setting: the user can
accidentally skip words because the interaction is sparse and incomplete, and the
recognizer can fail to catch words because the environment is noisy. In contrast,
we adopt a less restrictive model where the recognized sentences are represented
as bags of words, which are then further processed in the late fusion step of the
multimodal recognition system described above. Gesture-based commands are
used to control the robotic team via complete or complementary information
(e.g. pointing or mimicking gestures). We designed and implemented a continuous
gesture recognition module based on the approach of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Gesture classification
is here based on the acceleration of the operator's arm, which is detected
by a lightweight IMU-based bracelet. We defined 14 different types of gestures
used to invoke high-level actions (e.g. directional movements, circles, approaching,
etc.). These gestures have been trained using a dataset that collects gestures
from 30 users, each providing 10 trials of each gesture class. The operator is
also able to issue commands by drawing 2D gestures on a touch user interface
(see Figure 1(e)). In this case, areas to explore, trajectories, or target points can
be specified using geometrical shapes like circles, squares, or lines, possibly
paired with voice information. The operator can also specify commands, or parts
of them, using hand poses. The hand pose recognition system is implemented
exploiting the built-in Myo Armband classifier, able to discriminate five different
hand poses from the EMG sensors, namely double tap, spread, wave left, wave
right, and make fist. As for the user dataset, we mainly focus on commands
suitable for interacting with a set of co-located drones during navigation and
search tasks. Namely, selection commands enable the operator to select single
robots or groups of robots; for this purpose the operator can issue speech (e.g. all drones
take off, red drone land) or speech and gestures in combination (e.g. you go down),
as well as touch gestures on the user interface. Similar combinations of modalities
can be exploited to invoke motion and search commands during navigation and
exploration tasks.
2 http://wiki.ros.org/pocketsphinx
(e) Touch Screen User Interface. In red, an area to explore; in green, a path to navigate.
(f) Human operator interacting with multiple drones in a snow-clad field.
      </p>
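      <p>For instance, the bag-of-words treatment of a recognized utterance can be sketched as an order-free keyword overlap (the vocabulary and command names below are illustrative; the deployed word lists are not published):

```python
# Illustrative command vocabulary (not the system's actual word lists).
COMMANDS = {
    "take_off": {"take", "off", "up"},
    "land": {"land", "down"},
    "explore_area": {"explore", "area", "search"},
}

def match(utterance: str) -> str:
    """Order-free matching: score each command by keyword overlap."""
    words = set(utterance.lower().split())
    return max(COMMANDS, key=lambda c: len(words & COMMANDS[c]))

# Word order and extra words do not affect the result:
print(match("off drones take"))         # take_off
print(match("search the area please"))  # explore_area
```

Because only set membership matters, a dropped or out-of-order word degrades the score gracefully instead of breaking a grammar parse, which is what makes this model suitable for the noisy outdoor setting.</p>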
      <p>System Training. The overall system requires three training sessions. The first
one concerns the setup of the unimodal classifiers. The second training phase
concerns the multimodal fusion engine. It requires the Training Set introduced
above, exploited by the system to learn how the operator generates commands,
that is, how he/she composes the unimodal samples to invoke commands. Notice
that in our scenario the operator is an expert rescuer already aware of the
system and the operative domain; therefore, we trained the system with 4 trained
users (involved in the research project), asking them to repeat 45 commands 10
times each. The collected data are then used to train both the multimodal fusion
and the command recognition systems. A final training phase is needed to tune
the θ1 and θ2 thresholds.</p>
      <p>
        System Testing. The robotic platform setup and the scenario are analogous to the
one described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The testing site is the one depicted in Figure 1(f). In this
context, we collected data from 14 different missions, lasting about 15 minutes
each and performed on two different days. A more extended description and
discussion of these tests can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; here we only summarize the main
results about the system's robustness to noisy communication. Specifically, we
collected data about 107 commands (and 708 samples), achieving a success rate
of 96.8%, even though more than half of the samples generated by the user
were marked as mistakes and rejected by the multimodal fusion algorithm
(66.9% rejected samples); among these, 74.3% were correctly rejected from
the recognition line exploited for multimodal classification.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>The research leading to these results has been supported by the
FP7-ICT600958 SHERPA, ERC AdG-320992 RoDyMan, and H2020-ICT-731590
REFILLs projects. The authors are solely responsible for the content of this paper. It
does not represent the opinion of the European Community, and the Community
is not responsible for any use that might be made of the information contained
therein.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16(6), 345–379 (2010)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Bannat, A., Gast, J., Rehrl, T., Rösel, W., Rigoll, G., Wallhoff, F.: A multimodal human-robot-interaction scenario: Working together with an industrial robot. In: Human-Computer Interaction. Novel Interaction Methods and Techniques, 13th International Conference, HCI International 2009, San Diego, CA, USA, July 19-24, 2009, Proceedings, Part II, pp. 303–311 (2009)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Bevacqua, G., Cacace, J., Finzi, A., Lippiello, V.: Mixed-initiative planning and execution for multiple drones in search and rescue missions. In: Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, pp. 315–323. ICAPS'15, AAAI Press (2015)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Burger, B., Ferrane, I., Lerasle, F., Infantes, G.: Two-handed gesture recognition and fusion with speech to command a robot. Autonomous Robots 32(2), 129–147 (2012)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Cacace, J., Finzi, A., Lippiello, V., Furci, M., Mimmo, N., Marconi, L.: A control architecture for multiple drones operated via multimodal interaction in search and rescue mission. In: Proc. of SSRR 2016, pp. 233–239 (Oct 2016)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Cacace, J., Finzi, A., Lippiello, V.: A robust multimodal fusion framework for command interpretation in human-robot cooperation. In: 26th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN 2017, Lisbon, Portugal, August 28 - Sept. 1, 2017, pp. 372–377 (2017)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Dumas, B., Lalanne, D., Oviatt, S.L.: Multimodal interfaces: A survey of principles, models and frameworks. In: Lalanne, D., Kohlas, J. (eds.) Human Machine Interaction, Lecture Notes in Computer Science, vol. 5440, pp. 3–26. Springer (2009)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Holzapfel, H., Nickel, K., Stiefelhagen, R.: Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures. In: Proc. of ICMI 2004, pp. 175–182. ACM (2004)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Lucignano, L., Cutugno, F., Rossi, S., Finzi, A.: A dialogue system for multimodal human-robot interaction. In: Proc. of ICMI 2013, pp. 197–204. ACM (2013)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Marconi, L., Melchiorri, C., Beetz, M., Pangercic, D., Siegwart, R., Leutenegger, S., Carloni, R., Stramigioli, S., Bruyninckx, H., Doherty, P., Kleiner, A., Lippiello, V., Finzi, A., Siciliano, B., Sala, A., Tomatis, N.: The SHERPA project: Smart collaboration between humans and ground-aerial robots for improving rescuing activities in alpine environments. In: Proc. of SSRR 2012, pp. 1–4 (2012)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Rossi, S., Leone, E., Fiore, M., Finzi, A., Cutugno, F.: An extensible architecture for robust multimodal human-robot communication. In: Proc. of IROS 2013, pp. 2208–2213 (Nov 2013)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Villani, V., Sabattini, L., Riggio, G., Secchi, C., Minelli, M., Fantuzzi, C.: A natural infrastructure-less human-robot interaction system. IEEE Robotics and Automation Letters 2(3), 1640–1647 (2017)</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Wobbrock, J.O., Wilson, A.D., Li, Y.: Gestures without libraries, toolkits or training: A $1 recognizer for user interface prototypes. In: Proc. of UIST 2007, pp. 159–168. ACM (2007)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>