Multimodal Interaction with Co-located Drones for Search and Rescue

Jonathan Cacace, Alberto Finzi, and Vincenzo Lippiello
Università degli Studi di Napoli Federico II
{jonathan.cacace,alberto.finzi,lippiello}@unina.it

Abstract. We present a multimodal interaction framework that allows a human operator to interact with co-located drones during search and rescue missions. In contrast with usual human-multidrone interaction scenarios, in this case the operator is not fully dedicated to the control of the robots, but directly involved in search and rescue tasks, hence only able to provide fast, although high-value, instructions to the robots. This scenario requires a framework that supports intuitive multimodal communication along with an effective and natural mixed-initiative interaction between the human and the robots. In this work, we describe the domain along with the designed multimodal interaction framework.

Introduction

We present a multimodal interaction framework suitable for human-multidrone interaction in search and rescue missions. This work is framed in the context of the SHERPA project [1, 13], whose goal is to develop a mixed ground and aerial robotic platform supporting search and rescue (SAR) activities in a real-world alpine scenario. One of the peculiar and original aspects of the SHERPA domain is the presence of a special rescue operator, called the busy genius, who is to cooperate with a team of aerial vehicles in order to accomplish the rescue tasks. In contrast with typical human-multidrone interaction scenarios [8, 4, 14], in place of an operator fully dedicated to the drones, we have a rescuer who might be deeply involved in a specific task, hence only able to provide incomplete, although high-value, inputs to the robots. In this context, the human should focus his cognitive effort on relevant and mission-critical activities (e.g. visual inspection, precise maneuvering, etc.), while relying on the robotic autonomous system for more stereotypical tasks (navigation, scan procedures, etc.).

In this work, we illustrate the multimodal and mixed-initiative interaction framework we are designing for this domain. The multimodal interaction system should allow the human to communicate with the robots in a natural, incomplete, but robust manner exploiting gesture-, voice-, or tablet-based commands. The interpretation of the multimodal communication is based on a late fusion classification method that permits a flexible and extensible combination of multiple interaction modalities [18]. In order to communicate with the robots, we assume the human is equipped with a headset and a Myo Gesture Control Armband (https://www.thalmic.com/en/myo/) for speech-based and gesture-based interaction, respectively. For this domain, we introduced a set of multimodal communication primitives suitable for the accomplishment of cooperative search tasks. We consider both command-based and joystick-based interaction metaphors, which can be exploited and smoothly combined to affect the robots' behavior. In order to test the framework and the associated interaction modalities, we developed a test-bed where a human operator is to orchestrate the operations of simulated drones while searching for lost people in an alpine scenario.
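To make the command-based metaphor concrete, the sketch below shows one possible way to represent the multimodal primitives introduced above (drone selection, navigation, search patterns, and modality switching) as structured commands. The Python types, field names, and example values are our own illustrative assumptions and do not reflect the actual SHERPA command representation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class CommandType(Enum):
    """Categories of multimodal primitives discussed in this work."""
    SELECT = auto()       # e.g. "red hawks"
    NAVIGATE = auto()     # e.g. "you go down", deictic "go there"
    SEARCH = auto()       # e.g. search-expanding, search-parallel-track
    SWITCH_MODE = auto()  # command-based <-> joystick-like control


@dataclass
class MultimodalCommand:
    """An abstract command produced by interpreting the operator's input."""
    ctype: CommandType
    addressees: List[str]                            # selected drones
    parameters: dict = field(default_factory=dict)   # e.g. search pattern, target
    confidence: float = 1.0                          # confidence of the interpretation


# Example: the utterance "blue hawks go there" fused with a pointing gesture.
cmd = MultimodalCommand(
    ctype=CommandType.NAVIGATE,
    addressees=["blue_hawk_1", "blue_hawk_2"],
    parameters={"target": (120.0, 45.0)},   # deictic target resolved from the gesture
    confidence=0.87,
)
```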
Multimodal Interaction with Multiple Drones

We designed a modular architecture suitable for supervising and orchestrating the activities of both groups of robots and single robots (see Fig. 1). The operator should be capable of interacting with the system using different modalities (joystick, gestures, speech, tablet, etc.) at different levels of abstraction (task, activity, path, trajectory, motion, etc.). These continuous human interventions are to be suitably and reactively interpreted and integrated into the robotic control loops, providing a natural and intuitive interaction. Feedback from the drones, beyond line-of-sight control, can be provided via a tablet interface, the headphones, and the armband.

Fig. 1. The overall HRI architecture (left) and multimodal framework (right).

Multimodal Interaction. The multimodal interaction block in Fig. 1 (right) allows the human operator to naturally interact with the drones while the system interprets the underlying intention. For instance, while the voice commands concern movement, selection, and exploration, gesture-based communication can be used to specify navigational commands with deictic expressions (e.g. "go there") to the co-located drones. We consider multimodal commands for drone selection and navigation (e.g. "you go down", "red hawks land"), for search (e.g. search-expanding, search-parallel-track, according to helicopter search standards [15, 16, 2]), and for switching modality, i.e. changing the interaction mode and metaphor, e.g. from command-based to joystick-like interactive control. Indeed, gestures can also be used as joystick-like control commands to manually guide the robots or to adjust the execution of specific tasks.

We employ a multimodal interaction framework that exploits a late fusion approach [18], where the single modalities are first separately classified and then combined. Speech recognition is based on Julius [12], a two-pass large vocabulary continuous speech recognition (LVCSR) engine. The proposed gesture recognition system exploits the Thalmic Myo armband. This device can detect and distinguish several poses of the hand from the electrical activity of the muscles of the arm on which the band is worn. In addition, the band is endowed with a 9-DOF IMU for motion capture. In our framework, the pose of the hand is used to enable/disable control modalities/metaphors (switch commands), while the movements of the hand are interpreted as gestures. The fusion module combines the results of vocal and/or gesture recognition into a uniform interpretation. We deploy a late fusion approach that exploits the confidence values associated with the single modalities, i.e. speech and gestures. First of all, the two channels are to be synchronized: we assume that the first channel that becomes active (speech or gesture) starts a time interval during which any activity on the other channel is considered synchronized. When this is the case, contextual rules are used to disambiguate conflicting commands or to combine vocal and gesture inputs, exploiting the information contained in both channels.
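The following is a minimal sketch of the late fusion step just described, assuming that the speech and gesture recognizers each deliver a label with a confidence value. The synchronization window length and the contextual combination rules are illustrative placeholders, not the ones used in the deployed system.

```python
import time
from dataclasses import dataclass
from typing import Optional

SYNC_WINDOW = 1.5  # seconds; an illustrative value, not taken from the paper


@dataclass
class ChannelEvent:
    modality: str      # "speech" or "gesture"
    label: str         # e.g. "go", "land", "point_there"
    confidence: float  # classifier confidence in [0, 1]
    stamp: float       # arrival time


class LateFusion:
    """Minimal late-fusion sketch: the first channel that becomes active opens a
    synchronization window; an event from the other channel arriving inside the
    window is fused with it."""

    def __init__(self) -> None:
        self.pending: Optional[ChannelEvent] = None

    def push(self, modality: str, label: str, confidence: float) -> Optional[dict]:
        now = time.time()
        if self.pending is not None and now - self.pending.stamp > SYNC_WINDOW:
            # Previous window expired: discard the stale event (a real system
            # would interpret it unimodally instead).
            self.pending = None
        event = ChannelEvent(modality, label, confidence, now)
        if self.pending is None:
            self.pending = event          # open a new synchronization window
            return None
        first, self.pending = self.pending, None
        return self._combine(first, event)

    @staticmethod
    def _combine(a: ChannelEvent, b: ChannelEvent) -> dict:
        # Placeholder contextual rules: identical labels reinforce each other, a
        # deictic gesture completes a vocal command, and remaining conflicts are
        # resolved in favour of the more confident channel.
        if a.label == b.label:
            return {"command": a.label, "confidence": max(a.confidence, b.confidence)}
        by_modality = {a.modality: a, b.modality: b}
        if ("speech" in by_modality and "gesture" in by_modality
                and by_modality["gesture"].label == "point_there"):
            speech, gesture = by_modality["speech"], by_modality["gesture"]
            return {"command": speech.label, "target": "deictic",
                    "confidence": min(speech.confidence, gesture.confidence)}
        winner = a if a.confidence >= b.confidence else b
        return {"command": winner.label, "confidence": winner.confidence}
```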
Mixed-Initiative Interaction. The operator is allowed to interact with the drones at any time, at different levels of abstraction, and using different interaction metaphors [17]. The abstract inputs of the human are further interpreted by the mixed-initiative module (MI), which interacts with the single robot supervisor (SRS) and the multirobot supervisor (MRS), mediating between the human and the robotic initiative. For instance, commands can be provided to single robots or to multirobot teams, and sketchy instructions are to be completed and instantiated by the robots' supervisory systems. In our framework, both the MRS and the SRS are endowed with a BDI executive engine [10], and task/path/motion planners are to support mixed-initiative control and sliding autonomy. Following a mixed-initiative planning approach [9, 6, 7], the human interventions are integrated into continuous planning and execution processes that reconfigure the robotic plans of activities according to the human intentions and the operative state.

At a lower level of interaction, the operator can continuously interact with the path/trajectory planners through multimodal inputs. In particular, we distinguish between command-based and joystick-based interaction, using the hand pose to switch between these two modalities. We defined an intuitive switching strategy based on the armband: if the hand is closed, the command-based gesture interpretation is enabled; otherwise, if the hand is open, the joystick-like control is active. In command-based interaction, the human can issue commands to the drones using voice, tablet, or gestures, while in joystick-based interaction the planned trajectory can be directly modified during the execution. For example, the operator can start the execution of a task (e.g. "blue hawks go there") and, during the execution, use the hand-open mode to correct the trajectory. Full hand-open teleoperation can be reached once the current command execution has been stopped with a brake command. Specifically, we deploy an RRT* algorithm [11] for path planning, while trajectory planning is based on a 4th-order spline concatenation preserving continuous acceleration. Following the approach in [5], during the execution the human is allowed to adjust the planned trajectory online without provoking replanning. Moreover, similarly to [5, 3], we assume that the human operator can move the robot within an adaptive workspace, while a replanning process is started when the human operator moves the robot outside this area.
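The sketch below illustrates the switching and adaptive-workspace logic described above: a closed hand routes recognized gestures to the command-based interpreter, an open hand treats hand motion as a joystick-like correction of the planned trajectory, and a correction that leaves the workspace triggers replanning. The pose names, workspace radius, and data layout are our assumptions rather than the parameters of the actual implementation.

```python
import math

WORKSPACE_RADIUS = 5.0  # metres around the planned path (assumed value)


def interaction_mode(hand_pose: str) -> str:
    """Hand closed -> command-based gestures; hand open -> joystick-like control."""
    return "joystick" if hand_pose == "open" else "command"


def apply_operator_input(hand_pose, gesture_cmd, correction, waypoint, planned_point):
    """Route the operator input according to the current interaction mode.

    waypoint, planned_point, and correction are (x, y, z) tuples; gesture_cmd is
    the symbolic command recognized in command mode (or None).
    Returns (new_waypoint, replan_needed, issued_command).
    """
    if interaction_mode(hand_pose) == "command":
        # Command mode: forward the recognized symbolic command, leave the path alone.
        return waypoint, False, gesture_cmd

    # Joystick mode: displace the current waypoint with the hand motion.
    new_wp = tuple(w + c for w, c in zip(waypoint, correction))
    # Adaptive-workspace check: corrections inside the workspace only deform the
    # trajectory; leaving the workspace triggers replanning (cf. [5, 3]).
    dist = math.dist(new_wp, planned_point)
    return new_wp, dist > WORKSPACE_RADIUS, None
```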
Fig. 2. Simulated alpine scenario testbed (left, center) and tablet interface (right).

Testing Scenarios. We are currently testing the effectiveness of the multimodal interaction framework in a simulated alpine scenario (see Fig. 2, left). We used Unity 3D to simulate a set of drones, each equipped with an onboard camera. A tablet-based user interface (see Fig. 2, right) allows the operator to monitor the robots' positions on a map while receiving the video streams from the drones' cameras in multiple windows, with the window associated with the selected drone displayed at a larger size. In this scenario, a set of victims is randomly positioned within the environment, and the mission goal is to find the maximum number of missing persons within a time deadline. This setting allows us to compare the user performance and behavior in different conditions, changing the number of drones, the available modalities, the tablet interface, the time pressure, etc. We are currently testing the overall system; initial results show that the interaction system allows the human to naturally interact with two or three drones with a satisfactory distribution of the search effort.

Acknowledgement

The research leading to these results has been supported by the SHERPA and SAPHARI projects, which have received funding from the European Research Council under Advanced Grant agreement numbers 600958 and 287513, respectively. The authors are solely responsible for its content. It does not represent the opinion of the European Community, and the Community is not responsible for any use that might be made of the information contained therein.

References

1. SHERPA, EU Collaborative Project ICT-600958, http://www.sherpa-project.eu/sherpa/
2. Bernardini, S., Fox, M., Long, D.: Planning the behaviour of low-cost quadcopters for surveillance missions. In: Proc. of ICAPS 2014, pp. 446–453 (2014)
3. Bevacqua, G., Cacace, J., Finzi, A., Lippiello, V.: Mixed-initiative planning and execution for multiple drones in search and rescue missions. In: Proc. of ICAPS 2015, pp. 315–323 (2015)
4. Bitton, E., Goldberg, K.: Hydra: A framework and algorithms for mixed-initiative UAV-assisted search and rescue. In: Proc. of CASE 2008, pp. 61–66 (2008)
5. Cacace, J., Finzi, A., Lippiello, V.: A mixed-initiative control system for an aerial service vehicle supported by force feedback. In: Proc. of IROS 2014, pp. 1230–1235 (2014)
6. Cacace, J., Finzi, A., Lippiello, V., Loianno, G., Sanzone, D.: Aerial service vehicles for industrial inspection: Task decomposition and plan execution. In: Proc. of IEA/AIE 2013 (2013)
7. Cacace, J., Finzi, A., Lippiello, V., Loianno, G., Sanzone, D.: Aerial service vehicles for industrial inspection: Task decomposition and plan execution. Appl. Intell. 42(1), 49–62 (2015)
8. Cummings, M., Bruni, S., Mercier, S., Mitchell, P.J.: Automation architecture for single operator, multiple UAV command and control. The International Command and Control Journal 1(2), 1–24 (2007)
9. Finzi, A., Orlandini, A.: Human-robot interaction through mixed-initiative planning for rescue and search rovers. In: Proc. of AI*IA 2005, pp. 483–494 (2005)
10. Ingrand, F., Georgeff, M., Rao, A.: An architecture for real-time reasoning and system control. IEEE Expert: Intelligent Systems and Their Applications (1992)
11. Karaman, S., Frazzoli, E.: Incremental sampling-based algorithms for optimal motion planning. In: Robotics: Science and Systems (RSS) (2010)
12. Lee, A., Kawahara, T., Shikano, K.: Julius – an open source real-time large vocabulary recognition engine. In: Proc. of EUROSPEECH 2001 (2001)
13. Marconi, L., Melchiorri, C., Beetz, M., Pangercic, D., Siegwart, R., Leutenegger, S., Carloni, R., Stramigioli, S., Bruyninckx, H., Doherty, P., Kleiner, A., Lippiello, V., Finzi, A., Siciliano, B., Sala, A., Tomatis, N.: The SHERPA project: Smart collaboration between humans and ground-aerial robots for improving rescuing activities in alpine environments. In: Proc. of SSRR 2012, pp. 1–4 (2012)
14. Nagi, J., Giusti, A., Gambardella, L., Di Caro, G.A.: Human-swarm interaction using spatial gestures. In: Proc. of IROS 2014, pp. 3834–3841 (2014)
15. NATO: ATP-10 (C). Manual on Search and Rescue. Annex H of Chapter 6 (1988)
16. NATSAR: Australian National Search and Rescue Manual. Australian National Search and Rescue Council (2011)
17. Pfeil, K., Koh, S.L., LaViola Jr., J.J.: Exploring 3D gesture metaphors for interaction with unmanned aerial vehicles. In: Proc. of IUI 2013, pp. 257–266 (2013)
18. Rossi, S., Leone, E., Fiore, M., Finzi, A., Cutugno, F.: An extensible architecture for robust multimodal human-robot communication. In: Proc. of IROS 2013, pp. 2208–2213 (2013)