Robust Multimodal Command Interpretation for Human-Multirobot Interaction

Jonathan Cacace, Alberto Finzi, and Vincenzo Lippiello
Università degli Studi di Napoli Federico II
{jonathan.cacace,alberto.finzi,lippiello}@unina.it

Abstract. In this work, we propose a multimodal interaction framework for robust human-multirobot communication in outdoor environments. In these scenarios, several human or environmental factors can cause errors, noise, and wrong interpretations of the commands. The main goal of this work is to improve the robustness of human-robot interaction systems in such situations. In particular, we propose a multimodal fusion method based on the following steps: for each communication channel, unimodal classifiers are first deployed in order to generate unimodal interpretations of the human inputs; the unimodal outcomes are then grouped into different multimodal recognition lines, each representing a possible interpretation of a sequence of multimodal inputs; these lines are finally assessed to recognize the human commands. We discuss the system at work in a real-world case study in the SHERPA domain.

Introduction

In this work, we tackle the problem of robust multimodal communication between a human operator and a team of robots during the execution of a shared task in outdoor environments. In these scenarios, the robots should be able to respond to the operator's commands in a timely manner, minimizing the chances of misunderstanding due to noise or user errors. This crucial problem is well illustrated by the domain of the SHERPA project [10, 3], whose goal is to develop a mixed ground and aerial robotic platform supporting search and rescue (SAR) activities in an alpine scenario. One of the peculiar aspects of the SHERPA domain is the presence of a special rescue operator, called the busy genius, who cooperates with a team of aerial vehicles in order to accomplish search and rescue missions. In this context, the human operator is not fully dedicated to the control of the robots, but is also involved in the rescue operations. On the other hand, he/she can exploit light wearable devices to orchestrate the robotic team operations in a multimodal manner, using voice- and gesture-based commands, in order to enable a fast and natural interaction with the robots. This scenario challenges the command recognition system, since the environment is unstructured and noisy, the human is under pressure, and the commands are issued in a fast and sparse manner. In order to support the operator in such scenarios, a robust and reliable multimodal recognition system is a crucial component. In multimodal interaction frameworks [8, 2, 4, 7, 12], multimodal fusion is a key issue, and different strategies have been proposed [1] to combine the data provided by multiple input channels (gestures, speech, gaze, body postures, etc.). Analogously to [11, 9], in order to make the interaction robust, extensible, and natural, we adopt a late fusion approach where the multimodal inputs provided by the human are first processed by dedicated unimodal classifiers (gesture recognition, speech recognition, etc.) and then recognized by combining their outcomes. In this setting, multimodal data are usually first synchronized and then interpreted according to rules or other classification methods.
In contrast with these solutions, in this work we propose a novel multimodal fusion approach that avoids explicit synchronization among the incoming multimodal data and is robust with respect to several sources of errors, from human mistakes (e.g., delays in utterances or gestures, wrong or incomplete sequencing) and environmental disturbances (e.g., wind, external noise), to unimodal classification failures. The main idea behind the approach is to continuously assess multiple ways of combining the incoming multimodal inputs in order to obtain a subset of events that best represents a human multimodal command. In particular, command recognition is performed in two decision steps. In the first one, we generate multiple hypotheses on multimodal data association given a Bayesian model of the way the user invokes commands. For this purpose, we estimate the probability that new samples are related to others already received. Then, in a second step, a Naive Bayes classifier is deployed to select the most plausible command given the possible data associations provided by the previous step.

Multimodal Human-Robot Interaction Architecture

In Figure 1(a) we illustrate the human-multirobot architecture. The human operator interacts with the robotic platform using different communication channels (i.e., Voice, Arm Gestures, Touch Gestures, and Hand Poses) by means of his/her wearable devices. In particular, the operator exploits a headset to issue vocal commands, a motion and gesture control armband (Myo Thalmic Armband, https://www.myo.com/), and a mobile device (tablet) with a touch-based user interface. The multimodal interaction system (MHRI) should then interpret these commands, passing them to the Distributed Multi-Robot Task Allocation (DMRTA) module (see [5] for details). In this work, we focus on the MHRI, describing the multimodal command recognition system illustrated in Figure 1(b). Raw device data are directly sent to and simultaneously elaborated by the unimodal classifiers C0, ..., Cn in order to generate the unimodal samples si. These samples are then received by the Multimodal Fusion module, which generates different recognition lines {L0, ..., Lm} exploiting the Bayesian Network and the Training Set. Each multimodal command is successively interpreted as a user command by the Command Classification module.

(a) Human-Robot Interaction architecture. (b) Multimodal Recognition System.

Command Recognition

Multimodal command recognition relies on a late fusion approach in which heterogeneous inputs provided by the user through different channels are first classified by unimodal recognizers and then fused together in order to be interpreted as human commands. More specifically, given a sequence of inputs S generated by the unimodal classifiers, the command recognition problem consists in finding the command c that maximizes the probability P(c|S). This problem is here formulated as follows. We assume a set C = {c0, c1, ..., ck} of commands invokable by the operator. Each command is issued in a multimodal manner, hence it is associated with a sequence of unimodal inputs S = {s0, ..., sn}, each represented by the triple si = (wi, chi, ti), where wi ∈ W is the label provided by the unimodal classifier associated with channel chi ∈ I, and ti ∈ R+ is its time of arrival.
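To make this formalization concrete, the following Python fragment is a minimal sketch (not taken from the paper) of the data involved: each unimodal sample is a (label, channel, arrival time) triple, and commands come from a finite set C. The channel names, command names, and timestamps are purely illustrative assumptions.

# Minimal sketch of the input representation: a sample s_i = (w_i, ch_i, t_i).
from dataclasses import dataclass

@dataclass
class UnimodalSample:
    word: str      # label w_i produced by a unimodal classifier
    channel: str   # input channel ch_i (e.g. "speech", "arm_gesture", "touch", "hand_pose")
    time: float    # arrival time t_i in seconds

# Example: fragments of a hypothetical "go down" command issued by voice and arm gesture.
samples = [
    UnimodalSample("go", "speech", 12.10),
    UnimodalSample("down", "arm_gesture", 12.35),
]
COMMANDS = ["take_off", "land", "go_down", "select_drone"]  # hypothetical command set C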
In our approach, the user commands are interpreted in two decision steps: first, the outputs of the unimodal classifiers are fused together (Multimodal Fusion); then, the fused hypotheses are assessed and recognized as user commands (Command Recognition).

Multimodal Fusion. The multimodal fusion step allows the system to select and group together unsynchronized inputs provided by the unimodal classifiers and associated with the same current command. For this purpose, in correspondence to the input sequence S of unimodally classified data, we generate different possible subsets of elements, called Recognition Lines, each representing a possible way to associate these inputs to the invoked command. Therefore, during the command interpretation process, different Recognition Lines are generated and collected into a Recognition Set in order to be interpreted in the second step. These multiple ways of grouping the inputs allow the proposed framework to fuse unsynchronized unimodal inputs in a robust fashion, coping with disturbances such as environmental noise, command invocation errors, or failures of the single unimodal recognition subsystems. The Recognition Line generation process works as follows. First of all, for each new input, a new Recognition Line containing only this datum is generated; then the incoming datum is also assessed in order to be included into the other Recognition Lines already available in the Recognition Set. In order to assign an input sample to a Recognition Line, we rely on a Bayesian Network (BN) suitably trained to infer the probability that a new incoming unimodal sample sn belongs to a Recognition Line given the samples s0, ..., si already associated with the same line. Specifically, the BN proposed in this work consists of three nodes (see Figure 1(c)): the Word node, which contains the list of input data in the recognition line; the Channel node, which stands for the input channels; and the Line node, which represents the probability that the new incoming sample belongs to the considered line. In this setting, a received input sample is associated with a recognition line if the probability of belonging to that line is greater than a suitable threshold τ1 and the temporal distance of the received sample sr with respect to the previous one sp on the same line is within a specific interval (|t_sr − t_sp| < γ).

Command Recognition. In the command recognition phase, the previously generated Recognition Lines are to be interpreted as user commands. Our approach exploits a Naive Bayes classifier to associate each element of the Recognition Set with a label and a score representing, respectively, the recognized command class and its classification probability. More specifically, given a sequence of samples S = {s0, ..., sn}, the list of semantic labels Sw = {w0, ..., wn} is extracted. Given the list of possible commands c0, ..., ck, the class ĉ and its score are assessed through the formula:

    ĉ = arg max_{c ∈ C} P(c) ∏_{i=1}^{|Sw|} p(c | wi).

Once all the Recognition Lines have been classified, the line with the maximum score is selected as the recognized user command (see Figure 1(d)). Also in this case, a command is properly recognized only if the probability returned by the Naive Bayes classifier is higher than a trained threshold τ2.

(c) Bayesian Network for multimodal command fusion. (d) Recognition lines and scores.
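The following Python sketch illustrates the two decision steps just described, reusing the UnimodalSample dataclass from the previous fragment. It is a hedged reconstruction, not the authors' code: the trained Bayesian Network is abstracted as a callable line_membership_prob, the Naive Bayes tables prior and cond are assumed to come from training, and TAU1, TAU2, GAMMA are placeholder values for the trained thresholds τ1, τ2, γ.

# Hedged sketch of the two-step recognition: Multimodal Fusion and Command Recognition.
import math

TAU1, TAU2, GAMMA = 0.5, 0.6, 2.0   # hypothetical thresholds: probabilities and seconds

def update_recognition_set(recognition_set, sample, line_membership_prob):
    """Step 1 (Multimodal Fusion): grow the Recognition Set with a new sample.
    recognition_set: list of recognition lines, each a list of samples.
    line_membership_prob(line, sample): probability, inferred by the trained BN,
    that `sample` belongs to `line` given the samples already on it."""
    for line in recognition_set:
        close_in_time = abs(sample.time - line[-1].time) < GAMMA
        if close_in_time and line_membership_prob(line, sample) > TAU1:
            line.append(sample)           # the sample joins an existing line
    recognition_set.append([sample])      # a new line containing only this sample is always created

def classify_line(line, prior, cond):
    """Step 2 (Command Recognition): score one line with the rule
    c_hat = argmax_c P(c) * prod_i term(c, w_i), following the formula in the text."""
    scores = {
        c: prior[c] * math.prod(cond[c].get(s.word, 1e-6) for s in line)
        for c in prior
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

def recognize(recognition_set, prior, cond):
    """Classify every line and accept the best-scoring one only above the threshold."""
    if not recognition_set:
        return None
    command, score = max(
        (classify_line(line, prior, cond) for line in recognition_set),
        key=lambda cs: cs[1],
    )
    return command if score > TAU2 else None

In this sketch, words unknown to a command class receive a small floor probability (1e-6) so that a single misrecognized label does not zero out an otherwise plausible recognition line; the paper does not specify how such cases are handled.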
System Training. The multimodal system is trained exploiting a Training Set that collects, for each command invocation: the requested command coupled with the generated samples, the channel associated with each sample, and the elapsed time between the samples. In this way, the Bayesian Network for multimodal fusion is trained with the list of pairs (wi, chi) of each command invocation in the dataset, while the command recognition system is trained with the list of labels (wi) of the samples used to interpret the user commands. Moreover, once the multimodal fusion system has been trained, a final training session is needed to adapt the thresholds (τ1, τ2, γ). This is obtained by asking the users to validate both the generated Recognition Lines and the associated classification results.

SHERPA Case Study

The proposed system has been demonstrated and tested in a real alpine environment. In order to communicate with the robotic platforms, the operator is equipped with wearable devices: a standard headset and a mobile device (tablet), along with a gesture/motion control bracelet. Speech recognition is based on the PocketSphinx software (http://wiki.ros.org/pocketsphinx), adopting a bag-of-words model instead of the more commonly used context-free grammars. Grammar-based models exploit the word ordering in the sentence, which is not reliable in our setting: the user can accidentally skip words, because the interaction is sparse and incomplete, or the recognizer can fail to catch words, because the environment is noisy. In contrast, we adopt a less restrictive model where the recognized sentences are represented as bags of words, which are then further processed in the late fusion step of the multimodal recognition system described above. Gesture-based commands are used to control the robotic team via complete or complementary information (i.e., pointing or mimicking gestures). We designed and implemented a continuous gesture recognition module based on the approach by [13]. Gesture classification is here based on the acceleration of the operator's arm, which is detected by a lightweight IMU-based bracelet. We defined 14 different types of gestures used to invoke high-level actions (i.e., directional movements, circles, approaching, etc.). These gestures have been trained using a dataset that collects gestures from 30 users, each providing 10 trials of each gesture class. The operator is also able to issue commands by drawing 2D gestures on a touch user interface (see Figure 1(e)). In this case, areas to explore, trajectories, or target points can be specified using geometrical shapes like circles, squares, or lines, possibly paired with voice information. The operator can also specify commands, or parts of them, using hand poses. The hand pose recognition system is implemented exploiting the built-in Myo Armband classifier, which is able to discriminate five different hand poses from EMG sensors, namely Double Tap, Spread, Wave Left, Wave Right, and Make Fist. As for the user dataset, we mainly focus on commands suitable for interacting with a set of co-located drones during navigation and search tasks. Namely, selection commands enable the operator to select single robots or groups of robots; for this purpose the operator can issue speech (e.g., "all drones take off", "red drone land"), speech and gestures in combination (e.g., "you go down"), including touch gestures on the user interface. Similar combinations of modalities can be exploited to invoke motion and search commands during navigation and exploration tasks.

(e) Touch Screen User Interface: in red, an area to explore; in green, a path to navigate. (f) Human operator interacting with multiple drones in a snow-clad field.
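The sketch below, again purely illustrative, shows two pieces implied by this section: how a recognized sentence could be turned into unordered bag-of-words samples for the fusion step, and how the Naive Bayes tables used earlier could be estimated from the Training Set described at the start of this section (one entry per invocation: the requested command plus the generated labels). The function names, the Laplace smoothing parameter alpha, and the choice to estimate the per-word terms as class-conditional frequencies are assumptions; the paper writes p(c|wi) and does not give estimation details.

# Hedged sketch: bag-of-words speech front end and Naive Bayes training from the Training Set.
from collections import Counter, defaultdict
import time

def speech_to_samples(sentence, channel="speech", now=None):
    """Turn a recognized sentence into unordered bag-of-words samples, one per word,
    so that word order is never relied upon downstream."""
    now = time.time() if now is None else now
    return [UnimodalSample(word.lower(), channel, now) for word in sentence.split()]

def train_naive_bayes(training_set, alpha=1.0):
    """training_set: list of (command, [word labels]) pairs, one per invocation.
    Returns the (prior, cond) tables used by classify_line above, with the per-word
    terms estimated as Laplace-smoothed class-conditional frequencies (an assumption)."""
    cmd_counts = Counter(cmd for cmd, _ in training_set)
    word_counts = defaultdict(Counter)
    vocab = set()
    for cmd, words in training_set:
        word_counts[cmd].update(words)
        vocab.update(words)
    total = sum(cmd_counts.values())
    prior = {c: cmd_counts[c] / total for c in cmd_counts}
    cond = {c: {w: (word_counts[c][w] + alpha) /
                   (sum(word_counts[c].values()) + alpha * len(vocab))
                for w in vocab}
            for c in cmd_counts}
    return prior, cond

# Toy example with the hypothetical commands introduced earlier:
prior, cond = train_naive_bayes([
    ("go_down", ["go", "down", "point_down"]),
    ("take_off", ["all", "drones", "take", "off"]),
])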
System Training. The overall system requires three training sessions. The first one is related to the setup of the unimodal classifiers. The second training phase concerns the multimodal fusion engine: it requires the Training Set introduced above, exploited by the system to learn how the operator generates commands, that is, how he/she composes the unimodal samples to invoke commands. Notice that in our scenario the operator is an expert rescuer already aware of the system and the operative domain; therefore we trained the system with 4 trained users (involved in the research project), asking them to repeat 45 commands 10 times each. The collected data are then used to train both the multimodal fusion and the command recognition systems. A final training phase is needed to tune the τ1 and τ2 thresholds.

System Testing. The robotic platform setup and the scenario are analogous to those described in [5]. The testing site is the one depicted in Figure 1(f). In this context, we collected data from 14 different missions, lasting about 15 minutes each and performed over two different days. A more extended description and discussion of these tests can be found in [6]; here we only summarize the main results about the robustness of the system under noisy communication. Specifically, we collected data about 107 commands (708 samples), achieving a success rate of 96.8%, even though more than half of the samples generated by the users were marked as mistakes and rejected by the multimodal fusion algorithm (66.9% rejected samples); among these, 74.3% were correctly rejected from the recognition line exploited for multimodal classification.

Acknowledgement

The research leading to these results has been supported by the FP7-ICT-600958 SHERPA, ERC AdG-320992 RoDyMan, and H2020-ICT-731590 REFILLs projects. The authors are solely responsible for its content. It does not represent the opinion of the European Community, and the Community is not responsible for any use that might be made of the information contained therein.

References

1. Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16(6), 345–379 (2010)
2. Bannat, A., Gast, J., Rehrl, T., Rösel, W., Rigoll, G., Wallhoff, F.: A multimodal human-robot-interaction scenario: Working together with an industrial robot. In: Human-Computer Interaction. Novel Interaction Methods and Techniques, 13th International Conference, HCI International 2009, San Diego, CA, USA, July 19-24, 2009, Proceedings, Part II. pp. 303–311 (2009)
3. Bevacqua, G., Cacace, J., Finzi, A., Lippiello, V.: Mixed-initiative planning and execution for multiple drones in search and rescue missions. In: Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling. pp. 315–323. ICAPS'15, AAAI Press (2015)
4. Burger, B., Ferrané, I., Lerasle, F., Infantes, G.: Two-handed gesture recognition and fusion with speech to command a robot. Autonomous Robots 32(2), 129–147 (2012)
5. Cacace, J., Finzi, A., Lippiello, V., Furci, M., Mimmo, N., Marconi, L.: A control architecture for multiple drones operated via multimodal interaction in search and rescue mission. In: Proc. of SSRR 2016. pp. 233–239 (Oct 2016)
6. Cacace, J., Finzi, A., Lippiello, V.: A robust multimodal fusion framework for command interpretation in human-robot cooperation. In: 26th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN 2017, Lisbon, Portugal, August 28 - September 1, 2017. pp. 372–377 (2017)
7. Dumas, B., Lalanne, D., Oviatt, S.L.: Multimodal interfaces: A survey of principles, models and frameworks. In: Lalanne, D., Kohlas, J. (eds.) Human Machine Interaction, Lecture Notes in Computer Science, vol. 5440, pp. 3–26. Springer (2009)
8. Holzapfel, H., Nickel, K., Stiefelhagen, R.: Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures. In: Proc. of ICMI 2004. pp. 175–182. ACM (2004)
9. Lucignano, L., Cutugno, F., Rossi, S., Finzi, A.: A dialogue system for multimodal human-robot interaction. In: Proc. of ICMI 2013. pp. 197–204. ACM (2013)
10. Marconi, L., Melchiorri, C., Beetz, M., Pangercic, D., Siegwart, R., Leutenegger, S., Carloni, R., Stramigioli, S., Bruyninckx, H., Doherty, P., Kleiner, A., Lippiello, V., Finzi, A., Siciliano, B., Sala, A., Tomatis, N.: The SHERPA project: Smart collaboration between humans and ground-aerial robots for improving rescuing activities in alpine environments. In: Proc. of SSRR 2012. pp. 1–4 (2012)
11. Rossi, S., Leone, E., Fiore, M., Finzi, A., Cutugno, F.: An extensible architecture for robust multimodal human-robot communication. In: Proc. of IROS 2013. pp. 2208–2213 (Nov 2013)
12. Villani, V., Sabattini, L., Riggio, G., Secchi, C., Minelli, M., Fantuzzi, C.: A natural infrastructure-less human-robot interaction system. IEEE Robotics and Automation Letters 2(3), 1640–1647 (2017)
13. Wobbrock, J.O., Wilson, A.D., Li, Y.: Gestures without libraries, toolkits or training: A $1 recognizer for user interface prototypes. In: Proc. of UIST 2007. pp. 159–168. ACM (2007)