=Paper=
{{Paper
|id=Vol-2730/paper16
|storemode=property
|title="What’s that called?": a multimodal fusion approach for cultural heritage virtual experiences
|pdfUrl=https://ceur-ws.org/Vol-2730/paper16.pdf
|volume=Vol-2730
|authors=Marco Grazioso,Maria Di Maro,Francesco Cutugno
|dblpUrl=https://dblp.org/rec/conf/psychobit/GraziosoMC20
}}
=="What’s that called?": a multimodal fusion approach for cultural heritage virtual experiences==
“What’s that called?”: A Multimodal Fusion Approach for Cultural Heritage Virtual Experiences

Marco Grazioso¹, Maria Di Maro², and Francesco Cutugno¹

¹ Department of Electrical Engineering and Information Technology, Università degli Studi di Napoli ‘Federico II’, Italy
² Department of Humanities, Università degli Studi di Napoli ‘Federico II’, Italy
{marco.grazioso,maria.dimaro2,cutugno}@unina.it

Abstract. In this paper, a multimodal dialogue system architecture is presented. Its cultural heritage application makes it important to use different channels of communication, so that museum visitors can interact with the system naturally while still enjoying the artistic environment whose exploration the system itself supports. A question answering system for the 3D reconstruction of the Apse and Presbytery of the San Lorenzo Charterhouse (Padula, Salerno) is considered as a case study to demonstrate the capabilities of the proposed system. The implemented multimodal fusion engine is described along with the strategies adopted to involve multiple users in an immersive, interactive environment supporting queries and commands expressed through speech and mid-air gestures. The collected feedback shows that the system was well received by the users.

Keywords: multimodal dialogue · cultural heritage · fusion engine

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Human beings function multimodally. The use of gestures has accompanied human history from its beginning. According to the Gestural Theory of language, human language developed from gestures used to communicate [1]. The fact that both the gestural and the vocal channel can be incomplete or ambiguous in a given situation helps explain why they are so often used together. According to McNeill’s terminology [16], the typical movements that can be recognised in gestures are of four types: i) deictics (or pointing gestures), which connect speech to concrete or abstract referents; ii) iconics, which depict concrete objects or events in discourse; iii) metaphorics, which put abstract ideas into more concrete forms; iv) beats, which have no semantic meaning but are used to structure the discourse. Interestingly, the same cognitive aspects lying behind the production of these visual signals also govern the production of their acoustic counterparts. In fact, words are used to refer to external referents, both concrete and abstract, to describe events and objects, to explain abstract ideas and, together with supra-segmental items, to organise and modulate the discourse. The two communicative codes depend, indeed, on similar neural systems.

Among the aforementioned gesture types, one is of particular interest for us: the pointing gesture. First of all, deictic gestures are the ones used to refer to something, which can be the topic of the communicative exchange and thus the basis for the mutual knowledge that forms the skeleton of the interaction itself [3]. They therefore have a referent identification function and act as a grounding tool. Secondly, these gestures are used as an embodiment tool for cognition [2]. According to the embodied cognition approach, cognitive processes have their roots in motor behaviour: cognition relies on a physical body acting on the environment in which it is immersed [22].
In this perspective, the design of dialogue systems whose interaction is based on the exchange of information about specific referents cannot overlook such natural cognitive-motor means of communication. Multimodality therefore becomes the ultimate goal of this work, which aims to show how different modalities can be fused within a single system architecture. The systems we are interested in are multimodal conversational agents, whose applications are gaining more and more importance. Different studies show how technologies of this kind are being adopted for one of the most traditional human experiences, the museum visit. In fact, the introduction of technological devices offering a virtual experience in the exploration of cultural contents can create more memorable exhibitions [12], and at the same time it can change the way museums are perceived and, consequently, the expectations of users [14]. For this reason, new studies are committed to exploring how museum visitors can be better engaged via these new devices [18], as far as both visual stimuli and engaging communication strategies are concerned.

Concerning multimodality, some scholars [11, 25], in dealing with real environmental issues, contrive strategies which restrict the way users can freely interact. We are therefore interested in investigating and testing alternative approaches to model small-group interactions in real contexts, allowing users to communicate with both verbal and non-verbal actions. Specifically, the main purpose of the presented software architecture is to allow each member of a group of museum visitors to express their requests to our system by interacting multimodally. With the exclusive use of natural human means of communication, i.e. voice, language and gestures, the virtual agent, projected on a curved screen, understands multimodal dialogue acts by users asking for information about artworks and architectural structures contained in a 3D scene.

In the next section, the architecture of the system will be explained, starting from the language modelling for both understanding and generation purposes (Section 2.1). Afterwards, we will focus our attention on the pointing interpretation (Section 2.2) and on the active speaker detection (Section 2.3), before explaining the way the different signals were fused together to give a single interpretation of the user turn (Section 2.4). To conclude, the results of the system evaluation will be presented (Section 3).

2 System Architecture

In this section, the modules used to develop our multimodal system are described in detail. The entire setup of the interaction environment aims at creating an immersive experience for the users. The interactive area therefore consists of a curved screen, 2.5 m high and 4.4 m long, used to project a realistic 3D environment representing the interactive scene. To track users’ movements and their speech signals in real time, a Microsoft Kinect 2 [29] sensor is placed on the floor, at the centre of the screen. Speech recognition is performed using the grammars discussed in Section 2.1 and the Microsoft Speech Platform. Acquired data concerned with user signals represent the input of the game engine (Unreal Engine 4, www.unrealengine.com) used to model the 3D environment. An input recogniser communicates with the Multimodal Dialogue System, which is in charge of understanding user intentions and providing the related responses.
OpenDial is the framework adopted to implement this component [13]; it communicates with the Knowledge Base, designed as a graph database (Neo4j, https://neo4j.com/) [27]. Once a response is retrieved, the Game Engine synthesises the machine utterance by using Mivoq (www.mivoq.it), a TTS engine. In the next subsections, the different knowledge bases (i.e. for the conceptual representation of spaces, natural language understanding and generation) are structurally described. In the construction of the meanings uttered during the interaction, whose context is shared by the interlocutors, other signals gain importance, specifically the pointing interpretation and the active speaker detection. Finally, we will focus on the fusion of the different signals in the modelling of our multimodal system.

2.1 Dialogue Modelling

In this section, the dialogue organisation, as far as the knowledge base is concerned, is presented. The interactional engine of the system is supported with content and linguistic knowledge of the domain under consideration. The knowledge with which the system is provided comprises both a corpus-based grammar for extracting the topic of interest from a user question and a domain-dependent corpus from which the proper answer is extracted as feedback for the human interlocutor.

First of all, a collection of possible questions that a user could pose was carried out. The ad hoc structured survey enabled us to collect about 800 spoken questions divided into 10 different categories. Each category was then modelled in a Speech Recognition Grammar. The choice of this methodology depends on the fact that i) the speed of computation is higher when detecting the right class of a question without the need of running complex algorithms on raw data; ii) the restricted domain of application can be better modelled with a rule-based approach [15, 23]; iii) the process of hand-crafting rules was simplified by the use of a linguistic ontology by means of which semantically related words could be automatically included [5]. Specifically, the Speech Recognition Grammar Specification (SRGS) [9] W3C standard has been developed to allow Automatic Speech Recognition (ASR) engines to output the semantic interpretation of the matched pattern instead of the raw transcription. This is an important advantage for spoken dialogue systems, as they can instruct ASR modules to expect specific word patterns and to present a structured interpretation of the obtained input to the dialogue manager. When the uttered word pattern is not included in one of the rules, the ASR finds the most similar pattern, selected with a specific confidence (if the confidence is too low, the system can be modelled to ask for further clarifications). This results in a reduced latency, as linguistic analysis chains working on raw strings are avoided. This methodology was already tested in a preliminary application for a different case study [5].

In more detail, a speech recognition grammar is a finite set of rules, where each rule, associated with a semantic label, generates a set of utterances. As far as the lexical enrichment of the grammar is concerned, given a collection of semantic relations (i.e. synonymy, hyperonymy and meronymy), the system is capable of expanding each rule in a grammar in order to produce a new grammar, where the new set of generated utterances includes the previous ones plus the additional lexical and morpho-syntactic information [5].
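As an illustration of this expansion step, the following minimal Python sketch shows how a single grammar rule could be enriched with synonyms drawn from a lexical resource. The rule format, the synonym table and the function names are hypothetical illustrations, not the actual implementation, which serialises the result back into SRGS.

```python
# Minimal sketch of rule expansion with lexical relations (hypothetical format).
# A rule maps a semantic label to a set of word patterns; synonyms taken from a
# lexical resource (here a plain dictionary) are used to generate new patterns.

from itertools import product

# Hypothetical synonym table standing in for lookups in a lexical graph database.
SYNONYMS = {
    "altare": ["altare", "mensa"],
    "chiamato": ["chiamato", "denominato", "detto"],
}

def expand_pattern(pattern: str) -> list[str]:
    """Return all variants of a pattern obtained by swapping in synonyms."""
    alternatives = [SYNONYMS.get(tok, [tok]) for tok in pattern.split()]
    return [" ".join(words) for words in product(*alternatives)]

def expand_rule(label: str, patterns: list[str]) -> tuple[str, list[str]]:
    """Expand every pattern of a rule; the semantic label is left untouched."""
    expanded = []
    for p in patterns:
        expanded.extend(expand_pattern(p))
    return label, sorted(set(expanded))

if __name__ == "__main__":
    label, variants = expand_rule("ASK_NAME", ["come viene chiamato questo altare"])
    print(label, variants)
    # The enriched variants would then be serialised into the new SRGS grammar.
```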
For the lexical extension of the differentiated syntactic realisations of the questions in our grammar, we made use of a graph database named MultiWordNet-Extended [17, 5], which contains the Italian lexicon and the semantic relationships between words taken from MultiWordNet [20].

For the Natural Language Generation module we used another graph database, containing textual nodes describing different points of the shown 3D scene. The nodes correspond to the concepts identified in the question classification process and are related to each other by inheritance (in-depth) relationships, when possible. Each node is, moreover, related to another node containing the textual answer to be given by the system. In order to generate the response, the system needs to verify three conditions. Being A a user interacting with the system:

– A is the last speaker.
– A asked a question recognised by the ASR.
– A pointed at one relevant object in the scene.

When all the conditions are true with a considerable probability, the concept relative to the pointed object is used to query the graph database, retrieving the needed information in accordance with the semantic interpretation provided by the natural language understanding module. A more in-depth explanation of multimodal fusion is provided in Section 2.4.

Fig. 1. Reproducing users and their movements in the 3D environment.

2.2 Pointing Interpretation

The interpretation of the referential function of a message is strictly connected to the uttered entities, which represent extralinguistic, domain-related objects in the 3D scene, namely, in our scenario, the artworks or the structural items of the 3D model. These entities can sometimes be ambiguous, since users may have little expertise in the domain or since different items of the 3D model can be referred to in similar ways within the same domain. To overcome this interpretation problem, non-verbal behaviour, such as pointing gestures, can help the disambiguation process. Therefore, in this subsection, we present the way we modelled the pointing recognition function to enable our system to interpret pointing gestures in the multimodal construction of intents by users.

The pointing recognition (PR) task can be divided into two subtasks: a) user positioning in the virtual environment, b) gesture recognition. In our system, the Kinect sensor returns the set of points representing the joints of the human body. Using these points, it is possible to represent the user and his movements in the virtual environment by mapping Kinect joints to a virtual avatar’s joints, as shown in Figure 1. Note that the virtual avatar in Figure 1 is displayed only for demonstration purposes, while, in a real interaction, its visibility is turned off. The Kinect base-of-spine joint position has been used to estimate the user position in relation to the screen. Furthermore, the user height has also been taken into account in order to improve the pointing precision.

After the user representation is obtained, the next step is to recognise pointing activities. We realised this task through a geometrical approach based on Unreal Engine 4 geometrical functions and collision detection. Using the shoulder and hand positions, we emit an invisible ray that ends on an object surface, generating a collision event.
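A minimal sketch of the underlying geometry is given below: it approximates the pointing direction as the shoulder-to-hand vector and intersects it with a vertical screen plane. The joint coordinates, the plane parameters and the function name are illustrative assumptions; in the real system this is performed with Unreal Engine 4 ray casts and collision events rather than an explicit plane intersection.

```python
# Sketch of the pointing geometry: cast a ray from the shoulder through the hand
# and intersect it with an (assumed planar) screen to find the pointed location.

import numpy as np

def pointing_target(shoulder: np.ndarray,
                    hand: np.ndarray,
                    plane_point: np.ndarray,
                    plane_normal: np.ndarray):
    """Return the intersection of the shoulder->hand ray with the screen plane,
    or None if the user is pointing away from the screen."""
    direction = hand - shoulder
    direction = direction / np.linalg.norm(direction)
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-6:          # ray parallel to the screen
        return None
    t = np.dot(plane_normal, plane_point - shoulder) / denom
    if t < 0:                      # screen lies behind the pointing direction
        return None
    return shoulder + t * direction

# Example with made-up coordinates (metres), screen lying on the plane y = 2.
shoulder = np.array([0.0, 0.0, 1.4])
hand = np.array([0.2, 0.4, 1.5])
hit = pointing_target(shoulder, hand,
                      plane_point=np.array([0.0, 2.0, 0.0]),
                      plane_normal=np.array([0.0, 1.0, 0.0]))
print(hit)  # approximate screen point the user is pointing at
```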
The resulting collision event is handled using the semantic maps mechanism combined with the Art and Architecture Thesaurus [8] in order to retrieve the concept label, with its relevance value, associated with the collided point. In this way, the collision event is enriched with conceptual meaning, enabling the system to understand what the user is referring to. To avoid wrong pointing recognition events triggered while the arm merely transits over an area, we considered the arm movement speed to distinguish between transit areas and fixation points.

2.3 Active Speaker Detection

Since museums are generally visited in groups, the system also needs to identify the speaker in order to address the answer to the right interlocutor. The active speaker detection (ASD) module has the responsibility of recognising the user who is actually speaking in a group. Specifically, the objective of this module is to distinguish between environmental noise and speaking acts, and to compute the probability that a user is effectively talking. In this way, the system is able to take into account the gestures produced by the user with the highest probability. Several approaches have been proposed in the literature, using visual features [24] or both video and audio ones [7]. In order to avoid the problems deriving from data-driven approaches (data collection, computational complexity), we adopted a technique that computes the speaking probability considering only the current loudest sound source location and the users’ positions. More in depth, we define $\alpha$ as the angle between the Kinect forward vector and the vector that points to the sound source, and $\beta$ as the angle between the Kinect forward vector and the vector that points to the user. We also define $\Delta(\alpha, \beta)$ as the difference $\alpha - \beta$. Normalising $\Delta(\alpha, \beta)$ in the range $[0, 100]$ and dividing it by 100, we obtain a probability measure formalised as

$P(U_i = \mathrm{True} \mid \theta_S, L_i)$, with $U_i \in \{\mathrm{True}, \mathrm{False}\}$,

where $U_i$ indicates whether the $i$-th user is currently speaking, $\theta_S$ represents the sound source direction and $L_i$ represents the current position of the $i$-th user.
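The short Python sketch below illustrates one way this angle-based measure could be computed. The direction of the mapping (an angular difference of 0 degrees giving probability 1, the largest considered difference giving 0) is our reading of the description above, and the function and parameter names are ours rather than the system’s.

```python
# Sketch of the angle-based speaking probability described above.
# alpha: angle (degrees) between the Kinect forward vector and the sound source.
# beta:  angle (degrees) between the Kinect forward vector and a given user.
# The absolute difference is normalised so that 0 deg -> 1.0 and max_diff -> 0.0.

import math

def speaking_probability(alpha_deg: float, beta_deg: float,
                         max_diff_deg: float = 90.0) -> float:
    """Probability that the user at angle beta is the current speaker,
    given the loudest sound source detected at angle alpha."""
    delta = abs(alpha_deg - beta_deg)
    delta = min(delta, max_diff_deg)            # clamp to the considered range
    normalised = delta / max_diff_deg * 100.0   # rescale to [0, 100] ...
    return 1.0 - normalised / 100.0             # ... and map to a probability

def angle_from_forward(position_xz: tuple[float, float]) -> float:
    """Angle (degrees) of a position in the horizontal plane with respect to the
    sensor forward axis (assumed here to be +z)."""
    x, z = position_xz
    return math.degrees(math.atan2(x, z))

# Example: two tracked users, sound source detected 12 degrees off the forward axis.
users = {"user_1": (0.4, 1.8), "user_2": (-0.9, 2.0)}
alpha = 12.0
for name, pos in users.items():
    beta = angle_from_forward(pos)
    print(name, round(speaking_probability(alpha, beta), 2))
```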
2.4 Multimodal Fusion Engine

In this subsection, we present the approach used to fuse all the previously described modules in order to provide multimodality. The fusion module receives asynchronous messages from the input modules ASD, NLU and PR, handled by specific receivers. The messages are, respectively:

– an ASD message: the current speaker probability for each user;
– an NLU message: the user sentence recognised by the NLU module, with a confidence value;
– a PR message: the pointed object’s semantic labels, with relevance values, for each user.

Received messages cause the update of the corresponding Bayesian network input variables (current speaker variable, verbal act variable, pointing variable). The input fusion process is activated as soon as a user dialogue act is recognised. Input variables are synchronised and propagated through the probabilistic network to derive a common interpretation. To obtain multimodal unification, different formalisms and approaches are adopted in the literature, i.e. statistical approaches [28], salience-based [6] or rule-based approaches [10], and biologically motivated approaches [26]. Here we propose a strategy that defines validation rules for the random variables based on the study discussed in [19]. Several modules collaborate in performing this task, as sketched below. The Multimodal Input Integrator aims at combining input variables coherently. In particular, this module analyses verbal actions, pointed objects and speakers in order to understand the current request. Since the variables evolve in real time, the Multimodal Time Manager is used to check consistency and prune out-of-date variables. In more detail, starting from the time-stamps related to the input variables, once a new speech signal is captured, the module compares its time intervals with those computed for each pointing variable, pruning off pointing gestures whose occurrence concluded more than 4 seconds before the start of the current speech signal. As input variables come asynchronously, the State Monitor directs the entire operation by observing changes in the dialogue state; the unification methods are therefore called by this component as the dialogue progresses.

The next operations are in charge of the Dialogue Manager. This has been implemented using the OpenDial framework [13]. Here, the Dialogue State manages variables and their connections, encoded in a Bayesian network, while the Dialogue System provides the APIs to check and update the model. Once the system has derived the current request, this level provides services to select the most appropriate machine action to be performed.
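As a rough illustration of how these pieces could fit together, the sketch below combines the three input variables (speaker probability, recognised dialogue act, pointed object) and applies the 4-second pruning window. For readability it replaces the OpenDial probabilistic network with simple threshold checks; the thresholds, data structures and names are hypothetical and do not reproduce the actual implementation.

```python
# Illustrative fusion sketch: prune stale pointing gestures, then check that the
# same user is (probably) speaking, asked a recognised question, and pointed at
# a relevant object. Thresholds and structures are hypothetical placeholders.

from dataclasses import dataclass

POINTING_WINDOW_S = 4.0   # gestures ending earlier than this before speech are dropped

@dataclass
class PointingEvent:
    user: str
    concept: str        # semantic label of the pointed object
    relevance: float
    end_time: float     # seconds, end of the gesture

@dataclass
class SpeechAct:
    user_probs: dict    # ASD output: user -> speaking probability
    semantic_label: str # NLU output: question class
    confidence: float
    start_time: float   # seconds, start of the speech signal

def fuse(speech: SpeechAct, pointings: list[PointingEvent],
         speaker_th: float = 0.5, asr_th: float = 0.5, rel_th: float = 0.3):
    """Return (user, question_label, pointed_concept) if the three conditions hold."""
    if speech.confidence < asr_th:
        return None
    # Time manager role: keep only gestures inside the temporal window.
    recent = [p for p in pointings
              if speech.start_time - p.end_time <= POINTING_WINDOW_S]
    # Most probable current speaker according to ASD.
    speaker, prob = max(speech.user_probs.items(), key=lambda kv: kv[1])
    if prob < speaker_th:
        return None
    # Best pointing gesture produced by that speaker.
    candidates = [p for p in recent if p.user == speaker and p.relevance >= rel_th]
    if not candidates:
        return None
    best = max(candidates, key=lambda p: p.relevance)
    return speaker, speech.semantic_label, best.concept

# Example: "user_1" asks what the pointed object is called.
speech = SpeechAct({"user_1": 0.9, "user_2": 0.2}, "ASK_NAME", 0.8, start_time=12.0)
gestures = [PointingEvent("user_1", "altar", 0.8, end_time=10.5)]
print(fuse(speech, gestures))   # -> ('user_1', 'ASK_NAME', 'altar')
```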
3 System Evaluation

In order to evaluate various aspects of the proposed system, an experimental setting was adopted. As this is an ongoing work, a humanoid virtual conversational agent is not yet present in the setup. Specifically, a fixed 3D scene was designed in Unreal Engine 4, showing the user a part of the San Lorenzo Charterhouse, namely the Apse and the Presbytery of the church. A simple question-answering interaction was modelled, making users capable of interacting multimodally with the system in order to obtain information about a small set of objects in the 3D environment. The evaluation was conducted in our laboratory by analysing interactions between the designed system and two users simultaneously. In order to avoid wrong interpretations caused by background noise, the evaluation was conducted in a room to which no persons besides the observer and the evaluated group were admitted. Moreover, a threshold on the input sound signal intensity was established in order to cut off environmental noise. The entire process was divided into the following steps:

1. System presentation: the system functionality is presented to the participants.
2. Training session: a video-clip presentation is shown and a first guided interaction is performed, in order to allow users to get acquainted with the multimodal interface.
3. Task-oriented interaction: users are asked to cooperate in completing a set of assigned tasks to test the system functionality. Interactions are recorded through the Kinect and data logs in order to subsequently compute the success rate of each input recognition. The recorded data can also be used to compose a training set to automatically tune the parameters of the probabilistic network.
4. Free session: users interact with the system for an arbitrary time interval. This phase was useful to collect further data concerning the way users would freely interact with such systems; these interactions, however, were not evaluated. The time spent by users during this phase could be used as an implicit estimation of their satisfaction.

In order to evaluate the system and to identify the principal causes of irregular multimodal fusion, the success rate $SR_i$ was computed for each module $i$ as follows:

$SR_i = \frac{SU_i}{TI} \cdot 100$

where $SU_i$ is the total number of successful interpretations reported by the recogniser of module $i$ and $TI$ is the total number of user interactions.

A total of 6 groups, each composed of 2 persons, was involved in the evaluation, recording the data shown in Table 1. Specifically, a total of 41'05" of task interactions and a total of 76' of training interactions were analysed. In particular, the 133 task interactions were used to estimate the success rate.

ID    TI    ASD          PR           NLU
1     23    21 (91.3%)   22 (95.6%)   18 (78.3%)
2     19    18 (94.7%)   19 (100%)    15 (78.9%)
3     20    19 (95%)     20 (100%)    12 (60%)
4     21    19 (90.4%)   21 (100%)    15 (71.4%)
5     24    21 (87.5%)   23 (95.8%)   15 (62.5%)
6     26    19 (73%)     24 (92.3%)   20 (76.9%)
TOT   133   117 (88%)    129 (97%)    95 (71.4%)

Table 1. Success rates computed for Active Speaker Detection (ASD), Pointing Recognition (PR) and Natural Language Understanding (NLU) during group interactions; TI is the total number of user interactions per group.

The results (Table 1) show that, starting from a correct recognition of each input signal, the probabilistic network designed for the fusion engine is able to derive user requests during multimodal group interactions. The most relevant cases of erroneous inference are caused by wrong input recognition. The Pointing Recognition module shows the best result, with a success rate of 97%. The module that shows the worst behaviour is the Active Speaker Detection module; this result can be related to the users’ tendency to overlap each other during collaborative interactions. ASD performance may nevertheless be improved by combining the sound source angle and the users’ locations in the 3D environment with further features such as the users’ gaze direction and/or lip movements, similar to what is discussed in [21].

4 Conclusion

This paper presents a multi-channel input fusion approach applied in the development of a multimodal dialogue system for cultural heritage virtual experiences. The promising results show that this approach is worth further investigation. For this reason, starting from the architecture proposed in this paper, our purpose is to further improve the performance by extending the system functionality, enriching the content and linguistic knowledge, and applying the pointing recognition to other objects in a more extended 3D scene. Furthermore, we aim at processing new input signals and at modelling multi-party dialogue to improve and promote collaborative interactions between users.

Acknowledgment

This work is funded by the ongoing Italian PRIN project CHROME - Cultural Heritage Resources Orienting Multimodal Experience [4], #B52F15000450001.

References

1. Armstrong, D.F.: The gestural theory of language origins. Sign Language Studies 8(3), 289–314 (2008)
2. Ballard, D., Hayhoe, M., Pook, P., Rao, R.: Deictic codes for the embodiment of cognition (1996)
3. Clark, H.H., Marshall, C.R.: Definite knowledge and mutual knowledge (1981)
4. Cutugno, F., Dell'Orletta, F., Poggi, I., Savy, R., Sorgente, A.: The CHROME manifesto: integrating multimodal data into cultural heritage resources. In: Fifth Italian Conference on Computational Linguistics, CLiC-it (2018)
5. Di Maro, M., Valentino, M., Riccio, A., Origlia, A.: Graph databases for designing high-performance speech recognition grammars. In: IWCS 2017 - 12th International Conference on Computational Semantics, Short papers (2017)
6. Eisenstein, J., Christoudias, C.M.: A salience-based approach to gesture-speech alignment. In: HLT-NAACL 2004: Main Proceedings. pp. 25–32. Association for Computational Linguistics, Boston, Massachusetts, USA (May 2–7, 2004), https://www.aclweb.org/anthology/N04-1004
7. Gebru, I.D., Ba, S., Evangelidis, G., Horaud, R.: Tracking the active speaker based on a joint audio-visual observation model. In: IEEE International Conference on Computer Vision Workshop (ICCVW) (2015)
8. Grazioso, M., Cera, V., Di Maro, M., Origlia, A., Cutugno, F.: From linguistic linked open data to multimodal natural interaction: A case study. In: 2018 22nd International Conference Information Visualisation (IV). pp. 315–320. IEEE (2018)
9. Hunt, A., McGlashan, S.: Speech recognition grammar specification version 1.0. Tech. rep., W3C (2003)
10. Johnston, M.: Unification-based multimodal parsing. In: Proceedings of the 17th International Conference on Computational Linguistics - Volume 1. pp. 624–630. COLING '98, Association for Computational Linguistics, Stroudsburg, PA, USA (1998). https://doi.org/10.3115/980451.980949
11. Kopp, S., Gesellensetter, L., Krämer, N.C., Wachsmuth, I.: A conversational agent as museum guide: design and evaluation of a real-world application. In: Intelligent Virtual Agents. pp. 329–343. Springer (2005)
12. Lepouras, G., Vassilakis, C.: Virtual museums for all: employing game technology for edutainment. Virtual Reality 8(2), 96–106 (2004)
13. Lison, P., Kennington, C.: OpenDial: A toolkit for developing spoken dialogue systems with probabilistic rules. In: Proceedings of ACL 2016 System Demonstrations. pp. 67–72 (2016)
14. Marty, P.F., Jones, K.B.: Museum informatics: People, information, and technology in museums, vol. 2. Taylor & Francis (2008)
15. McGlashan, S., Fraser, N., Gilbert, N., Bilange, E., Heisterkamp, P., Youd, N.: Dialogue management for telephone information systems. In: Proceedings of the Third Conference on Applied Natural Language Processing. pp. 245–246. Association for Computational Linguistics (1992)
16. McNeill, D.: Hand and mind: What gestures reveal about thought. University of Chicago Press (1992)
17. Origlia, A., Paci, G., Cutugno, F.: MWN-E: a graph database to merge morpho-syntactic and phonological data for Italian. In: Proc. of Subsidia, to appear (2017)
18. Othman, M.K., Petrie, H., Power, C.: Engaging visitors in museums with technology: scales for the measurement of visitor and multimedia guide experience. In: IFIP Conference on Human-Computer Interaction. pp. 92–99. Springer (2011)
19. Oviatt, S.L., DeAngeli, A., Kuhn, K.: Integration and synchronization of input modes during multimodal human-computer interaction. In: Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (March 22–27, Atlanta, GA). pp. 415–422. ACM Press, New York (1997)
20. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual database. pp. 293–302 (2002)
21. Richter, V., Carlmeyer, B., Lier, F., Meyer zu Borgsen, S., Schlangen, D., Kummert, F., Wachsmuth, S., Wrede, B.: Are you talking to me? Improving the robustness of dialogue systems in a multi-party HRI scenario by incorporating gaze direction and lip movement of attendees. In: Proceedings of the Fourth International Conference on Human Agent Interaction. pp. 43–50. ACM (2016)
22. Schneegans, S., Schöner, G.: Dynamic field theory as a framework for understanding embodied cognition. In: Handbook of Cognitive Science, pp. 241–271. Elsevier (2008)
23. Serban, I.V., Lowe, R., Henderson, P., Charlin, L., Pineau, J.: A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse 9(1), 1–49 (2018)
24. Stefanov, K., Sugimoto, A., Beskow, J.: Look who's talking: Visual identification of the active speaker in multi-party human-robot interaction. pp. 22–27. Association for Computing Machinery (ACM) (2016)
25. Traum, D., Aggarwal, P., Artstein, R., Foutz, S., Gerten, J., Katsamanis, A., Leuski, A., Noren, D., Swartout, W.: Ada and Grace: Direct interaction with museum visitors. In: International Conference on Intelligent Virtual Agents. pp. 245–251. Springer (2012)
26. Wachsmuth, I.: Communicative rhythm in gesture and speech. In: Proceedings of the International Gesture Workshop on Gesture-Based Communication in Human-Computer Interaction. pp. 277–289. GW '99, Springer-Verlag, London, UK (1999), http://dl.acm.org/citation.cfm?id=647591.728724
27. Webber, J., Robinson, I.: A programmatic introduction to Neo4j. Addison-Wesley Professional (2018)
28. Wu, L., Oviatt, S.L., Cohen, P.R.: Multimodal integration - a statistical view. IEEE Transactions on Multimedia 1, 334–341 (1999)
29. Zhang, Z.: Microsoft Kinect sensor and its effect. IEEE Multimedia 19(2), 4–10 (2012)