=Paper=
{{Paper
|id=Vol-1621/paper3
|storemode=property
|title=Multimedia Responses in Natural Language Dialogues
|pdfUrl=https://ceur-ws.org/Vol-1621/paper3.pdf
|volume=Vol-1621
|authors=Antonio Sorgente,Paolo Vanacore,Antonio Origlia,Enrico Leone,Francesco Cutugno,Francesco Mele
|dblpUrl=https://dblp.org/rec/conf/avi/SorgenteVOLCM16
}}
==Multimedia Responses in Natural Language Dialogues==
Antonio Sorgente, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (a.sorgente@isasi.cnr.it)
Paolo Vanacore, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (p.vanacore@isasi.cnr.it)
Antonio Origlia, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II" & Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (antonio.origlia@unina.it)
Enrico Leone, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy (erik.leone82@gmail.com)
Francesco Cutugno, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II" & Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (cutugno@unina.it)
Francesco Mele, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (f.mele@isasi.cnr.it)

ABSTRACT

Offering contents to a visitor in a natural and attractive way is one of the most interesting challenges in promoting cultural heritage. In this paper, we present ongoing research on the design and development of interactive systems based on natural language dialogues to assist a user during a visit to a cultural space. The system's responses contain multimedia elements and are generated from users' queries or from contextual updates associated with their position, so the system can take the initiative in the absence of explicit stimuli. Each response results from a composition process that coherently synchronises media elements with a synthetic voice delivering the textual content. This way, the visitor receives an audio explanation commented by images. To implement this approach, a semantic archive containing the annotation of stories has been built. The formalism used for the annotation is CSWL (Cultural Stories Web Language), which represents cultural stories through events. A case study on the '800 exhibit at the Capodimonte Museum is presented to describe how the system was designed and deployed.

CCS Concepts

• Human-centered computing → Human computer interaction (HCI); Natural language interfaces; Interaction paradigms

Keywords

multimedia composition, spoken dialogue system, cultural heritage

Copyright 2016 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION

The amount of information related to the Cultural Heritage domain that is built by experts and published on the web is growing day by day. With this significant load of information, offering contents to a visitor in a natural and attractive way is one of the most interesting challenges in promoting cultural heritage. In recent years, applications of dialogue systems have been presented in the Cultural Heritage domain to study how to understand users' requests and how to provide adequate responses [?, ?, ?]. Interaction is typically based on textual chats, so both input and output are only textual. With the diffusion of mobile and wearable devices, the possibility of building systems based on spoken language interaction has increased, and devices that support high quality video material offer a further advantage for content presentation. The basic idea is, therefore, to use portable and/or wearable devices to engage in natural language interaction and provide information about a cultural asset. The challenge is not to create a new knowledge base each time a new task is designed, but to exploit the great deal of information about the cultural heritage domain that is already available. The main feature of this approach is to select the appropriate contents related to a cultural item and aggregate them into a single multimedia response.

In this paper, we present a description of the multimedia dialogue system architecture and briefly describe the formalism used for the annotation of stories, which is based on the syncretic model introduced in [?]. We also describe how the system assembles the multimedia response. To provide an example of the overall interaction, we present a dialogue excerpt together with the related multimedia answers.

2. DIALOGUE SYSTEM

The core of the dialogue system is centred on the Opendial framework [?], which provides a flexible environment to design dialogue systems using an XML-based language and can also be extended with customised plugins written in Java. Opendial represents the dialogue state as a set of variables and lets the user define a series of internal models. These are triggered by variable updates and automatically produce reactions in accordance with the observed state.

Although not mandatory, three main models are typically used in an Opendial application: the Natural Language Understanding (NLU) model analyses the user input and maps it onto a set of possible user actions; the Action Selection Model (ASM) associates the user action with the corresponding machine action; and the Natural Language Generation (NLG) model produces spoken content in accordance with the selected machine action. In the proposed framework, we have three separate NLU models to handle different phases of the interaction: the first separates commands to the device (volume control, taking pictures or videos, etc.) from user queries concerning cultural heritage items; the second detects the requests for device-related functions; the third detects incomplete commands and summarises the possible outcomes, so that clarification strategies can be applied to recover the interaction. In this work, we concentrate on the management of responses to user queries detected by the first NLU model.
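To make the role of these models concrete, the following Python sketch shows how an incoming utterance might be routed through the three NLU stages and mapped to a machine action. It is only an illustrative approximation: the keyword cues, labels and function names are hypothetical and do not reproduce the Opendial API or the authors' actual rules.

```python
# Illustrative sketch only: hypothetical routing of a user utterance through
# three NLU stages (device command vs. cultural query vs. incomplete command)
# and a deterministic action-selection step, loosely mirroring the design above.

def nlu_split(utterance: str) -> str:
    """First NLU stage: separate device commands from cultural heritage queries."""
    device_keywords = ("volume", "foto", "video")  # hypothetical cue words
    is_device = any(k in utterance.lower() for k in device_keywords)
    return "device_command" if is_device else "culture_query"

def nlu_device(utterance: str) -> str:
    """Second NLU stage: identify the requested device function."""
    if "volume" in utterance.lower():
        return "set_volume"
    if "foto" in utterance.lower():
        return "take_picture"
    return "unknown_device_action"

def nlu_incomplete(utterance: str) -> bool:
    """Third NLU stage: flag incomplete commands that need clarification."""
    return len(utterance.split()) < 2

def select_action(utterance: str) -> str:
    """Deterministic action selection, standing in for the ASM rules."""
    if nlu_incomplete(utterance):
        return "ask_clarification"
    kind = nlu_split(utterance)
    return nlu_device(utterance) if kind == "device_command" else "answer_culture_query"

print(select_action("Cosa rappresenta il quadro?"))  # -> answer_culture_query
```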
As the system is focused on Italian, a set of plugins to process this language has been included: 1) a plugin to receive the audio stream from the client and transcribe it using Google Speech; 2) a plugin to obtain Part of Speech tags from the TreeTagger [?] tool; 3) a plugin to normalise the utterance, substituting the synonyms of target terms with the target term itself and performing lemmatisation (synsets are obtained from the MultiWordNet database, http://multiwordnet.fbk.eu/english/home.php); 4) a plugin to extract the dependency-based parse tree of the normalised utterance using the Turin University Linguistic Environment (TULE, http://www.tule.di.unito.it), converting from the Turin University Treebank (TUT, http://www.di.unito.it/~tutreeb/) format to the Maltparser (http://www.maltparser.org/) format; this is done because Opendial natively supports Maltparser dependency trees, so it becomes possible to fully program the system's behaviour using the Opendial XML-based language, and it also makes it easier to extend the system to English; 5) a plugin to connect Opendial to the higher-level system handling user queries.

Concerning the last plugin, a communication protocol based on JavaScript Object Notation (JSON) has been adopted. The JSON string contains the multimedia response for the user (see Section 4) and defines the synchronisation of the synthesised text with the media. The syntax of this JSON is similar to the Synchronised Multimedia Integration Language (SMIL, https://www.w3.org/TR/smil/).
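The paper does not report the exact schema of this JSON exchange, so the sketch below shows one plausible SMIL-like composition script, written as a Python dictionary. Every field name here is an assumption made for illustration, not the authors' actual protocol.

```python
import json

# Hypothetical example of a SMIL-like multimedia response script: the spoken text
# plus a list of media items with the time intervals (seconds, relative to the
# start of the synthesised audio) in which each one should be displayed.
# Field names are illustrative assumptions, not the authors' actual protocol.
response = {
    "text": "Il dipinto del pittore romano Vincenzo Camuccini ...",
    "tts_duration": 12.4,
    "media": [
        {"uri": "media/camuccini_portrait.jpg", "begin": 0.0, "end": 4.5},
        {"uri": "media/morte_di_cesare_detail.jpg", "begin": 4.5, "end": 12.4},
    ],
}

print(json.dumps(response, indent=2))
```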
For the implementation of the question answering process that generates the response, a RESTful architecture has been adopted. The main modules used for the interpretation of user requests are: a parser to identify the grammatical structure of the request; a set of semantic services for the detection of event-based semantic concepts; and services for accessing semantic repositories (such as MultiWordNet and Wiktionary, https://www.wiktionary.org/).

Opendial is a probabilistic framework, so, while in the present version we only make use of deterministic rules to manage the dialogue, it is possible to fine-tune the system with a combination of probability estimates and utility functions to plan machine actions and even to estimate the probability of some events in the next interactions. These estimates can be computed on the basis of previous users' interactions with the system, which can be collected either with Wizard-of-Oz approaches or using a first prototype version of the system.

3. CSWL

In this section we briefly present CSWL (Cultural Stories Web Language) [?], a formalism to represent cultural stories published on the Web. These stories concern the life of historical characters, the history of artworks, and architectural structures and their transformation over time. CSWL is an event-based formalism and it defines three types of entities changing over time: simple events, complex events and fluents. The CSWL formalism constitutes the semantic reference used to build the annotation of the texts and media that form an essential part of a natural language dialogue system capable of providing multimedia responses.

Simple Events are represented by four components: When, the time interval in which the event happens; What, the action happening in the event; Where, the location where the event takes place; and Who, the participants in the event. In CSWL, the event component Why is represented by a causal relation between two events; for this reason, such a relation is defined as a complex event.

In CSWL, stories are represented through complex events. A Complex Event is composed of a set of events, a set of causal and temporal relationships between them, and all the properties holding over the time in which the story unfolds. A complex event has the same type of components as a simple event, but its components are computed starting from the elements composing the story.

Each participant in a story (character, archaeological structure, art object, etc.) can be represented through properties, spatial relations, and meronomic relations changing over time. In the same way, the mental states of historical characters can be depicted through their desires, intentions, and beliefs. In CSWL these relations are represented as Fluents (a fluent is the same concept as defined in the Event Calculus [?]).

For our experimentation, we have annotated through CSWL a collection of texts and media related to the '800 exhibit, provided by museum experts. It contains textual information describing 4 museum rooms and 7 artworks, and it also contains 123 media objects linked to the relevant parts of the reference texts. For the annotation process, a graphical tool has been developed.
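To make the structure of these three entity types concrete, the sketch below encodes them as Python dataclasses. The attribute names follow the components listed above, but the encoding is only an illustrative approximation, not the actual CSWL schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative approximation of the three CSWL entity types described above
# (simple events, complex events, fluents); not the actual CSWL schema.

@dataclass
class SimpleEvent:
    when: Tuple[str, str]   # time interval, e.g. ("1804", "1806")
    what: str               # action happening in the event
    where: str              # location of the event
    who: List[str]          # participants in the event

@dataclass
class Fluent:
    subject: str                    # participant the property refers to
    predicate: str                  # e.g. a spatial or meronomic relation, a belief
    holds_during: Tuple[str, str]   # interval over which the property holds

@dataclass
class ComplexEvent:
    events: List[SimpleEvent]
    relations: List[Tuple[str, int, int]] = field(default_factory=list)  # ("causes"/"before", i, j)
    fluents: List[Fluent] = field(default_factory=list)

painting_story = ComplexEvent(
    events=[SimpleEvent(("1804", "1806"), "paints", "Rome", ["Vincenzo Camuccini"])],
)
```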
4. RESPONSE THROUGH MULTIMEDIA SEMANTIC MASHUP

An issue we have dealt with in this work concerns the creation of a multimedia response that temporally synchronises texts and media according to their semantic annotations. This is called a syncretic text [?]: a text capable of organising heterogeneous languages within a unitary communication model [?], with features of cohesion and coherence that refer to the same enunciation instance. This way, the visitor receives an audio explanation commented by images.

The starting point of this process is the selection of the text that will be given in response to the user request. We then associate multimedia resources with this text. The system developed to achieve this goal follows the guidelines described in [?], which can be summarised as follows: 1) a multimedia element concerning a participant can only be displayed after that participant has been uttered in the multimedia composition; 2) for each response, it is preferable to display a multimedia element that represents the current topic; 3) if some significant textual element has no associated multimedia elements, then the visualisation of the previous multimedia elements persists; 4) an expression in the text relating to a totality can be associated with a multimedia element that represents a part of it (a part for the whole), and vice versa (a whole for the part); 5) the duration of all the selected multimedia elements should not, in principle, exceed the enunciation time of the text; 6) each selected multimedia element has to be shown for at least two seconds.

The above guidelines describe good practices for the selection of media and the composition of the multimedia response. They are used to implement constraints on the selection and composition process, as sketched below.
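As an example of how such constraints could be enforced, the following Python sketch validates a candidate media schedule against guidelines 1, 5 and 6 (display an element only after the corresponding entity is uttered, stay within the enunciation time, and show each element for at least two seconds). The data layout is a simplifying assumption, not the authors' implementation.

```python
# Minimal sketch, under assumed data structures, of checking a candidate media
# schedule against guidelines 1, 5 and 6 above; not the authors' implementation.
MIN_DISPLAY = 2.0  # seconds (guideline 6)

def schedule_is_valid(schedule, mention_times, speech_duration):
    """schedule: list of (entity, begin, end); mention_times: entity -> time it is uttered."""
    for entity, begin, end in schedule:
        if begin < mention_times.get(entity, 0.0):   # guideline 1: only after being uttered
            return False
        if end - begin < MIN_DISPLAY:                # guideline 6: at least two seconds
            return False
        if end > speech_duration:                    # guideline 5: within the enunciation time
            return False
    return True

mentions = {"Camuccini": 1.2, "Cesare": 6.8}
candidate = [("Camuccini", 1.5, 6.8), ("Cesare", 6.8, 12.4)]
print(schedule_is_valid(candidate, mentions, speech_duration=12.4))  # True
```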
For the system implementation, we have used the techniques presented in [?]; in particular, in this work we have implemented the module that composes the multimedia response. The structure of the software modules is depicted in Figure 1.

Figure 1: multimedia composition schema

The Textual Response Generator extracts the answer from the text, starting from the user query. The recognition process identifies the components of the query using the same semantics adopted to annotate the story of an artwork. For this purpose, a set of rules based on the relations contained in the dependency tree of the sentence and on contextual information about the user's position has been defined. By analysing the dependency tree, we discover the events from their components: what, where, who and when. The list of answers (events) obtained from the query results is ranked, then the best answer is selected and the corresponding text is chosen as the textual answer.

The Multimedia Selector module selects and ranks the available media that can be associated with the response sentence received from the Textual Response Generator, which returns the sentences annotated with CSWL so that media selection can be based on the annotated entities. The ranking is based on an index calculated by comparing the CSWL annotation of each medium with the annotations of the text: it checks whether media and text share annotations of the same entities, that is, whether entities annotated on the media are cited in the text. The media that pass this phase may still be too many to display, so the Multimedia Selector also takes into account: the history of the previous multimedia responses; the duration of the voice audio; and the minimum visualisation duration of each medium.
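One simple way to realise such an overlap-based ranking is sketched below in Python; the Jaccard-style score over shared annotated entities is our own illustrative choice, since the paper does not specify the exact index.

```python
# Illustrative ranking of candidate media by overlap between the entities
# annotated on each medium and the entities annotated in the response text.
# The Jaccard-style score is an assumption; the paper does not give the formula.
def rank_media(text_entities, media_annotations):
    """media_annotations: dict mapping media URI -> set of annotated entities."""
    text_entities = set(text_entities)
    scores = {}
    for uri, entities in media_annotations.items():
        union = text_entities | entities
        scores[uri] = len(text_entities & entities) / len(union) if union else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

media = {
    "camuccini_portrait.jpg": {"Vincenzo Camuccini"},
    "morte_di_cesare.jpg": {"Vincenzo Camuccini", "Morte di Cesare"},
}
print(rank_media({"Vincenzo Camuccini", "Morte di Cesare"}, media))
```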
The Multimedia Synchroniser module synchronises the media with the synthesised text through a Text To Speech tool, so that media items are coherently visualised within the time intervals in which the synthetic voice talks about the content they represent. The final composition is produced by the Multimedia Response Streamer or by the Multimedia Response Script Builder. The Multimedia Response Streamer merges the selected elements into a single media file and presents it using the FFmpeg tool (https://www.ffmpeg.org/). The Multimedia Response Script Builder, instead, produces a composition script that reports the synchronisation times between the audio texts and the media; this module generates JSON code similar to a SMIL representation.

5. AN EXAMPLE: '800 EXHIBIT AT THE CAPODIMONTE MUSEUM

In this section we present an example of interaction in which the system, in addition to the textual answer, provides a multimedia response. Texts and media for this test have been provided by experts and describe in detail the contents of the '800 exhibit at the Capodimonte Museum.

The interaction model is of the Question & Answering type, but the system is also able to take the initiative in the absence of user queries, and it takes into account the user's movements to suggest new information coherent with what the user expects. The interaction design therefore includes a model of the user's movement that describes where the user is located: in a museum room, near a cultural asset, or in transit areas. For the localisation, a beacon system has been used. In each location, the system can interact with the user, and the interaction phase is modelled on the basis of the information that the system can provide: general information, or deeper information after a user query. The system initiative is activated when the user changes area or by a timer that tracks user inactivity.

In the listing below, we show an excerpt of a dialogue about the '800 exhibit. The topic is the painting Caesar's Death. We use the notation #near(x) to indicate that the user is near the cultural object x, and #elapsed time to express that the user did not provide stimuli for a given time interval; in this case, the system takes the initiative and provides new information. The underlined words in the text are concepts with which multimedia materials are associated and which can therefore be visualised.

Example of Multimedia Dialogue: '800 exhibit.

[1.1] User: #near(la morte di cesare)
[1.2] System: "Il dipinto del pittore romano Vincenzo Camuccini è stato realizzato ad olio su carta poi incollata su tavoletta."
(English) "The painting, made by the Roman artist Vincenzo Camuccini, was produced using oil on paper, then pasted on a small panel."

[2.1] System: #elapsed time
[2.2] System: "Questa versione è stata realizzata probabilmente tra il 1804 e il 1806 durante la lunga fase di gestazione e di vendita della tela grande."
(English) "This version was probably made between 1804 and 1806, during the long period of gestation and sale of the large canvas."

[3.1] User: "Cos'è la tela grande?"
(English) "What is the large canvas?"
[3.2] System: "La Morte di Giulio Cesare è un dipinto a olio su tela realizzato dal pittore Vincenzo Camuccini nel 1798 e conservato nel Museo nazionale di Capodimonte a Napoli."
(English) "La Morte di Giulio Cesare is an oil painting on canvas made by the painter Vincenzo Camuccini in 1798 and held in the National Museum of Capodimonte in Naples."

[4.1] User: "Cosa rappresenta il quadro?"
(English) "What does the painting show?"
[4.2] System: "Il momento raffigurato è quello in cui Cesare, attaccato dai congiurati, cade sotto i colpi dei pugnali, durante la riunione del Senato delle Idi di Marzo, il 15 del 44 a.C."
(English) "The depicted moment is the one in which Caesar, attacked by the conspirators, falls under the blows of the daggers, during the meeting of the Senate on the Ides of March, the 15th, in 44 B.C."

From the example, we can see that, when the user is near the painting "Caesar's Death" [1.1], the system provides generic information about the author and the painting technique [1.2]. Then the user makes no request and stays near the picture, so the system takes the initiative and provides more information [2.2]. The last turns of the dialogue are stimulated by user queries. If we consider the answer [4.2], the system produces the response by associating the relevant media, as shown in Figure 2. The answers of the system are JSON scripts containing the media used and how they are temporally synchronised. The layout depends on the device (PC, smartphone, tablet, glasses): on a smartphone the media are shown as a slide-show, appearing sequentially on the display, while on glasses the images are located in space, so they do not always appear in front of the user but in a specific area of the surrounding space.

Figure 2: example of multimedia response

6. CONCLUSIONS

We have presented a multimedia spoken dialogue system for the Italian language that uses texts and media to produce multimedia responses through a semantic aggregation of such resources. The reasoning engine assembles texts and media annotated with the event-based formalism CSWL. The system architecture is modular and can easily be adapted to include upcoming interaction devices. The current prototype lacks, in this phase of the research, a formal evaluation of the interaction quality, so future work will consist of formal on-site tests to evaluate user satisfaction. Users' feedback will be collected through questionnaires filled out by visitors at the end of each visit session. Controlled tasks will also be defined to measure the usability of the system, and log analysis will be used to study how visitors actually use it.

7. ACKNOWLEDGMENTS

Part of this work is supported by the Italian PAC project Cultural Heritage Emotional Experience See-Through Eyewear (CHEESE).