=Paper=
{{Paper
|id=Vol-1772/paper4
|storemode=property
|title=Designing Interactive Experiences to Explore Artwork Collections: a Multimedia Dialogue System Supporting Visits in Museum Exhibits
|pdfUrl=https://ceur-ws.org/Vol-1772/paper4.pdf
|volume=Vol-1772
|authors=Antonio Origlia,Enrico Leone,Antonio Sorgente,Paolo Vanacore,Maria Parascandolo,Francesco Mele,Francesco Cutugno
|dblpUrl=https://dblp.org/rec/conf/aiia/OrigliaLSVPMC16
}}
==Designing Interactive Experiences to Explore Artwork Collections: a Multimedia Dialogue System Supporting Visits in Museum Exhibits==
Antonio Origlia (1,2), Enrico Leone (1), Antonio Sorgente (2), Paolo Vanacore (2), Maria Parascandolo (1), Francesco Mele (2), and Francesco Cutugno (1,2)

1 PRISCA-Lab, Federico II University, Naples, Italy, {antonio.origlia, enrico.leone, maria.parascandolo, cutugno}@unina.it
2 Institute of Applied Sciences and Intelligent Systems - CNR, Naples, Italy, {a.sorgente, p.vanacore, f.mele}@isasi.cnr.it

Abstract. Speech and natural language processing have a central role in the implementation of systems designed to make the museum more reactive to users' inputs and to improve the overall interaction quality. In this paper, we present the design and implementation of a dialogue system providing multimedia presentations for museum visits. A corpus of speech recordings in Italian was collected with a mobile application to obtain a reference set of possible ways for users to express their intentions. On the basis of this corpus, a set of recurring syntactic patterns associated with device requests was extracted to let the dialogue system separate device commands from information queries. Context-dependent disambiguation strategies are also applied in the presence of partial syntactic patterns. Information queries are answered by automatically assembling portions of semantically annotated texts and are synchronized with relevant multimedia resources. A case study on the '800 exhibit at the Capodimonte museum in Naples is presented.

Keywords: dialogue systems, cultural heritage

This work is supported by the Italian PAC project Cultural Heritage Emotional Experience See-Through Eyewear (CHEESE).

1 Introduction

Italian has very poor visibility in basic research on spoken dialogue systems. The EVALITA evaluation campaign held in 2009 [1] showed that the state of the art for telephonic systems was limited to the features offered by the VoiceXML standard. Participants in that evaluation campaign were able to set up three different system-initiative dialogue managers in the field of train services (ticketing, booking and timetable queries), but performance was found to be below what is usually obtained for English. Also, recent dialogue systems for Italian equipped with semantic reasoning capabilities were presented in [2-4], but they only consider chat-based interaction. In the passage from telephonic to mobile applications first, and then to generalized spoken language understanding systems, most Italian researchers participated in international projects and mainly worked on languages different from their own. This is the case, for example, of the recently published SPEAKY [5] development environment for robotic vocal interfaces. At the same time, the big companies that were investing in personal assistant mobile apps and similar products extended their native solutions to Italian following industrial procedures that did not give rise to knowledge shared with the scientific community. In this paper, we describe the development of a dialogue system integrated with remote augmented reality interfaces in a cultural heritage setting. We include a brief description of the problems that arise when dealing with delicate environments, like the '800 exhibit in the Capodimonte museum. These pose serious limitations to technological interventions and have an impact on the overall design process.
We describe how these problems were addressed and how the system architecture was implemented.

2 Material

An important problem that arises when working with an environment that requires technology to be non-invasive, like a museum exhibit, is that it is difficult to involve end users in the early steps of system development. Exhibits may not always be open to the general public, estimated visitor attendance can vary due to external conditions and wifi connectivity is not always guaranteed. In our case study, the '800 exhibit at the Capodimonte museum in Naples, the wifi connection presents a problem of its own, as the exhibit is located inside the Bourbon Royal Palace, where walls are very thick and the possibilities of intervention are limited. For this reason, in order to obtain a reference set of possible ways for users to access the system's functions before it is deployed, we used a simple prototype application to collect speech utterances and designed the dialogue system accordingly. The application is implemented as an Android app running on a smartphone in uncontrolled environments. To avoid influencing participants into always producing the same utterances, and to obtain higher expressive variability, we chose to present the scenarios using an iconographic approach. At each step, each participant was prompted with a set of icons representing a specific user request. An example of the VolumeUp scenario is presented in Figure 1. The participant can record the utterance, listen to it and submit it to the remote collection server when she is satisfied. Scenarios cover both device commands (volume control, taking pictures, recording videos...) and content-related queries. If a prompt was not clear, the user could tap the single icons to visualize a single word explaining each icon. This way, no suggestion about how to combine the icons to derive the scenario was given. The users were, in general, able to derive the meaning of the prompt. After a manual check, only 171 utterances were discarded because the users provided inconsistent recordings.

Fig. 1. A screenshot of the mobile app used to collect the reference corpus.

The prompts were randomly presented to the users and each was proposed five times to encourage people to provide multiple ways to ask for the same service, increasing expressive variability. 22 gender-balanced participants with good technological competence were recruited and 17 scenarios were foreseen. A total of 1870 recordings were collected this way. An expert linguist listened to the material and provided the correct transcription to obtain an estimate of the ASR errors. After a manual check, the correct transcription was found as the 1-best in 70% of the cases. In 11% of the cases, the correct transcription was presented as either the 2-best or the 3-best. In the remaining 19% of the cases, the Google ASR engine was not able to provide the correct transcription within the first three hypotheses. This is mainly caused by the variable quality of the recordings and it represents a good approximation of the performance we can expect from the ASR. It also indicates that a good number of cases may be recovered by applying re-ranking techniques. A common error committed by the ASR engine was providing the transcription "Firma l'audioguida" (Sign the audioguide) as the 1-best while the correct "Ferma l'audioguida" (Stop the audioguide) was found as the 2-best. In our context, of course, "Ferma l'audioguida" makes more sense than "Firma l'audioguida" and would be recovered, ideally bringing the system's performance to around 81% correct transcriptions.
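A minimal sketch of such a re-ranking step, assuming the ASR returns an ordered n-best list and that a small in-domain vocabulary is available (here hard-coded; the scoring heuristic below is illustrative and not the one used in the deployed system):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Picks the n-best hypothesis containing the most in-domain words (sketch only). */
public class NBestReranker {

    // Hypothetical in-domain vocabulary; a real system would derive it from the corpus.
    private static final Set<String> DOMAIN_WORDS =
            Set.of("ferma", "avvia", "audioguida", "foto", "registrazione", "quadro", "autore");

    public static String rerank(List<String> nBest) {
        String best = nBest.get(0);          // default: trust the 1-best
        int bestScore = score(best);
        for (String hypothesis : nBest) {
            int s = score(hypothesis);
            if (s > bestScore) {             // prefer hypotheses with more in-domain words
                best = hypothesis;
                bestScore = s;
            }
        }
        return best;
    }

    private static int score(String hypothesis) {
        int hits = 0;
        for (String token : hypothesis.toLowerCase().split("\\s+")) {
            if (DOMAIN_WORDS.contains(token)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // "Firma l'audioguida" (1-best) vs "Ferma l'audioguida" (2-best): the latter wins.
        System.out.println(rerank(Arrays.asList("Firma l'audioguida", "Ferma l'audioguida")));
    }
}
```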
Concerning the question answering system, museum experts provided a collection of texts and media related to the '800 exhibit. This material contains textual information describing 4 museum rooms and 7 artworks, and it also contains 123 media objects linked to the relevant parts of the reference texts. This allows the question answering system to control the timing of potential accompanying media presentations when the answer to a question is assembled.

3 System architecture

In this section, we describe the client-server architecture used to deploy the CHEESE system. Most of the logic is located on the server side, but some issues related to the client and how they were managed are worth mentioning.

Client side. Although the dialogue management is independent from the client interface, limits due to the chosen wearable device influence the configuration of the speech manager. In our case, we use the Epson Moverio BT-200 glasses for augmented reality, which are equipped with Android 4.0.4. This is an important issue as, in this version of Android, no offline recognition support is offered by the system. For this reason, we developed an Android app that continuously listens to the microphone and streams audio towards the server. On the server side, an audio acquisition thread collects the recorded input and applies Voice Activity Detection before connecting to Google Speech to obtain the transcription. To perform audio streaming and segmentation, adintool, which is part of the Julius ASR engine [6], is used. In order to connect the two parts of the system, the audio streaming procedure replicates, in Java, the C++ procedure used by adintool when operating in client mode. As future versions of wearable glasses will be equipped with higher versions of Android supporting offline recognition, the system can also be configured to manage strings instead of audio.

Server side. The server side of the dialogue system is centered on the Opendial framework [7], which provides a flexible environment to design dialogue systems using an XML-based language and can also be extended with customized plugins written in Java. Opendial represents the dialogue state as a set of variables and lets the user define a series of internal models, triggered by variable updates, that automatically produce reactions according to the observed state. Although not mandatory, there are typically three main models in an Opendial application. The Natural Language Understanding (NLU) model analyzes the user input and maps it onto a finite set of possible user actions. The Action Selection Model (ASM) connects the user action to the corresponding machine action. The Natural Language Generation (NLG) model produces spoken content in accordance with the selected machine action. In the CHEESE framework, we have three separate NLU models to handle different moments of the interaction: the first separates commands to the device (volume control, taking pictures or videos, etc.) from user queries concerning cultural heritage items that are part of the considered exhibit; the second detects requests for device-related functions; the third detects incomplete commands and summarizes the possible outcomes so that clarification strategies can be applied to recover the interaction.
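A minimal, self-contained sketch of this division of labour, with hypothetical class and method names (the deployed system expresses these models as Opendial rules rather than Java code); the dispatch order follows the pipeline detailed in Section 4:

```java
/** Sketch of the cascade applied to each transcribed utterance (hypothetical names). */
public class NluCascade {

    enum Outcome { ANSWER_QUERY, RUN_COMMAND, ASK_CLARIFICATION, ABORT }

    // Stub predicates standing in for the three NLU models described above.
    private final WhQuestionModel whModel;
    private final CommandModel commandModel;
    private final RecoveryModel recoveryModel;

    NluCascade(WhQuestionModel w, CommandModel c, RecoveryModel r) {
        this.whModel = w; this.commandModel = c; this.recoveryModel = r;
    }

    /** Query detection first, then full command patterns, then recovery of partial patterns. */
    Outcome dispatch(String utterance) {
        if (whModel.isWhQuestion(utterance)) {
            return Outcome.ANSWER_QUERY;      // hand over to the question answering system
        }
        if (commandModel.matchesFullPattern(utterance)) {
            return Outcome.RUN_COMMAND;       // trigger the corresponding device action
        }
        if (recoveryModel.matchesPartialPattern(utterance)) {
            return Outcome.ASK_CLARIFICATION; // NLG poses a disambiguating question
        }
        return Outcome.ABORT;                 // confirm the transcription, then give up
    }

    interface WhQuestionModel { boolean isWhQuestion(String u); }
    interface CommandModel    { boolean matchesFullPattern(String u); }
    interface RecoveryModel   { boolean matchesPartialPattern(String u); }
}
```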
As the system is focused on Italian, a set of plugins was developed to include, in Opendial, the software tools needed to process this language: a plugin to receive the audio stream from the client and transcribe it using Google Speech; a plugin to obtain POS tags from the Treetagger [8] tool; a plugin to normalize the utterance by substituting synonyms of target terms with the target term itself; a plugin to extract the dependency-based parse tree of the normalised utterance using the Turin University Linguistic Environment (TULE) [9]; and a plugin to connect Opendial to the higher-level system handling user queries.

Concerning the last plugin, a communication protocol based on JavaScript Object Notation (JSON) has been adopted. The JSON string contains the multimedia response for the user and defines the synchronisation of synthesised text and media. Its structure is based on a simplified version of the Synchronized Multimedia Integration Language (SMIL, https://www.w3.org/TR/REC-smil/). The main modules used for the interpretation of user requests are a parser to identify the grammatical structure of a request, a set of semantic services implemented for the detection of semantic concepts in the text, such as events, entities and locations, and services to access semantic repositories such as MultiWordnet [10] and Wiktionary (https://www.wiktionary.org/).

4 Dialogue System

In this section, we describe the dialogue management logic governing the system. The system is modular and applies a pipeline process after receiving an input utterance to evaluate its content and plan a reaction. First of all, the input string is preprocessed to obtain a normalised utterance using the plugins described in the previous section. The dialogue state is then updated considering the incoming utterance and the current position of the user, which is provided externally and is relevant to answer queries like "Chi è l'autore di questo quadro" (Who is the author of this painting). The NLU model dedicated to the detection of WH-questions is the first to run. If this model detects a WH-question, the ASM gives control to the question answering system, otherwise the NLU command detection model is run. If a syntactic pattern associated with a device command is detected, the ASM activates the corresponding device action, otherwise the NLU model for recovery strategies is activated. This model checks whether incomplete syntactic patterns can be detected in the utterance and, if this is the case, the ASM instructs the NLG model to pose an appropriate question to the user to disambiguate the command. This module also attempts to resolve ambiguities based on the current context. If no partial syntactic pattern can be found, the system asks the user to confirm the automatic transcription and, if this is confirmed, aborts the interaction as it is not able to help.

4.1 Command/query separation

In our approach, we have focused on wh-questions (or content questions), i.e. queries containing a question word (a wh-word) such as, for example, 'chi' (who) or 'quando' (when) [11]. For the identification of wh-questions, a set of lexico-syntactic patterns is defined. These are implemented as regular expressions detecting direct and indirect queries and identifying specific linguistic expressions.
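A minimal sketch of one such pattern as a Java regular expression; the wh-word list and the pattern itself are illustrative and do not reproduce the full set of expressions used in the system:

```java
import java.util.regex.Pattern;

/** Illustrative detector for Italian wh-questions (not the system's full pattern set). */
public class WhQuestionDetector {

    // Common Italian wh-words; the deployed patterns also cover indirect questions
    // such as "dimmi chi..." / "vorrei sapere quando...".
    private static final Pattern WH_PATTERN = Pattern.compile(
            "\\b(chi|che cosa|cosa|quando|dove|come|perch[eé]|quale|quali|quant[oaie])\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);

    public static boolean isWhQuestion(String utterance) {
        return WH_PATTERN.matcher(utterance).find();
    }

    public static void main(String[] args) {
        System.out.println(isWhQuestion("Chi ha dipinto la Morte di Cesare")); // true
        System.out.println(isWhQuestion("Ferma l'audioguida"));                // false
    }
}
```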
From the analysis of the reference corpus, we observed that many users do not just make specific requests, but also use general queries such as "dammi altre informazioni" (give me more information) or "mi dici qualcosa sul quadro" (can you tell me something about the picture).

4.2 Device commands management

Starting from the material collected with the procedure described in Section 2, an expert linguist analysed the output of the Treetagger and TULE tools applied to the normalised utterances to obtain the Opendial model dedicated to command recognition. Rules in this model attempt to match recurring syntactic patterns related to specific user requests against the dependency parse tree obtained from the received utterance. The rules consider the presence of a variable-length syntactic dependency, in the tree, between a set of target subtree roots (usually verbs), covering the ones observed in the corpus, and, possibly, a set of target terms (usually nouns) in the tree. For both target roots and terms, it is possible to admit their MWN synsets. The structure of these rules can be summarized as a tuple <R, T, l, S>, where R is a set of target subtree roots, T is a set of optional target terms, l is the maximum length of the dependency chain linking a member of R to a member of T, and S is a subset of the union of R and T containing the terms for which synonyms are accepted. If T is empty, l is always 0 and the terms included in R must be the root of the entire tree for the rule to be matched. Multiple tuples can be associated with a single command to describe different syntactic patterns. For example, the TakePicture command is associated with the tuples

<{scattare, ..., foto}, {}, 0, {scattare, ..., foto}>
<{fare}, {foto}, 2, {foto}>

The first tuple handles the cases in which users use isolated words, like imperative verb forms, to control the device ("scatta!", "fotografa!", "foto!"). The second tuple handles the most natural sentence, "fai una fotografia" (Take a picture), where fotografia is a synonym of foto (photo); it also covers the frequent use of Googlese (talking using keywords) by the users since, in the corpus, utterances like "fai foto" were not uncommon. This is also the reason why dependency types are not included in the rules: we observed that, when users shift to Googlese, the structure of the dependency tree is preserved in most cases but the reported relationship types are erratic. When no command pattern can be matched exactly, the system checks for partial matches of the preceding rules, represented by the presence of target terms in the input utterance, to recover the interaction. In this phase, the system also checks the active processes to reduce the set of possibilities when binary actions are considered. For example, if the target term "registrazione" is detected and the device is already recording a video, the only possible action related to the target term is RecStop, which stops the recording. The system asks for confirmation before performing the action as a safety measure. When the context does not help to reduce the set of possible actions to one, as in the case of "Avvia!" (Start) when neither video recording nor the audioguide is running, the system prompts the user to specify an action among the possible ones to proceed. When system prompts conclude an interaction turn, control is given directly to the system query handler, which completes the interaction by processing the user utterance answering the question.
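A minimal sketch of how an <R, T, l, S> tuple could be matched against a dependency tree, assuming a toy tree representation and a hard-coded synonym table standing in for the MultiWordnet synsets and for the synonym-normalisation plugin; all class names are illustrative:

```java
import java.util.*;

/** Sketch of the <R, T, l, S> command rules matched against a dependency tree. */
public class CommandRule {

    /** Minimal dependency-tree node: a lemma plus its dependents. */
    static class Node {
        final String lemma;
        final List<Node> children = new ArrayList<>();
        Node(String lemma) { this.lemma = lemma; }
        Node add(Node child) { children.add(child); return this; }
    }

    private final Set<String> roots;        // R: target subtree roots (usually verbs)
    private final Set<String> terms;        // T: optional target terms (usually nouns)
    private final int maxChainLength;       // l: maximum dependency-chain length from R to T
    private final Set<String> withSynonyms; // S: members of R and T whose synonyms are accepted

    CommandRule(Set<String> roots, Set<String> terms, int l, Set<String> withSynonyms) {
        this.roots = roots; this.terms = terms;
        this.maxChainLength = l; this.withSynonyms = withSynonyms;
    }

    /** True if the rule matches the dependency tree rooted at treeRoot. */
    boolean matches(Node treeRoot) {
        if (terms.isEmpty()) {
            // If T is empty, a member of R must be the root of the entire tree.
            return covers(roots, treeRoot.lemma);
        }
        // Otherwise look for a node in R with a node in T at most l arcs below it.
        return matchesSubtree(treeRoot);
    }

    private boolean matchesSubtree(Node node) {
        if (covers(roots, node.lemma) && hasTermWithin(node, maxChainLength)) return true;
        for (Node child : node.children) if (matchesSubtree(child)) return true;
        return false;
    }

    private boolean hasTermWithin(Node node, int depth) {
        if (depth < 0) return false;
        if (covers(terms, node.lemma)) return true;
        for (Node child : node.children) if (hasTermWithin(child, depth - 1)) return true;
        return false;
    }

    private boolean covers(Set<String> set, String lemma) {
        if (set.contains(lemma)) return true;
        // Synonym expansion (hypothetical lookup standing in for the MWN synsets).
        for (String target : set)
            if (withSynonyms.contains(target)
                    && SYNONYMS.getOrDefault(target, Set.of()).contains(lemma)) return true;
        return false;
    }

    private static final Map<String, Set<String>> SYNONYMS = Map.of("foto", Set.of("fotografia"));

    public static void main(String[] args) {
        // Second TakePicture tuple from the text: <{fare}, {foto}, 2, {foto}>.
        CommandRule takePicture = new CommandRule(Set.of("fare"), Set.of("foto"), 2, Set.of("foto"));
        // "fai una fotografia": fare -> fotografia, within 2 arcs, via a synonym of "foto".
        Node tree = new Node("fare").add(new Node("fotografia"));
        System.out.println(takePicture.matches(tree));   // true
    }
}
```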
4.3 Question answering

In this section, we introduce the approach used to understand the user's request and generate a response that satisfies it. First, we check whether the sentence is a wh-query. Then, if the check succeeds, a set of rules is applied to detect the topic and the information requested about that topic. Once the type of information request has been discovered, queries are generated to search for answers in a knowledge base. The latter is composed of stories that have been annotated with an event-based formalism. Finally, the retrieved texts and media are composed to generate a single multimedia response to be returned to the user.

For the extraction of the topic and the arguments of the request, we have defined a set of rules based on the relations contained in the dependency tree of the sentence and on contextual information about the user's position. The rules are used to discover the topic (the subject being discussed) and what the visitor asks about it. The recognition process identifies the components of the query using the same semantics adopted to annotate the story of an artwork. To this end, by analysing the dependency tree and the topic of the query, we discover the events from their components: the action (what), the location (where), the participants (who), and the interval of happening (when). For example, "Chi ha dipinto la Morte di Cesare" (Who painted Caesar's Death) is related to the action "dipingere" (paint) and the participant "la Morte di Cesare". Also, the visitor wants to know the author ("Chi", Who). In this example, the rules used to detect the action and the participant are:

verb(S, V) ∧ ¬auxiliary(S, V) → action(S, V)
verb(S, V) ∧ obj(S, V, O) → participant(S, O)

where verb(S, V) means that V is the verb of the sentence S, auxiliary(S, V) means that V is an auxiliary verb of S, and obj(S, V, O) means that O is the object of V in S. In this example, the topic is explicit and corresponds to the object of the verb. If the query has a passive form ("da chi è stata dipinta la Morte di Cesare"), the participant is the subject. There are other rules to discover the location and/or time interval of the event. If the topic is not explicit, it is detected from contextual information by analysing the user's position.

To assemble the multimedia response, we first have to obtain a textual answer. To do this, we adopt a modified version of the system presented in [12]. All the stories and media related to an exhibit are archived in a semantic repository annotated with an event-based formalism; in this work, this formalism is the Cultural Story Web Language (CSWL) [13], which represents cultural stories in terms of events. Starting from the results of query interpretation, we reformulate the user request as a CSWL query and apply query expansion using semantic lexical databases (MultiWordnet and Wiktionary). The list of answers (events) obtained from the query results is ranked and the best answer is expanded with extra events correlated with it, if necessary. Then, the text associated with the selected events and the related media recovered from the repository are assembled as a presentation. The process takes into account the semantic annotation associated with the elements (texts and media) and synchronizes them so that media items are visualised coherently with the time instants at which the synthetic voice is talking about the content they represent [14].
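A minimal sketch of these two rules applied to a toy dependency parse of the example query; the arc representation, POS tags and relation labels are illustrative and do not correspond to TULE's actual output, and the passive-form rule is omitted:

```java
import java.util.List;

/** Sketch of the event-extraction rules for wh-queries (data structures are illustrative). */
public class QueryInterpreter {

    /** One dependency arc: a token, its POS tag, and its relation to the head. */
    record Arc(String token, String pos, String relation) {}

    record Event(String action, String participant) {}

    /**
     * verb(S,V) ∧ ¬auxiliary(S,V) → action(S,V)
     * verb(S,V) ∧ obj(S,V,O)      → participant(S,O)
     */
    static Event interpret(List<Arc> parse) {
        String action = null, participant = null;
        for (Arc arc : parse) {
            if (arc.pos().equals("VERB") && !arc.relation().equals("aux")) action = arc.token();
            if (arc.relation().equals("obj")) participant = arc.token();
        }
        return new Event(action, participant);
    }

    public static void main(String[] args) {
        // "Chi ha dipinto la Morte di Cesare": main verb "dipingere", object "la Morte di Cesare".
        List<Arc> parse = List.of(
                new Arc("chi", "PRON", "subj"),
                new Arc("avere", "AUX", "aux"),
                new Arc("dipingere", "VERB", "root"),
                new Arc("la Morte di Cesare", "NOUN", "obj"));
        Event e = interpret(parse);
        System.out.println(e.action() + " / " + e.participant()); // dipingere / la Morte di Cesare
    }
}
```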
5 Conclusions

We presented the design and implementation of a spoken dialogue system for Italian aimed at assisting the visitors of a museum exhibit. We reported the difficulties encountered in designing an interactive system based on natural language understanding, caused by the particular environment of museums located in historical places, and how we addressed them. We also described the material collected to obtain a first working prototype of the CHEESE system. The system architecture is flexible and modular and can easily be adapted to future updates brought by upcoming technologies. Also, spoken dialogue systems for Italian are not common and our contribution explores an area of recent interest for interactive environments, cultural heritage, which may find relevant applications in Italy. Future work will consist in collecting on-site feedback to evaluate the system and in taking full advantage of the probabilistic environment offered by Opendial to fine-tune the system.

References

1. Baggia, P., Cutugno, F., Danieli, M., Pieraccini, R., Quarteroni, S., Riccardi, G., Roberti, P.: The multi-site 2009 EVALITA spoken dialog system evaluation. In: Proc. of EVALITA (2009)
2. Sorgente, A., Brancati, N., Giannone, C., Zanzotto, F.M., Mele, F., Basili, R.: Chatting to personalize and plan cultural itineraries. In: UMAP Workshops (2013)
3. Stock, O.: Language-based interfaces and their application for cultural tourism. AI Magazine 22(1) (2001) 85
4. Stock, O., Zancanaro, M.: PEACH - Intelligent interfaces for museum visits. Springer Science & Business Media (2007)
5. Bastianelli, E., Nardi, D., Aiello, L.C., Giacomelli, F., Manes, N.: Speaky for robots: the development of vocal interfaces for robotic applications. Applied Intelligence 44(1) (2015) 43-66
6. Lee, A., Kawahara, T.: Recent development of open-source speech recognition engine Julius. In: Proc. of APSIPA ASC (2009) 131-137
7. Lison, P.: A hybrid approach to dialogue management based on probabilistic rules. Computer Speech & Language 34(1) (2015) 232-255
8. Schmid, H.: Improvements in part-of-speech tagging with an application to German. In: Proc. of the ACL SIGDAT-Workshop (1995) 47-50
9. Lesmo, L.: The rule-based parser of the NLP group of the University of Torino. Intelligenza Artificiale 2(4) (2007) 46-47
10. Pianta, E., Bentivogli, L., Girardi, C.: Developing an aligned multilingual database. In: Proc. 1st Intl. Conference on Global WordNet (2002)
11. Rossano, F.: Questioning and responding in Italian. Journal of Pragmatics 42(10) (2010) 2756-2771
12. Mele, F., Sorgente, A.: Semantic mashups of multimedia cultural stories. Intelligenza Artificiale 6(1) (2012) 19-40
13. Mele, F., Sorgente, A.: CSWL - un formalismo per rappresentare storie culturali nel web. Technical Report 180/15, Inst. of Cybernetics "E. Caianiello", CNR (2015)
14. Sorgente, A., Vanacore, P., Origlia, A., Leone, E., Cutugno, F., Mele, F.: Multimedia responses in natural language dialogues. In: Proceedings of AVI*CH 2016. Volume 1621 of CEUR Workshop Proceedings, CEUR-WS.org (2016) 15-18