=Paper=
{{Paper
|id=Vol-1621/paper3
|storemode=property
|title=Multimedia Responses in Natural Language Dialogues
|pdfUrl=https://ceur-ws.org/Vol-1621/paper3.pdf
|volume=Vol-1621
|authors=Antonio Sorgente,Paolo Vanacore,Antonio Origlia,Enrico Leone,Francesco Cutugno,Francesco Mele
|dblpUrl=https://dblp.org/rec/conf/avi/SorgenteVOLCM16
}}
==Multimedia Responses in Natural Language Dialogues==
Antonio Sorgente, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (a.sorgente@isasi.cnr.it)
Paolo Vanacore, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (p.vanacore@isasi.cnr.it)
Antonio Origlia, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II" & Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (antonio.origlia@unina.it)
Enrico Leone, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy (erik.leone82@gmail.com)
Francesco Cutugno, Dept. of Electrical Engineering and Information Technology, University of Naples "Federico II" & Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (cutugno@unina.it)
Francesco Mele, Institute of Applied Sciences and Intelligent Systems of CNR, Naples, Italy (f.mele@isasi.cnr.it)

ABSTRACT

Offering contents to a visitor in a natural and attractive way is one of the most interesting challenges in promoting cultural heritage. In this paper, we present ongoing research on the design and development of interactive systems based on natural language dialogues to assist a user during a visit to a cultural space. The system's responses contain multimedia elements and are generated from users' queries or from contextual updates associated with their position, so the system can take the initiative in the absence of explicit stimuli. Each response results from a composition process that coherently synchronises media elements with a synthetic voice delivering the textual content. This way, the visitor receives an audio explanation commented by images. To implement this approach, a semantic archive containing the annotation of stories has been built. The formalism used for the annotation is CSWL (Cultural Stories Web Language), which represents cultural stories through events. A case study on the '800 exhibit at the Capodimonte Museum is presented to describe how the system was designed and deployed.

CCS Concepts

• Human-centered computing → Human computer interaction (HCI); Natural language interfaces; Interaction paradigms

Keywords

multimedia composition, spoken dialogue system, cultural heritage

Copyright 2016 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION

The amount of information related to the Cultural Heritage domain that is built by experts and published on the web is growing day by day. With this significant load of information, offering contents to a visitor in a natural and attractive way is one of the most interesting challenges in promoting cultural heritage. In recent years, applications of dialogue systems have been presented in the Cultural Heritage domain to study how to understand users' requests and how to provide adequate responses [?, ?, ?]. Interaction is typically based on textual chats, so both input and output are only textual. With the diffusion of mobile and wearable devices, the possibility of building systems based on spoken language interaction has increased, and devices that support high quality video material offer a further advantage for content presentation. The basic idea is, therefore, to use portable and/or wearable devices to engage in natural language interaction and provide information about a cultural asset. The challenge is not to create a new knowledge base each time a new task is designed, but to exploit the great deal of information about the cultural heritage domain that is already available. The main feature of this approach is to select the appropriate contents related to a cultural item and aggregate them into a single multimedia response.

In this paper, we present a description of the multimedia dialogue system architecture and briefly describe the formalism used for the annotation of stories, which is based on the syncretic model introduced in [?]. We also describe how the system assembles the multimedia response. To provide an example of the overall interaction, we present a dialogue excerpt together with the related multimedia answers.

2. DIALOGUE SYSTEM

The core of the dialogue system is centred on the Opendial framework [?], which provides a flexible environment to design dialogue systems using an XML-based language and can also be extended with customised plugins written in Java. Opendial represents the dialogue state as a set of variables and lets the user define a series of internal models. These are triggered by variable updates and automatically produce reactions in accordance with the observed state.

Although not mandatory, three main models are typically used in an Opendial application: the Natural Language Understanding (NLU) model analyses the user input and maps it onto a set of possible user actions; the Action Selection Model (ASM) associates the user action with the corresponding machine action; and the Natural Language Generation (NLG) model produces spoken content in accordance with the selected machine action. In the proposed framework, we have three separate NLU models to handle different phases of the interaction: the first separates commands to the device (volume control, taking pictures or videos, etc.) from user queries concerning cultural heritage items; the second detects the requests for device-related functions; the third detects incomplete commands and summarises the possible outcomes, so that clarification strategies can be applied to recover the interaction. In this work, we concentrate on the management of responses to user queries detected by the first NLU model.
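To make the role of these models concrete, the following Python sketch shows how an incoming utterance might be routed through the three NLU stages and mapped to a machine action. It is only an illustrative approximation: the keyword cues, labels and function names are hypothetical and do not reproduce the Opendial API or the authors' actual rules.

```python
# Illustrative sketch only: hypothetical routing of a user utterance through
# three NLU stages (device command vs. cultural query vs. incomplete command)
# and a deterministic action-selection step, loosely mirroring the design above.

def nlu_split(utterance: str) -> str:
    """First NLU stage: separate device commands from cultural heritage queries."""
    device_keywords = ("volume", "foto", "video")  # hypothetical cue words
    is_device = any(k in utterance.lower() for k in device_keywords)
    return "device_command" if is_device else "culture_query"

def nlu_device(utterance: str) -> str:
    """Second NLU stage: identify the requested device function."""
    if "volume" in utterance.lower():
        return "set_volume"
    if "foto" in utterance.lower():
        return "take_picture"
    return "unknown_device_action"

def nlu_incomplete(utterance: str) -> bool:
    """Third NLU stage: flag incomplete commands that need clarification."""
    return len(utterance.split()) < 2

def select_action(utterance: str) -> str:
    """Deterministic action selection, standing in for the ASM rules."""
    if nlu_incomplete(utterance):
        return "ask_clarification"
    kind = nlu_split(utterance)
    return nlu_device(utterance) if kind == "device_command" else "answer_culture_query"

print(select_action("Cosa rappresenta il quadro?"))  # -> answer_culture_query
```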
As the system is focused on Italian, a set of plugins to process this language has been included: 1) a plugin to receive the audio stream from the client and transcribe it using Google Speech; 2) a plugin to obtain Part of Speech tags from the TreeTagger [?] tool; 3) a plugin to normalise the utterance, substituting the synonyms of target terms with the target term itself and performing lemmatisation (synsets are obtained from the MultiWordNet database, http://multiwordnet.fbk.eu/english/home.php); 4) a plugin to extract the dependency-based parse tree of the normalised utterance using the Turin University Linguistic Environment (TULE, http://www.tule.di.unito.it), converting from the Turin University Treebank (TUT, http://www.di.unito.it/~tutreeb/) format to the Maltparser (http://www.maltparser.org/) format; this is done because Opendial natively supports Maltparser dependency trees, so it becomes possible to fully program the system's behaviour using the Opendial XML-based language, and it also makes it easier to extend the system to English; 5) a plugin to connect Opendial to the higher-level system handling user queries.

Concerning the last plugin, a communication protocol based on JavaScript Object Notation (JSON) has been adopted. The JSON string contains the multimedia response for the user (see Section 4) and defines the synchronisation of the synthesised text with the media. The syntax of this JSON is similar to the Synchronised Multimedia Integration Language (SMIL, https://www.w3.org/TR/smil/).
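The paper does not report the exact schema of this JSON exchange, so the sketch below shows one plausible SMIL-like composition script, written as a Python dictionary. Every field name here is an assumption made for illustration, not the authors' actual protocol.

```python
import json

# Hypothetical example of a SMIL-like multimedia response script: the spoken text
# plus a list of media items with the time intervals (seconds, relative to the
# start of the synthesised audio) in which each one should be displayed.
# Field names are illustrative assumptions, not the authors' actual protocol.
response = {
    "text": "Il dipinto del pittore romano Vincenzo Camuccini ...",
    "tts_duration": 12.4,
    "media": [
        {"uri": "media/camuccini_portrait.jpg", "begin": 0.0, "end": 4.5},
        {"uri": "media/morte_di_cesare_detail.jpg", "begin": 4.5, "end": 12.4},
    ],
}

print(json.dumps(response, indent=2))
```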
For the implementation of the question answering process that generates the response, a RESTful architecture has been adopted. The main modules used for the interpretation of user requests are: a parser to identify the grammatical structure of the request; a set of semantic services for the detection of event-based semantic concepts; and services for accessing semantic repositories (such as MultiWordNet and Wiktionary, https://www.wiktionary.org/).

Opendial is a probabilistic framework, so, while in the present version we only make use of deterministic rules to manage the dialogue, it is possible to fine-tune the system with a combination of probability estimates and utility functions to plan machine actions and even to estimate the probability of some events in the next interactions. These estimates can be computed on the basis of previous users' interactions with the system, which can be collected either with Wizard-of-Oz approaches or using a first prototype version of the system.

3. CSWL

In this section we briefly present CSWL (Cultural Stories Web Language) [?], a formalism to represent cultural stories published on the Web. These stories concern the life of historical characters, the history of artworks, and architectural structures and their transformation over time. CSWL is an event-based formalism and it defines three types of entities changing over time: simple events, complex events and fluents. The CSWL formalism constitutes the semantic reference used to build the annotation of the texts and media that form an essential part of a natural language dialogue system capable of providing multimedia responses.

Simple Events are represented by four components: When, the time interval in which the event happens; What, the action happening in the event; Where, the location where the event takes place; and Who, the participants in the event. In CSWL, the event component Why is represented by a causal relation between two events; for this reason, such a relation is defined as a complex event.

In CSWL, stories are represented through complex events. A Complex Event is composed of a set of events, a set of causal and temporal relationships between them, and all the properties holding over the time in which the story unfolds. A complex event has the same type of components as a simple event, but its components are computed starting from the elements composing the story.

Each participant in a story (character, archaeological structure, art object, etc.) can be represented through properties, spatial relations, and meronomic relations changing over time. In the same way, the mental states of historical characters can be depicted through their desires, intentions, and beliefs. In CSWL these relations are represented as Fluents (a fluent is the same concept as defined in the Event Calculus [?]).

For our experimentation, we have annotated through CSWL a collection of texts and media related to the '800 exhibit, provided by museum experts. It contains textual information describing 4 museum rooms and 7 artworks, and it also contains 123 media objects linked to the relevant parts of the reference texts. For the annotation process, a graphical tool has been developed.
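To make the structure of these three entity types concrete, the sketch below encodes them as Python dataclasses. The attribute names follow the components listed above, but the encoding is only an illustrative approximation, not the actual CSWL schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative approximation of the three CSWL entity types described above
# (simple events, complex events, fluents); not the actual CSWL schema.

@dataclass
class SimpleEvent:
    when: Tuple[str, str]   # time interval, e.g. ("1804", "1806")
    what: str               # action happening in the event
    where: str              # location of the event
    who: List[str]          # participants in the event

@dataclass
class Fluent:
    subject: str                    # participant the property refers to
    predicate: str                  # e.g. a spatial or meronomic relation, a belief
    holds_during: Tuple[str, str]   # interval over which the property holds

@dataclass
class ComplexEvent:
    events: List[SimpleEvent]
    relations: List[Tuple[str, int, int]] = field(default_factory=list)  # ("causes"/"before", i, j)
    fluents: List[Fluent] = field(default_factory=list)

painting_story = ComplexEvent(
    events=[SimpleEvent(("1804", "1806"), "paints", "Rome", ["Vincenzo Camuccini"])],
)
```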
4. RESPONSE THROUGH MULTIMEDIA SEMANTIC MASHUP

An issue we have dealt with in this work concerns the creation of a multimedia response that temporally synchronises texts and media according to their semantic annotations. This is called a syncretic text [?]: a text capable of organising heterogeneous languages within a unitary communication model [?], with features of cohesion and coherence that refer to the same enunciation instance. This way, the visitor receives an audio explanation commented by images.

The starting point of this process is the selection of the text that will be given in response to the user request. We then associate multimedia resources with this text. The system developed to achieve this goal follows the guidelines described in [?], which can be summarised as follows: 1) a multimedia element concerning a participant can only be displayed after that participant has been uttered in the multimedia composition; 2) for each response, it is preferable to display a multimedia element that represents the current topic; 3) if some significant textual element has no associated multimedia elements, then the visualisation of the previous multimedia elements persists; 4) an expression in the text relating to a totality can be associated with a multimedia element that represents a part of it (a part for the whole), and vice versa (a whole for the part); 5) the duration of all the selected multimedia elements should not, in principle, exceed the enunciation time of the text; 6) each selected multimedia element has to be shown for at least two seconds.

The above guidelines describe good practices for the selection of media and the composition of the multimedia response. They are used to implement constraints on the selection and composition process, as sketched below.
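As an example of how such constraints could be enforced, the following Python sketch validates a candidate media schedule against guidelines 1, 5 and 6 (display an element only after the corresponding entity is uttered, stay within the enunciation time, and show each element for at least two seconds). The data layout is a simplifying assumption, not the authors' implementation.

```python
# Minimal sketch, under assumed data structures, of checking a candidate media
# schedule against guidelines 1, 5 and 6 above; not the authors' implementation.
MIN_DISPLAY = 2.0  # seconds (guideline 6)

def schedule_is_valid(schedule, mention_times, speech_duration):
    """schedule: list of (entity, begin, end); mention_times: entity -> time it is uttered."""
    for entity, begin, end in schedule:
        if begin < mention_times.get(entity, 0.0):   # guideline 1: only after being uttered
            return False
        if end - begin < MIN_DISPLAY:                # guideline 6: at least two seconds
            return False
        if end > speech_duration:                    # guideline 5: within the enunciation time
            return False
    return True

mentions = {"Camuccini": 1.2, "Cesare": 6.8}
candidate = [("Camuccini", 1.5, 6.8), ("Cesare", 6.8, 12.4)]
print(schedule_is_valid(candidate, mentions, speech_duration=12.4))  # True
```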
For the system implementation, we have used the techniques presented in [?]; in particular, in this work we have implemented the module that composes the multimedia response. The structure of the software modules is depicted in Figure 1.

Figure 1: multimedia composition schema

The Textual Response Generator extracts the answer from the text, starting from the user query. The recognition process identifies the components of the query using the same semantics adopted to annotate the story of an artwork. For this purpose, a set of rules based on the relations contained in the dependency tree of the sentence and on contextual information about the user's position has been defined. By analysing the dependency tree, we discover the events from their components: what, where, who and when. The list of answers (events) obtained from the query results is ranked, then the best answer is selected and the corresponding text is chosen as the textual answer.

The Multimedia Selector module selects and ranks the available media that can be associated with the response sentence received from the Textual Response Generator, which returns the sentences annotated with CSWL so that media selection can be based on the annotated entities. The ranking is based on an index calculated by comparing the CSWL annotation of each medium with the annotations of the text: it checks whether media and text share annotations of the same entities, that is, whether entities annotated on the media are cited in the text. The media that pass this phase may still be too many to display, so the Multimedia Selector also takes into account: the history of the previous multimedia responses; the duration of the voice audio; and the minimum visualisation duration of each medium.
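One simple way to realise such an overlap-based ranking is sketched below in Python; the Jaccard-style score over shared annotated entities is our own illustrative choice, since the paper does not specify the exact index.

```python
# Illustrative ranking of candidate media by overlap between the entities
# annotated on each medium and the entities annotated in the response text.
# The Jaccard-style score is an assumption; the paper does not give the formula.
def rank_media(text_entities, media_annotations):
    """media_annotations: dict mapping media URI -> set of annotated entities."""
    text_entities = set(text_entities)
    scores = {}
    for uri, entities in media_annotations.items():
        union = text_entities | entities
        scores[uri] = len(text_entities & entities) / len(union) if union else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

media = {
    "camuccini_portrait.jpg": {"Vincenzo Camuccini"},
    "morte_di_cesare.jpg": {"Vincenzo Camuccini", "Morte di Cesare"},
}
print(rank_media({"Vincenzo Camuccini", "Morte di Cesare"}, media))
```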
The Multimedia Synchroniser module synchronises the media with the synthesised text through a Text To Speech tool, so that media items are coherently visualised within the time intervals in which the synthetic voice talks about the content they represent. The final composition is produced by the Multimedia Response Streamer or by the Multimedia Response Script Builder. The Multimedia Response Streamer merges the selected elements into a single media file and presents it using the FFmpeg tool (https://www.ffmpeg.org/). The Multimedia Response Script Builder, instead, produces a composition script that reports the synchronisation times between the audio texts and the media; this module generates JSON code similar to a SMIL representation.

5. AN EXAMPLE: '800 EXHIBIT AT THE CAPODIMONTE MUSEUM

In this section we present an example of interaction in which the system, in addition to the textual answer, provides a multimedia response. Texts and media for this test have been provided by experts and describe in detail the contents of the '800 exhibit at the Capodimonte Museum.

The interaction model is of the Question & Answering type, but the system is also able to take the initiative in the absence of user queries, and it takes into account the user's movements to suggest new information coherent with what the user expects. The interaction design therefore includes a model of the user's movement that describes where the user is located: in a museum room, near a cultural asset, or in transit areas. For the localisation, a beacon system has been used. In each location, the system can interact with the user, and the interaction phase is modelled on the basis of the information that the system can provide: general information, or deeper information after a user query. The system initiative is activated when the user changes area or by a timer that tracks user inactivity.

In the listing below, we show an excerpt of a dialogue about the '800 exhibit. The topic is the painting Caesar's Death. We use the notation #near(x) to indicate that the user is near the cultural object x, and #elapsed time to express that the user did not provide stimuli for a given time interval; in this case, the system takes the initiative and provides new information. The underlined words in the text are concepts with which multimedia materials are associated and which can therefore be visualised.

Example of Multimedia Dialogue: '800 exhibit.

[1.1] User: #near(la morte di cesare)
[1.2] System: "Il dipinto del pittore romano Vincenzo Camuccini è stato realizzato ad olio su carta poi incollata su tavoletta."
(English) "The painting, made by the Roman artist Vincenzo Camuccini, was produced using oil on paper, then pasted on a small panel."

[2.1] System: #elapsed time
[2.2] System: "Questa versione è stata realizzata probabilmente tra il 1804 e il 1806 durante la lunga fase di gestazione e di vendita della tela grande."
(English) "This version was probably made between 1804 and 1806, during the long period of gestation and sale of the large canvas."

[3.1] User: "Cos'è la tela grande?"
(English) "What is the large canvas?"
[3.2] System: "La Morte di Giulio Cesare è un dipinto a olio su tela realizzato dal pittore Vincenzo Camuccini nel 1798 e conservato nel Museo nazionale di Capodimonte a Napoli."
(English) "La Morte di Giulio Cesare is an oil painting on canvas made by the painter Vincenzo Camuccini in 1798 and held in the National Museum of Capodimonte in Naples."

[4.1] User: "Cosa rappresenta il quadro?"
(English) "What does the painting show?"
[4.2] System: "Il momento raffigurato è quello in cui Cesare, attaccato dai congiurati, cade sotto i colpi dei pugnali, durante la riunione del Senato delle Idi di Marzo, il 15 del 44 a.C."
(English) "The depicted moment is the one in which Caesar, attacked by the conspirators, falls under the blows of the daggers, during the meeting of the Senate on the Ides of March, the 15th, in 44 B.C."

From the example, we can see that, when the user is near the painting "Caesar's Death" [1.1], the system provides generic information about the author and the painting technique [1.2]. Then the user makes no request and stays near the picture, so the system takes the initiative and provides more information [2.2]. The last turns of the dialogue are stimulated by user queries. If we consider the answer [4.2], the system produces the response by associating the relevant media, as shown in Figure 2. The answers of the system are JSON scripts containing the media used and how they are temporally synchronised. The layout depends on the device (PC, smartphone, tablet, glasses): on a smartphone the media are shown as a slide-show, appearing sequentially on the display, while on glasses the images are located in space, so they do not always appear in front of the user but in a specific area of the surrounding space.

Figure 2: example of multimedia response

6. CONCLUSIONS

We have presented a multimedia spoken dialogue system for the Italian language that uses texts and media to produce multimedia responses through a semantic aggregation of such resources. The reasoning engine assembles texts and media annotated with the event-based formalism CSWL. The system architecture is modular and can easily be adapted to include upcoming interaction devices. The current prototype lacks, in this phase of the research, a formal evaluation of the interaction quality, so future work will consist of formal on-site tests to evaluate user satisfaction. Users' feedback will be collected through questionnaires filled out by visitors at the end of each visit session. Controlled tasks will also be defined to measure the usability of the system, and log analysis will be used to study how visitors actually use it.

7. ACKNOWLEDGMENTS

Part of this work is supported by the Italian PAC project Cultural Heritage Emotional Experience See-Through Eyewear (CHEESE).