<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>"What's that called?": A Multimodal Fusion Approach for Cultural Heritage Virtual Experiences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Grazioso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Di Maro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Cutugno</string-name>
          <email>cutugnog@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering and Information Technology, Università degli Studi di Napoli 'Federico II'</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Humanities, Università degli Studi di Napoli 'Federico II'</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, a multimodal dialogue system architecture is presented. Its cultural heritage application makes it important to use different channels of communication, so that museum visitors can interact naturally with the system while still enjoying the artistic environment whose exploration the system itself supports. A question answering system for the 3D reconstruction of the Apse and Presbytery of the San Lorenzo Charterhouse (Padula, Salerno) is considered as a case study to demonstrate the capabilities of the proposed system. The implemented multimodal fusion engine is described along with the strategies adopted to involve multiple users in an immersive, interactive environment supporting queries and commands expressed through speech and mid-air gestures. The collected feedback shows that the system was well received by its users.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal dialogue</kwd>
        <kwd>cultural heritage</kwd>
        <kwd>fusion engine</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Human beings function multimodally. Gestures have accompanied human history from its beginnings. According to the Gestural Theory of language, human language developed from gestures used to communicate [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The alleged situational, ambiguity-based incompleteness of both the gestural and the vocal channel can explain their joint adoption. In McNeill's terminology [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the typical movements recognised in gestures are of four types: i) deictics (or pointing gestures), which connect speech to concrete or abstract referents; ii) iconics, which depict concrete objects or events in discourse; iii) metaphorics, which put abstract ideas into more concrete forms; iv) beats, which have no semantic meaning but are used to structure the discourse. Interestingly, the same cognitive processes lying behind the production of these visual signals govern the production of their acoustic counterparts: words are used to refer to external referents, both concrete and abstract, to describe events and objects, to explain abstract ideas and, together with supra-segmental items, to organise and modulate the discourse. The two communicative codes depend, indeed, on similar neural systems.
      </p>
      <p>
        Among the aforementioned gesture types, one is of particular interest to us: the pointing gesture. First of all, deictic gestures are the ones used to refer to something, which can be the topic of the communicative exchange, representing the basis for the mutual knowledge that is the skeleton of the interaction itself [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They therefore have a referent identification function and act as a grounding tool. Secondly, these gestures are an embodiment tool for cognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to the embodied cognition approach, cognitive processes have their roots in motor behaviour: cognition relies on a physical body acting on the environment in which it is immersed [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. In this perspective, the design of dialogue systems whose interaction is based on the exchange of information about specific referents cannot overlook such natural cognitive-motor means of communication. Multimodality becomes, therefore, the ultimate goal of this work, which aims to show how different modalities can be fused within a single system architecture.
      </p>
      <p>
        The systems we are interested in are multimodal conversational agents, whose applications are gaining more and more importance. Different studies show how technologies of this kind are being adopted for one of the most traditional human experiences: the museum visit. The introduction of technological devices offering a virtual experience in the exploration of cultural contents can create more memorable exhibitions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and, at the same time, change the way museums are perceived and, consequently, users' expectations [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For this reason, new studies explore how museum visitors can be better engaged via these new devices [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], as far as both visual stimuli and engaging communication strategies are concerned.
      </p>
      <p>
        Concerning multimodality, some scholars [
        <xref ref-type="bibr" rid="ref11 ref25">11, 25</xref>
        ], in dealing with real environmental issues, contrive strategies that restrict how users can freely interact. We are therefore interested in investigating and testing alternative approaches to modelling small-group interactions in real contexts, allowing users to communicate with both verbal and non-verbal actions. Specifically, the main purpose of the presented software architecture is to allow each member of a group of museum visitors to express requests to our system by interacting multimodally. Using exclusively natural human means of communication, i.e. voice, language and gestures, the virtual agent, projected on a curved screen, understands multimodal dialogue acts by users asking for information about artworks and architectural structures contained in a 3D scene.
      </p>
      <p>In the next section, the architecture of the system is explained, starting from the language modelling for both understanding and generation purposes (Section 2.1). Afterwards, we focus on pointing interpretation (Section 2.2) and active speaker detection (Section 2.3), before explaining how the different signals are fused together to give a single interpretation of the user turn (Section 2.4). To conclude, the results of the system evaluation are presented (Section 3).</p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>
        In this section, the modules used to develop our multimodal system are described in detail. The entire setup of the interaction environment aims at creating an immersive experience for the users. The interactive area therefore consists of a 2.5 m high and 4.4 m long curved screen, used to project a realistic 3D environment representing the interactive scene. To track users' movements and their speech signals in real time, a Microsoft Kinect 2 [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] sensor is placed on the floor, at the centre of the screen. Speech recognition is performed using the grammars discussed in Section 2.1 and the Microsoft Speech Platform. Acquired data concerning user signals represent the input of the game engine (Unreal Engine 4, www.unrealengine.com) used to model the 3D environment. An input recogniser communicates with the Multimodal Dialogue System, which is in charge of understanding user intentions and providing the related responses. OpenDial is the framework adopted to implement this component [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which communicates with the Knowledge Base designed through a graph database (Neo4j, https://neo4j.com/ [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]). Once a response is retrieved, the game engine synthesises the machine utterance using Mivoq (www.mivoq.it), a TTS engine. In the next subsections, the different knowledge bases (i.e. for the conceptual representation of spaces, natural language understanding and generation) are structurally described. In the construction of the meanings uttered during the interaction, whose context is shared by the interlocutors, other signals gain importance, specifically pointing interpretation and active speaker detection. Finally, we focus on the fusion of the different signals in the modelling of our multimodal system.
      </p>
      <sec id="sec-2-1">
        <title>Dialogue Modelling</title>
        <p>In this section, the dialogue organisation, as far as the knowledge base is concerned, is presented. The interactional engine of the system is supported with content and linguistic knowledge of the domain under consideration. The knowledge with which the system is provided comprises both a corpus-based grammar for extracting the topic of interest from a user question and a domain-dependent corpus from which to extract the proper answer, given as feedback to the human interlocutor by the system.</p>
        <p>
          First of all, a collection of possible questions that a user could pose was carried out. An ad hoc structured survey enabled us to collect about 800 spoken questions, divided into 10 different categories. Each category was then modelled in a speech recognition grammar. The choice of this methodology depends on the fact that i) the speed of computation is higher when detecting the class a question belongs to, without the need to run complex algorithms on raw data; ii) the restricted domain of application can be better modelled with a rule-based approach [
          <xref ref-type="bibr" rid="ref15 ref23">15, 23</xref>
          ]; iii) the process of hand-crafting rules was simplified by the use of a linguistic ontology, by means of which semantically related words could be automatically included [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Specifically, the Speech Recognition Grammar Specification (SRGS) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] W3C standard has been developed to allow Automatic Speech Recognition (ASR) engines to output the semantic interpretation of the matched pattern instead of the raw transcription. This is an important advantage for spoken dialogue systems, as they can instruct ASR modules to expect specific word patterns and to present a structured interpretation of the obtained input to the dialogue manager. When the uttered word pattern is not included in one of the rules, the ASR finds the most similar pattern, selected with a specific confidence (if the confidence is too low, the system can be modelled to ask for further clarifications). This results in reduced latency, as linguistic analysis chains working on raw strings are avoided. This methodology had already been tested in a preliminary application for a different case study [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
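<p>As an illustration, the mapping from matched word patterns to semantic labels (rather than raw transcriptions) can be sketched as follows. The rule names, patterns, and fixed confidence here are hypothetical toy examples, not the deployed SRGS grammars:</p>

```python
import re

# Hypothetical miniature rule set: each pattern stands in for one SRGS
# rule and maps a matched utterance to a semantic class label.
RULES = {
    "author":   re.compile(r"\b(who (made|painted|built)|author)\b"),
    "date":     re.compile(r"\b(when|what year|how old)\b"),
    "material": re.compile(r"\b(what is it made of|material)\b"),
}

def interpret(utterance):
    """Return {label, confidence} for the first matching rule, else None."""
    text = utterance.lower()
    for label, pattern in RULES.items():
        if pattern.search(text):
            # A real ASR engine attaches its own confidence; fixed here.
            return {"label": label, "confidence": 1.0}
    return None  # no rule matched: out-of-grammar utterance
```

In the real system the unmatched case is handled by the ASR's closest-pattern matching with a confidence score, rather than by returning nothing.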
        <p>
          In more detail, a speech recognition grammar is a finite set of rules, where each rule, associated with a semantic label, generates a set of utterances. As far as the lexical enrichment of the grammar is concerned, given a collection of semantic relations (i.e. synonymy, hyperonymy and meronymy), the system is capable of expanding each rule in a grammar in order to produce a new grammar, where the new set of generated utterances includes the previous ones plus the additional lexical and morpho-syntactic information [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
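<p>The expansion step can be sketched as follows; the synonym table is a toy stand-in for the semantic relations actually drawn from MultiWordNet:</p>

```python
from itertools import product

SYNONYMS = {                    # toy stand-in for MultiWordNet relations
    "painting": ["canvas", "artwork"],
    "show": ["display"],
}

def expand_rule(utterances):
    """Return the original utterances plus all synonym substitutions."""
    expanded = set()
    for utt in utterances:
        words = utt.split()
        # For each word, the candidates are the word itself plus synonyms.
        choices = [[w] + SYNONYMS.get(w, []) for w in words]
        for combo in product(*choices):
            expanded.add(" ".join(combo))
    return expanded
```

Note that the expanded set always contains the original utterances, matching the description above.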
        <p>
          For the lexical extension of the differentiated syntactic realisations of the questions in our grammar, we made use of a graph database, named MultiWordNet-Extended [
          <xref ref-type="bibr" rid="ref17 ref5">17, 5</xref>
          ], which contains the Italian lexicon and the semantic relationships between words taken from MultiWordNet [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>For the Natural Language Generation module, we used another graph database containing textual nodes describing different points of the shown 3D scene. The nodes correspond to the concepts identified in the question classification process and are related to each other by inheritance (in-depth) relationships, when possible. Each node is, moreover, related to another node containing the textual answer to be given by the system.</p>
        <p>In order to generate the response, the system needs to verify three conditions. Being A a user interacting with the system:
- A is the last speaker.
- A asked a question recognised by the ASR.
- A pointed at one relevant object in the scene.</p>
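<p>A minimal sketch of this three-condition gate is given below. The uniform threshold is an illustrative assumption; the real system evaluates these conditions probabilistically through a Bayesian network, as described in Section 2.4:</p>

```python
# Sketch: a response is generated only when, for the same user, the
# speaker probability, ASR confidence, and pointing relevance are all
# high enough. The threshold value is an assumption for illustration.
def should_answer(speaker_prob, asr_confidence, pointing_relevance,
                  threshold=0.5):
    return (speaker_prob >= threshold and
            asr_confidence >= threshold and
            pointing_relevance >= threshold)
```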
        <p>When all the conditions are true with a considerable probability, the concept relative to the pointed object is used to query the graph database, retrieving the needed information in accordance with the semantic interpretation provided by the natural language understanding module. A more in-depth explanation of multimodal fusion is provided in Section 2.4.</p>
      </sec>
      <sec id="sec-2-1-2">
        <title>Pointing Recognition</title>
        <p>The interpretation of the referential function of a message is strictly connected to the uttered entities, which represent extralinguistic domain-related objects in the 3D scene, namely, in our scenario, the artworks or the structural items of the 3D model. These entities can sometimes be ambiguous, since users may have little expertise in the domain or since different items of the 3D model can be uttered similarly within the same domain. To overcome this interpretation problem, non-verbal behaviour, such as pointing gestures, can help the disambiguation process. In this subsection, we therefore present the way we modelled the pointing recognition function, enabling our system to interpret pointing gestures in the multimodal construction of intents by users. The pointing recognition (PR) task can be divided into two subtasks: a) user positioning in the virtual environment, b) gesture recognition.</p>
        <p>In our system, the Kinect sensor returns the set of points representing the joints of the human body. Using these points, it is possible to represent the user and their movements in the virtual environment by mapping Kinect joints to a virtual avatar's joints, as shown in Figure 1. Note that the virtual avatar in Figure 1 is displayed only for demonstration purposes, while in a real interaction its visibility is turned off. The Kinect base-of-spine joint position has been used to estimate the user's position in relation to the screen. Furthermore, the user's height has also been taken into account in order to improve the pointing precision.</p>
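<p>A sketch of the position estimate is given below; the coordinate conventions and the height heuristic are illustrative assumptions, not the deployed code:</p>

```python
# Sketch (made-up coordinate conventions): estimate the user's position
# relative to the screen from the Kinect base-of-spine joint, plus a
# rough upper-body height used to refine the pointing origin.
def user_position(spine_base, head, screen_origin=(0.0, 0.0, 0.0)):
    """spine_base, head: (x, y, z) joint coordinates in sensor space."""
    x, y, z = spine_base
    ox, oy, oz = screen_origin
    position = (x - ox, y - oy, z - oz)   # offset from the screen centre
    height = head[1] - spine_base[1]      # vertical spine-to-head extent
    return position, height
```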
        <p>
          After the user representation is obtained, the next step is to recognise pointing activities. We realised this task through a geometrical approach based on Unreal Engine 4 geometrical functions and collision detection. Using the shoulder and hand positions, we cast an invisible ray that ends on an object surface, generating a collision event. The event is managed using the semantic maps mechanism combined with the Art and Architecture Thesaurus [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] in order to retrieve the concept label, with its relevance value, associated with the collided point. In this way, the collision event is enriched with conceptual meaning to enable the system to understand what the user is referring to. To avoid wrong pointing recognition events being triggered while the arm merely transits across the scene, we considered the arm movement speed to distinguish transit areas from fixation points.
        </p>
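<p>The ray cast and the speed-based filter can be sketched geometrically as follows. The flat screen plane and the speed threshold are illustrative assumptions; the actual system uses Unreal Engine 4 collision detection against the curved 3D scene:</p>

```python
# Sketch: cast a ray from shoulder through hand and intersect it with a
# vertical plane at z = screen_z. A pointing event fires only when the
# hand moves slowly enough to count as a fixation, not a transit.
def pointing_target(shoulder, hand, screen_z=3.0,
                    hand_speed=0.0, speed_threshold=0.3):
    """Return the (x, y) hit point on the screen plane, or None."""
    if hand_speed > speed_threshold:      # transit movement, not fixation
        return None
    dx, dy, dz = (hand[i] - shoulder[i] for i in range(3))
    if dz <= 0:                           # ray does not go toward screen
        return None
    t = (screen_z - shoulder[2]) / dz     # ray parameter at the plane
    return (shoulder[0] + t * dx, shoulder[1] + t * dy)
```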
      </sec>
      <sec id="sec-2-2">
        <title>Active Speaker Detection</title>
        <p>Since museums are generally visited in groups, the system also needs to identify the speaker in order to address the answer to the right interlocutor. The active speaker detection (ASD) module has the responsibility of recognising the user who is actually speaking in a group. Specifically, the objective of this module is to distinguish between environmental noises and speech acts, and to compute the probability that a user is effectively talking. In this way, the system is able to take into account the gestures produced by the user with the highest probability.</p>
        <p>
          Several approaches have been proposed in the literature, using visual features [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] or both video and audio ones [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In order to avoid the problems deriving from data-driven approaches (data collection, computational complexity), we adopted a technique that computes the speaking probability considering only the current loudest sound source location and the users' positions. More in depth, we define α as the angle between the Kinect forward vector and the vector that points to the sound source, and β as the angle between the Kinect forward vector and the vector that points to the user. We also define Δ(α, β) as the difference |α − β|. Normalising Δ(α, β) into the range [0, 100] and dividing it by 100, we obtain a probability measure formalised as:
        </p>
        <p>P(U_i = True | S_α, L_i), with U_i ∈ {True, False},
where U_i represents the i-th user, S_α represents the sound source direction and L_i represents the current position of the i-th user.</p>
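<p>This measure can be sketched as follows; the use of degrees and a purely linear normalisation are assumptions consistent with the description above:</p>

```python
# Sketch: the smaller the difference between the sound-source angle
# (alpha) and the user's angle (beta), both measured against the Kinect
# forward vector, the more likely that user is the active speaker.
def speaking_probability(alpha_deg, beta_deg, max_diff=100.0):
    """Map |alpha - beta|, clipped to [0, max_diff], linearly onto [0, 1]."""
    diff = min(abs(alpha_deg - beta_deg), max_diff)
    return 1.0 - diff / max_diff
```

A user aligned with the sound source gets probability 1; a user at the maximum angular difference gets 0.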
      </sec>
      <sec id="sec-2-3">
        <title>Multimodal Fusion Engine</title>
        <p>In this subsection, we present the approach used to tackle the fusion of all the previously explained modules in order to provide multimodality. This module receives asynchronous messages from the input modules ASD, NLU and PR, handled by specific receivers. The messages are, respectively:
- an ASD message: the current speaker probability for each user;
- an NLU message: the user sentence recognised by the NLU module, with a confidence value;
- a PR message: the pointed object's semantic labels, with relevance values, for each user.</p>
        <p>Received messages cause the update of the corresponding Bayesian network input variables (current speaker variable, verbal act variable, pointing variable).</p>
        <p>
          The input fusion process is activated as soon as a user dialogue act is recognised. Input variables are synchronised and propagated through the probabilistic network to derive a common interpretation. To obtain multimodal unification, different formalisms and approaches have been adopted in the literature, i.e. statistical approaches [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], salience-based [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], rule-based [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and biologically motivated approaches [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Here we propose a strategy that defines validation rules for the random variables based on the study discussed in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Several modules collaborate to perform this task. The Multimodal Input Integrator aims at combining input variables coherently. In particular, this module analyses verbal actions, pointed objects and speakers in order to understand the current request. Since the variables evolve in real time, the Multimodal Time Manager is used to check consistency and prune out-of-date variables. In more detail, starting from the time-stamps related to the input variables, once a new speech signal is captured, the module compares its time intervals with those computed for each pointing variable, pruning off pointing gestures whose occurrence concluded more than 4 seconds before the start of the current speech signal. As input variables come in asynchronously, the State Monitor directs the entire operation by observing changes in the dialogue state. The unification methods are therefore called by this component as the dialogue progresses.
        </p>
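<p>The Multimodal Time Manager's 4-second pruning rule can be sketched as follows; the event layout and timestamp conventions are illustrative assumptions:</p>

```python
# Sketch: discard pointing events that concluded more than `window`
# seconds before the start of the current speech signal.
def prune_pointing(pointing_events, speech_start, window=4.0):
    """pointing_events: list of (label, end_time) tuples; times in seconds."""
    return [(label, end) for label, end in pointing_events
            if speech_start - end <= window]
```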
        <p>
          The next operations are in charge of the Dialogue Manager, which has been implemented using the OpenDial framework [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Here, the Dialogue State manages the variables and their connections, encoded in a Bayesian network, while the Dialogue System provides the APIs to check and update the model. Once the system has derived the current request, this level provides services to select the most appropriate machine action to be performed.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>System Evaluation</title>
      <p>In order to evaluate various aspects of the proposed system, an experimental setting was adopted. As this is ongoing work, a humanoid virtual conversational agent is not yet present in the setup. Specifically, a fixed 3D scene was designed in Unreal Engine 4, showing the user part of the San Lorenzo Charterhouse, namely the Apse and the Presbytery of the Church. A simple question-answering interaction was modelled, enabling users to interact multimodally with the system in order to obtain information about a small set of objects in the 3D environment. The evaluation was conducted in our laboratory by analysing interactions between the designed system and two users simultaneously. In order to avoid wrong interpretations caused by background noise, the evaluation was conducted in a room to which no persons besides the observer and the evaluated group were admitted. Moreover, a threshold on the input sound signal intensity was established in order to cut off environmental noise. The entire process was divided into the following steps:
1. System presentation: the system functionality is presented to participants.
2. Training session: a video-clip presentation is shown and a first guided interaction is performed, in order to allow users to get acquainted with the multimodal interface.
3. Task-oriented interaction: users are asked to cooperate in completing a set of assigned tasks to test the system functionality. Interactions are recorded through the Kinect and data logs in order to subsequently compute the success rate of each input recogniser. The recorded data can be used to compose a training set to automatically tune the parameters of the probabilistic network.
4. Free session: users interact with the system for an arbitrary time interval.</p>
      <p>This last phase was useful to collect further data on the way users would freely interact with such systems; nevertheless, these data have not yet been evaluated. The time spent by users during this phase could be used as an implicit estimation of their satisfaction.</p>
      <p>In order to evaluate the system and to identify the principal causes of irregular multimodal fusion, the success rate SR_i was computed for each module i as follows:</p>
      <p>SR_i = (SU_i / TI) × 100</p>
      <p>where SU_i is the total number of successful interpretations reported by the module recogniser i and TI is the total number of user interactions.</p>
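<p>The computation is a direct transcription of this definition:</p>

```python
# SR_i = SU_i / TI * 100: percentage of successful interpretations
# reported by module recogniser i over all user interactions.
def success_rate(successful, total):
    return successful / total * 100.0
```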
      <p>
        A total of 6 groups, each composed of 2 persons, was involved in the evaluation, recording the data shown in Table 1. Specifically, a total of 41'05" of task interactions and a total of 76' of training interactions were analysed. In particular, the 133 task interactions were used to estimate the success rate. The results (Table 1) show that, starting from a correct recognition of each input signal, the probabilistic network designed for the fusion engine is able to derive user requests during multimodal group interactions. The most relevant cases of erratic inference are caused by wrong input recognition. The Pointing Recognition module shows the best result, with a success rate of 97%. The module with the worst behaviour is the Active Speaker Detection module; in particular, this result can be related to the users' tendency to overlap one another during collaborative interactions. However, ASD performance may be improved by combining the sound source angle and the users' locations in the 3D environment with further features such as users' gaze direction and/or lip movements, similar to what is discussed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper presents a multi-channel input fusion approach applied to the development of a multimodal dialogue system for cultural heritage virtual experiences. The promising results show that this approach is worth further investigation. For this reason, starting from the architecture proposed in this paper, our purpose is to further improve the performance by extending the system functionality, enriching the content and linguistic knowledge, and applying pointing recognition to other objects in a more extended 3D scene. Furthermore, we aim at processing new input signals and at modelling multi-party dialogue to improve and promote collaborative interactions between users.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>
        This work is funded by the ongoing Italian PRIN project CHROME - Cultural Heritage Resources Orienting Multimodal Experience [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] #B52F15000450001.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Armstrong</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          :
          <article-title>The gestural theory of language origins</article-title>
          .
          <source>Sign Language Studies</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>289</fpage>
          –
          <lpage>314</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ballard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayhoe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pook</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Deictic codes for the embodiment of cognition</article-title>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          :
          <article-title>Definite knowledge and mutual knowledge</article-title>
          (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cutugno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorgente</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The chrome manifesto: integrating multimodal data into cultural heritage resources</article-title>
          .
          <source>In: Fifth Italian Conference on Computational Linguistics</source>
          , CLiC-it (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Di</given-names>
            <surname>Maro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Valentino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Riccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Origlia</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Graph databases for designing high-performance speech recognition grammars</article-title>
          .
          <source>In: IWCS 2017|12th International Conference on Computational Semantics</source>
          , Short papers (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christoudias</surname>
            ,
            <given-names>C.M.:</given-names>
          </string-name>
          <article-title>A salience-based approach to gesture-speech alignment</article-title>
          .
          <source>In: HLT-NAACL 2004: Main Proceedings</source>
          . pp.
          <fpage>25</fpage>
          –
          <lpage>32</lpage>
          . Association for Computational Linguistics, Boston, Massachusetts, USA (May 2 - May 7
          <year>2004</year>
          ), https://www.aclweb.org/anthology/N04-1004
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gebru</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evangelidis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horaud</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tracking the active speaker based on a joint audio-visual observation model</article-title>
          .
          <source>In: IEEE International Conference on Computer Vision Workshop (ICCVW)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Grazioso</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cera</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Maro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Origlia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cutugno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>From linguistic linked open data to multimodal natural interaction: A case study</article-title>
          .
          <source>In: 2018 22nd International Conference Information Visualisation (IV)</source>
          . pp.
          <fpage>315</fpage>
          -
          <lpage>320</lpage>
          . IEEE
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hunt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGlashan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Speech recognition grammar specification version 1.0</article-title>
          . Tech. rep.,
          <source>W3C</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Johnston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Unification-based multimodal parsing</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Computational Linguistics - Volume 1</source>
          . pp.
          <fpage>624</fpage>
          -
          <lpage>630</lpage>
          . COLING '98, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>1998</year>
          ). https://doi.org/10.3115/980451.980949
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kopp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gesellensetter</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , Kramer,
          <string-name>
            <given-names>N.C.</given-names>
            ,
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.:</surname>
          </string-name>
          <article-title>A conversational agent as museum guide - design and evaluation of a real-world application</article-title>
          .
          <source>In: Intelligent virtual agents</source>
          . pp.
          <fpage>329</fpage>
          -
          <lpage>343</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lepouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassilakis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Virtual museums for all: employing game technology for edutainment</article-title>
          .
          <source>Virtual reality 8(2)</source>
          ,
          <fpage>96</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lison</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kennington</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Opendial: A toolkit for developing spoken dialogue systems with probabilistic rules</article-title>
          .
          <source>Proceedings of ACL-2016 System Demonstrations</source>
          pp.
          <fpage>67</fpage>
          -
          <lpage>72</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Marty</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          :
          <article-title>Museum informatics: People, information, and technology in museums</article-title>
          , vol.
          <volume>2</volume>
          . Taylor &amp; Francis
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>McGlashan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraser</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bilange</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heisterkamp</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Youd</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Dialogue management for telephone information systems</article-title>
          .
          <source>In: Proceedings of the third conference on Applied natural language processing</source>
          . pp.
          <fpage>245</fpage>
          -
          <lpage>246</lpage>
          . Association for Computational Linguistics (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>McNeill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hand and mind: What gestures reveal about thought</article-title>
          . University of Chicago press (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Origlia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cutugno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Mwn-e: a graph database to merge morphosyntactic and phonological data for italian</article-title>
          .
          <source>Proc. of Subsidia</source>
          , page to appear (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Othman</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrie</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Power</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Engaging visitors in museums with technology: scales for the measurement of visitor and multimedia guide experience</article-title>
          .
          <source>In: IFIP Conference on Human-Computer Interaction</source>
          . pp.
          <fpage>92</fpage>
          -
          <lpage>99</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Oviatt</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeAngeli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Integration and synchronization of input modes during multimodal human-computer interaction</article-title>
          .
          <source>In: Proceedings of Conference on Human Factors in Computing Systems CHI '97 (March 22-27, Atlanta, GA)</source>
          . ACM Press, NY. pp.
          <fpage>415</fpage>
          -
          <lpage>422</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Pianta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bentivogli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girardi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>MultiWordNet: developing an aligned multilingual database</article-title>
          , pp.
          <fpage>293</fpage>
          -
          <lpage>302</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlmeyer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lier</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meyer zu Borgsen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlangen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kummert</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wachsmuth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wrede</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Are you talking to me?: Improving the robustness of dialogue systems in a multi-party HRI scenario by incorporating gaze direction and lip movement of attendees</article-title>
          .
          <source>In: Proceedings of the Fourth International Conference on Human Agent Interaction</source>
          . pp.
          <fpage>43</fpage>
          -
          <lpage>50</lpage>
          . ACM
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Schneegans</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Schoner, G.:
          <article-title>Dynamic field theory as a framework for understanding embodied cognition</article-title>
          .
          <source>In: Handbook of Cognitive Science</source>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>271</lpage>
          . Elsevier
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Serban</surname>
            ,
            <given-names>I.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henderson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charlin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pineau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A survey of available corpora for building data-driven dialogue systems: The journal version</article-title>
          .
          <source>Dialogue &amp; Discourse</source>
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Stefanov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugimoto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beskow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Look who's talking: Visual identification of the active speaker in multi-party human-robot interaction</article-title>
          . Association for Computing Machinery (ACM) pp.
          <fpage>22</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Traum</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artstein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foutz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerten</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katsamanis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leuski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noren</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swartout</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Ada and grace: Direct interaction with museum visitors</article-title>
          .
          <source>In: International Conference on Intelligent Virtual Agents</source>
          . pp.
          <fpage>245</fpage>
          -
          <lpage>251</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wachsmuth</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Communicative rhythm in gesture and speech</article-title>
          .
          <source>In: Proceedings of the International Gesture Workshop on Gesture-Based Communication in Human-Computer Interaction</source>
          . pp.
          <fpage>277</fpage>
          -
          <lpage>289</lpage>
          . GW '99, Springer-Verlag, London, UK, UK (
          <year>1999</year>
          ), http://dl.acm.org/citation.cfm?id=647591.728724
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Webber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A programmatic introduction to neo4j</article-title>
          . Addison-Wesley Professional
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oviatt</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          :
          <article-title>Multimodal integration - a statistical view</article-title>
          .
          <source>IEEE Trans. Multimedia</source>
          <volume>1</volume>
          ,
          <fpage>334</fpage>
          -
          <lpage>341</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Microsoft Kinect sensor and its effect</article-title>
          .
          <source>IEEE multimedia 19(2)</source>
          ,
          <fpage>4</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>