<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Chatbots in the museum</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Schaffer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Gustke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Oldemeier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norbert Reithinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI), Intelligent User Interfaces, Alt-Moabit 91c</institution>
          ,
          <addr-line>10559 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Linon Medien</institution>
          ,
          <addr-line>Steigerwaldblick 29, 97453 Schonungen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this short paper we report on work in progress in the research project "Chatbot in the museum" (ChiM). ChiM develops a practicable technical solution for the use of chatbots in the museum environment. The paper outlines the conceptual work conducted so far, including an overview of the three main research topics explored in ChiM, namely information processing, multimodal intent detection, and dialog management for museum chatbots.</p>
      </abstract>
      <kwd-group>
        <kwd>Chatbot</kwd>
        <kwd>museum guide</kwd>
        <kwd>interactive user interfaces</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A chatbot is a computer program that attempts to simulate the conversation of a
human being via text or voice interactions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the museum context, chatbots
offer great potential, as existing "pain points" can be eliminated: in contrast
to personal tours ("takes place in an hour", "is cancelled today"), chatbots
are always available. Today's digital visitor guidance systems offer only
"one-way communication" and are not able to respond to questions from the visitor.
Chatbots have the potential to respond meaningfully to the user's input if the
input is processed properly.
      </p>
      <p>Conversations with museum visitors showed that they often have specific
questions about certain objects. Classical audio guides cannot answer specific
questions. In the museum, a chatbot could be the expert you can take with you,
one that answers questions and provides further information. Fig. 1 exemplifies what the
conversational interaction between a visitor and the chatbot could look like.</p>
      <p>The aim of ChiM is to explore the usability of a chatbot as an interactive
system for knowledge and learning, as well as for effective access to and
comprehensible presentation of museum information. A central question is therefore
how the existing information must be structured in order to relate to the visitors'
questions. ChiM develops new solutions for providing information and explores
how the latest research results in the field of intention detection and dialog
management can be utilized for chatbots in the museum context. Specifically, the
following research fields of human-technology interaction are to be investigated:</p>
      <list list-type="bullet">
        <list-item>
          <p><italic>Information processing</italic>: adaptation of the existing content creation process for classical audio and media guides, to be more "knowledge-based" for the chatbot.</p>
        </list-item>
        <list-item>
          <p><italic>Multimodal intention detection</italic>: linguistic or text-based input processing combined with image and other exhibition sensor (e.g. beacons) processing.</p>
        </list-item>
        <list-item>
          <p><italic>Dialog management for guided tours</italic>: intelligent dialog strategies to determine context-relevant and personalized information.</p>
        </list-item>
      </list>
      <p>In this paper, the conceptual work in the ChiM research fields of information
processing, multimodal intent detection, and dialog management for museum
chatbots is outlined. In the next section, related work in chatbot-relevant fields is
sketched. Section 3 describes the concepts developed for ChiM so far. Section
4 describes ongoing work on use case development. Conclusion and future work
are described in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Current research in the field of interactive museum guides covers a wide range
of approaches, from beacons controlling the presentation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], through agent-based
techniques [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to robotic museum guides [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In the museum field, there are still no
elaborate technologies that substantially utilize conversational digital
systems such as chatbots. Providers such as helloguide (http://helloguide.de) have so far only carried over
the paradigm of entering numbers from a classical audio guide to chatbots. It is
not possible to engage in dialog or to ask questions with such museum chatbots.
Chatbot/dialog platforms such as Alexa (Amazon), Dialogflow (Google), Wit.ai
(Facebook) or Watson (IBM) enable intention detection for many domains (e.g.
for ordering a pizza, weather reports, flight booking, shopping, etc.) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However,
most of these platforms offer only limited customization to one's own domain,
and are limited in the choice of input and output modalities. In addition, they
are not able to manage a dialog with extensive knowledge bases. Therefore,
for special domains such as a tour through an exhibition, custom solutions for
intent detection and dialog management have to be implemented. In many
conversational systems, domain-specific knowledge is mapped by dialog grammars
and state machines [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Dialog grammars and state machines provide a reliable
method for intent detection and can guide the user through the dialog in a
goal-oriented way, but they are limited in flexibility compared to the diversity of natural
language [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. As a result, the intuitive operation of such systems, and thus user
acceptance, may suffer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Recent successes of statistical methods for natural
language processing (NLP) and dialog management open up new possibilities [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Statistical approaches make it possible to create more flexible models for intent detection
and dialog management based on existing training data. However, a common
problem is that there is little or no data for training. Hybrid methods that
combine domain-specific knowledge with statistical approaches, as an alternative for
low-data domains, are the subject of current research [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
    </sec>
    <sec id="sec-3">
      <title>The ChiM approach</title>
      <p>ChiM investigates hybrid methods for intention detection and dialog
management for the museum sector. A newly developed chatbot-based museum guide
usually represents a new knowledge domain for which training data is not yet
available. During the process of creating audio and multimedia guides, a large
amount of data is generated, which can also be used as training data if appropriate
transcription and indexing methods are applied. ChiM's approach is to combine the
development of a hybrid approach for intention detection and dialog
management with the creation process of museum tours, using the data generated by
authors and editors as training material for a statistical model. Domain-specific
knowledge components that cannot, or can only barely, be mapped statistically
will be realized through dialog grammars. In addition, the consideration of
further information channels (e.g. image recognition, exhibition sensor technology)
enables multimodal intent detection. A minimal sketch of this hybrid principle
follows below. Subsections 3.1-3.3 summarize the concepts developed so far for
the research areas explored in ChiM. Subsection 3.4 explains the iterative
realization of ChiM.</p>
      <sec id="sec-3-1">
        <title>Information Processing</title>
        <p>In contrast to classical audio and multimedia guides, information for a chatbot
system must be structured differently. The entire existing text creation process
must be converted into a more "knowledge-based" approach. ChiM needs (semi-)
structured data to allow for flexible interaction with the content. For data
preparation, approaches such as taxonomies and the use of digital asset
management architectures such as Fedora Commons (http://fedora-commons.org/)
will be integrated into editors' workflows. Further, it will be investigated
whether and how the existing data for museum guides and their contents can be
prepared in such a way that they can serve as training material. A hypothetical
example of such a record is shown below.</p>
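        <p>The following hypothetical example shows how one audio guide stop could be restructured into a (semi-)structured record; all field names and values are assumptions for illustration, not the actual ChiM data model.</p>
        <preformat>
# Hypothetical (semi-)structured record for one exhibit; field names and
# values are assumed for illustration.

exhibit = {
    "id": "sculpture_042",
    "title": "Example Sculpture",
    "topics": ["artist", "material", "history"],   # taxonomy entries
    "answers": {                                   # intent -> answer text
        "ask_artist": "The sculpture is attributed to an unknown master.",
        "ask_material": "It is carved from lime wood.",
    },
    "audio": "tour1/042_intro.mp3",                # legacy guide asset, reused
}

def answer(record, intent):
    """Return the stored answer for an intent, or a default."""
    return record["answers"].get(intent, record["title"])
        </preformat>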
      </sec>
      <sec id="sec-3-2">
        <title>Multimodal Intention Detection</title>
        <p>
          Detecting user intent in the museum environment is a complex, multimodal
process. Visitors can interact with text or voice input. Thus, on the one hand,
the linguistic or text-based input of the user must be processed. On the basis
of historical data and the results of the information processing, statistical and
rule-based procedures will be evaluated and used. On the other hand, the
consideration of further information channels, such as image processing and existing
exhibition sensors (e.g. beacons for localization), is of particular interest in a
museum context. For this, the MMIR (Multimodal Mobile Interaction and
Rendering) framework will be used and extended [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The framework supports the
creation of multi-platform applications and enables straightforward integration
of existing libraries, such as OpenCV for image recognition [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and existing
location technologies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to explore the multimodal approach. By merging the
information channels, the multimodal intent detection transfers the existing
information into structured data that can be further processed to enable information
retrieval (the determination of the relevant data) and the provision of
information (the preparation and comprehensible presentation of the information). This
fusion of the individual information channels results in the multimodal intention
that represents the input to the dialog management [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
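        <p>A minimal late-fusion sketch over the channels named above follows; the precedence ordering (camera before beacon) and all identifiers are assumptions, not the actual ChiM fusion scheme.</p>
        <preformat>
# Minimal late-fusion sketch: merge text intent, image recognition, and
# beacon proximity into one multimodal intention (ordering is assumed).

def fuse(text_intent, image_object, beacon_exhibit):
    """Merge channel hypotheses into one multimodal intention."""
    intention = {"intent": text_intent, "exhibit": None}
    if image_object is not None:
        intention["exhibit"] = image_object       # a photo is most specific
    elif beacon_exhibit is not None:
        intention["exhibit"] = beacon_exhibit     # fall back to proximity
    return intention

# "Who made this work?" asked next to exhibit 042, without a photo:
print(fuse("ask_artist", None, "sculpture_042"))
# {'intent': 'ask_artist', 'exhibit': 'sculpture_042'}
        </preformat>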
      </sec>
      <sec id="sec-3-3">
        <title>Dialog Management for Museum Tours</title>
        <p>The individual steps from intention detection through information
retrieval to the provision of information are continuously carried out in a dialog
between the user and the chatbot. As the core of the chatbot, an intelligent
dialog management will process the multimodal intention and determine the
system reaction in order to offer personalized information. The dialog should always be
effective and intuitive, and should continue with the best possible user experience. The
input/output modalities will be adapted to the situation using text, speech or
multimedia elements.</p>
        <p>
          More specifically, the information presented to the user is determined from
the intention, the dialog history, the knowledge model from the information
processing, and the general context. The processing of the dialog management will
be hybrid: typical museum guide sequences are learned from historical data. At
the same time, dialog rules will be created and the two approaches combined [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
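        <p>A hedged sketch of this hybrid decision step is given below: handcrafted rules answer direct questions, while sequences learned from historical guide data propose the next exhibit; all names are illustrative assumptions.</p>
        <preformat>
# Sketch of the hybrid decision step; "knowledge" maps exhibits to
# intent/answer pairs, "learned_sequence" encodes a typical tour order
# learned from historical data (both are assumed structures).

def next_action(intention, history, knowledge, learned_sequence):
    exhibit = intention.get("exhibit") or (history[-1] if history else None)
    answers = knowledge.get(exhibit, {})
    if intention["intent"] in answers:
        return ("answer", answers[intention["intent"]])     # rule-based
    if intention["intent"] == "next_exhibit":
        return ("guide_to", learned_sequence.get(exhibit))  # learned
    return ("clarify", "Could you rephrase your question?")
        </preformat>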
        <sec id="sec-3-3-1">
          <title>4 http://fedora-commons.org/</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Iterative Realization</title>
        <p>The realization takes place in two iterations using a user-centered and
participative design approach. In the first iteration, the editors will prepare the knowledge
for an exhibition. At the same time, the basic ChiM functionality will be
investigated with focus groups in an iterative UX process and tested in the lab with
potential users. In the second iteration, the ChiM process will be implemented
within further exhibition guides and evaluated with users in the context of field
studies.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Use Case Development</title>
      <p>As part of our ongoing work, we elaborate on the proposed input and output
modalities of the system and exemplify the goal of our approach with respect
to the user by outlining first use case ideas related to multimodal
interaction. As a precondition for all use cases, it can be assumed that the chatbot
is installed on a smartphone that has the necessary technical equipment, i.e.
microphone, camera, and iBeacon or, respectively, Bluetooth Low Energy support.</p>
      <p>As the main input modalities, touch input on a virtual keyboard and, alternatively,
speech input enabled by automatic speech recognition (ASR) will be utilized to
interact with the chatbot. The smartphone camera will be used for computer vision
in order to recognize exhibits or other objects relevant for a specific exhibition.
Further, the proximity to exhibits will be detected by beacons and processed
as additional context, as sketched below.</p>
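      <p>The beacon processing can be illustrated by the following sketch, in which beacon identifiers are mapped to exhibits and the strongest received signal wins; the UUIDs and exhibit IDs are assumptions for illustration.</p>
      <preformat>
# Illustrative only: map (uuid, major) beacon identifiers to exhibits and
# pick the strongest signal (RSSI); identifiers are assumed values.

BEACON_TO_EXHIBIT = {
    ("f7826da6", 1): "sculpture_042",
    ("f7826da6", 2): "painting_017",
}

def nearest_exhibit(scans):
    """scans: list of (uuid, major, rssi) tuples from a BLE scan."""
    if not scans:
        return None
    uuid, major, _ = max(scans, key=lambda s: s[2])   # strongest RSSI
    return BEACON_TO_EXHIBIT.get((uuid, major))

print(nearest_exhibit([("f7826da6", 1, -68), ("f7826da6", 2, -81)]))
# sculpture_042
      </preformat>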
      <p>The output modalities will include visual, auditory and tactile feedback. The
visual information comprises mainly the dialog with the chatbot and media
content such as images and videos. The chatbot will make use of speech synthesis to
create auditory feedback for the system prompts within the user-chatbot dialog.
System prompts will be generated for meta communication, i.e. system
configuration, as well as for specific information coming from the knowledge model
specific to the exhibition. Further auditory content consists of recorded audio
material (also in combination with images or video), as in classical multimedia
guides. Tactile feedback can be used for alerts, e.g. if the user gets close to the
next exhibit of a tour.</p>
      <p>With regard to the context in which this interaction will take place, a number
of factors have to be considered. At the place itself, i.e. a museum or an
exhibition, the acoustic characteristics can be very different, as the exhibition areas
can be located in both large halls and smaller rooms. Further, people in the
surrounding area can, on the one hand, produce noise that is harmful to ASR;
on the other hand, talking to the chatbot could distract other visitors. Within an
exhibition, the auditory signal should therefore be received by the user via
headphones. It can further be necessary to avoid the usage of speech input if other
visitors are located right next to the user. The system should therefore always
provide touch input on a virtual keyboard as an alternative input mode. To
determine relevant information, not only the current user input but also previous
inputs, questions, and interactions should be considered to enable personalized
responses.</p>
      <p>Based on these assumptions, we have so far generated the following ideas for
multimodal use cases (a brief sketch of the second use case follows the list):</p>
      <list list-type="bullet">
        <list-item>
          <p><italic>Sequential usage of touch or speech with independent fusion</italic>: either touch screen or speech input can be used to pose text-based questions about the exhibits to the chatbot.</p>
        </list-item>
        <list-item>
          <p><italic>Sequential usage of computer vision and textual information with combined fusion</italic>: the camera can be used to enter visual information that is recognized by means of computer vision. If an image is recognized, specific questions about the corresponding exhibit can be asked, e.g. after taking a picture of a sculpture the user can use touch screen or speech input to ask "Who made this work?".</p>
        </list-item>
        <list-item>
          <p><italic>Sequential usage of textual information and exhibition sensors with combined fusion</italic>: after specific textual input via touch screen or speech, e.g. "Where do we go next?", exhibition sensor processing (e.g. beacons) can be used to guide the user to the closest exhibit on the tour.</p>
        </list-item>
      </list>
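      <p>As a brief illustration, the second use case could be handled as in the following self-contained sketch; the recognizer is a placeholder and all names are assumptions.</p>
      <preformat>
# Self-contained sketch of the "computer vision plus textual question"
# use case with combined fusion; recognizer and names are placeholders.

def recognize_exhibit(photo_bytes):
    """Stand-in for an OpenCV-based exhibit recognizer."""
    return "sculpture_042"

def handle_turn(photo_bytes, question_text):
    exhibit = recognize_exhibit(photo_bytes) if photo_bytes else None
    intent = "ask_artist" if "who made" in question_text.lower() else "ask_info"
    return {"intent": intent, "exhibit": exhibit}

# The visitor photographs a sculpture and asks "Who made this work?":
print(handle_turn(b"jpeg-bytes", "Who made this work?"))
# {'intent': 'ask_artist', 'exhibit': 'sculpture_042'}
      </preformat>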
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>The development of a practical technical solution for the use of chatbots in the
museum environment represents a challenging task: the editing process for audio
and media guides is highly specialized, and it has to be extended towards
knowledge-based approaches for the realization of a chatbot for museums. The interplay
of existing approaches for intention detection and dialog management opens up a
number of research questions, including how chatbots can be used in complex
environments such as museums. A solution for hybrid knowledge processing and the
technical implementation of information processing, multimodal intent detection
and dialog management for museum chatbots will be explored in ChiM. Different
multimodal use cases will be studied. The success of the project will be assessed
by a demonstrator implemented iteratively in two development phases. To ensure
the acceptance of ChiM, the overall system will be designed, developed, analyzed
and optimized in cooperation with museums and visitors. Focus groups and field
studies in various museums will be conducted.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank the Stadel Musem Frankfurt, the Germanische Nationalmuseum Nurnberg
and the LVR- LandesMuseum Bonn for their interest in supervising, testing and
evaluating the ChiM developments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Tom</given-names>
            <surname>Bocklisch</surname>
          </string-name>
          , Joey Faulkner, Nick Pawlowski, and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Nichol</surname>
          </string-name>
          . Rasa:
          <article-title>Open source language understanding and dialogue management</article-title>
          .
          <source>arXiv preprint arXiv:1712.05181</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>Massimo</given-names> <surname>Canonico</surname></string-name>
          and
          <string-name><given-names>Luigi</given-names> <surname>De Russis</surname></string-name>
          .
          <article-title>A comparison and critique of natural language understanding tools</article-title>
          .
          <source>CLOUD COMPUTING 2018</source>
          , page
          <fpage>120</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Zhiqiang</given-names> <surname>He</surname></string-name>
          , <string-name><given-names>Binyue</given-names> <surname>Cui</surname></string-name>
          , <string-name><given-names>Wei</given-names> <surname>Zhou</surname></string-name>
          , and <string-name><given-names>Shigeki</given-names> <surname>Yokoi</surname></string-name>
          .
          <article-title>A proposal of interaction system between visitor and collection in museum hall by iBeacon</article-title>
          .
          <source>In Computer Science &amp; Education (ICCSE), 2015 10th International Conference on</source>
          , pages
          <fpage>427</fpage>
          -
          <lpage>430</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. TechTarget Homepage. https://searchcrm.techtarget.com/definition/chatbot,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Dan</given-names> <surname>Jurafsky</surname></string-name>
          and
          <string-name><given-names>James H.</given-names> <surname>Martin</surname></string-name>
          .
          <source>Speech and language processing</source>
          , volume
          <volume>3</volume>
          . Pearson London,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>Stefan</given-names> <surname>Kopp</surname></string-name>
          , <string-name><given-names>Lars</given-names> <surname>Gesellensetter</surname></string-name>
          , <string-name><given-names>Nicole C.</given-names> <surname>Krämer</surname></string-name>
          , and <string-name><given-names>Ipke</given-names> <surname>Wachsmuth</surname></string-name>
          .
          <article-title>A conversational agent as museum guide - design and evaluation of a real-world application</article-title>
          .
          <source>In International Workshop on Intelligent Virtual Agents</source>
          , pages
          <fpage>329</fpage>
          -
          <lpage>343</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>Hui</given-names> <surname>Liu</surname></string-name>
          , <string-name><given-names>Houshang</given-names> <surname>Darabi</surname></string-name>
          , <string-name><given-names>Pat</given-names> <surname>Banerjee</surname></string-name>
          , and <string-name><given-names>Jing</given-names> <surname>Liu</surname></string-name>
          .
          <article-title>Survey of wireless indoor positioning techniques and systems</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)</source>
          ,
          <volume>37</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1067</fpage>
          -
          <lpage>1080</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Christophe</given-names>
            <surname>Mollaret</surname>
          </string-name>
          , Alhayat Ali Mekonnen, Isabelle Ferrane, Julien Pinquier, and
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Lerasle</surname>
          </string-name>
          .
          <article-title>Perceiving user's intention-for-interaction: A probabilistic multimodal data fusion scheme</article-title>
          .
          <source>In Multimedia and Expo (ICME), 2015 IEEE International Conference on</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><given-names>Sebastian</given-names> <surname>Möller</surname></string-name>
          , <string-name><given-names>Klaus-Peter</given-names> <surname>Engelbrecht</surname></string-name>
          , <string-name><given-names>Christine</given-names> <surname>Kühnel</surname></string-name>
          , <string-name><given-names>Ina</given-names> <surname>Wechsung</surname></string-name>
          , and <string-name><given-names>Benjamin</given-names> <surname>Weiss</surname></string-name>
          .
          <article-title>A taxonomy of quality of service and quality of experience of multimodal human-machine interaction</article-title>
          .
          <source>In Quality of Multimedia Experience, 2009. QoMEx 2009. International Workshop on</source>
          , pages
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><given-names>Kari</given-names> <surname>Pulli</surname></string-name>
          , <string-name><given-names>Anatoly</given-names> <surname>Baksheev</surname></string-name>
          , <string-name><given-names>Kirill</given-names> <surname>Kornyakov</surname></string-name>
          , and <string-name><given-names>Victor</given-names> <surname>Eruhimov</surname></string-name>
          .
          <article-title>Realtime computer vision with OpenCV</article-title>
          .
          <source>Queue</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ):
          <fpage>40</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><given-names>M. Golam</given-names> <surname>Rashed</surname></string-name>
          , <string-name><given-names>Ryota</given-names> <surname>Suzuki</surname></string-name>
          , <string-name><given-names>Antony</given-names> <surname>Lam</surname></string-name>
          , <string-name><given-names>Yoshinori</given-names> <surname>Kobayashi</surname></string-name>
          , and <string-name><given-names>Yoshinori</given-names> <surname>Kuno</surname></string-name>
          .
          <article-title>Toward museum guide robots proactively initiating interaction with humans</article-title>
          .
          <source>In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><given-names>Aaron</given-names> <surname>Ruß</surname></string-name>
          .
          <article-title>MMIR framework: Multimodal mobile interaction and rendering</article-title>
          .
          <source>In GI-Jahrestagung</source>
          , pages
          <fpage>2702</fpage>
          -
          <lpage>2713</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Blaise</given-names>
            <surname>Thomson</surname>
          </string-name>
          .
          <article-title>Statistical methods for spoken dialogue management</article-title>
          .
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><given-names>Jason D.</given-names> <surname>Williams</surname></string-name>
          , <string-name><given-names>Kavosh</given-names> <surname>Asadi</surname></string-name>
          , and <string-name><given-names>Geoffrey</given-names> <surname>Zweig</surname></string-name>
          .
          <article-title>Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1702.03274</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>