<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PyGenbot for IoT: a demonstration of how to generate any restricted stateless AIML FAQ-chatter bot from text files</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni De Gasperis</string-name>
          <aff>Università degli Studi dell'Aquila, Italy</aff>
          <email>giovanni.degasperis@univaq.it</email>
        </contrib>
      </contrib-group>
      <fpage>14</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Internet of Things (IoT) applications are required to interact with the user in the most natural way possible. Voice-based conversation is the ultimate human-machine interaction in terms of ease of use and of the little it demands from the user, and it has the further advantage of letting the user interact hands-free, without necessarily watching a computer screen. Chatter bots are conversational agents that simulate, and are capable of sustaining, a conversation with a human. Technology exists that allows the creation of a lexical knowledge base to be used by a restricted chatter bot, i.e. one that is expert in a specific domain. This work shows a methodology for restricted chatbot generation using a Python program, called PyGenbot, that is capable of deriving an AIML (Artificial Intelligence Markup Language) knowledge base starting from a simple textual data set, including: a FAQ, a</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Internet of Things (IoT) applications are required to interact
with the user in the most natural way possible. Voice-based
conversation is the ultimate human-machine interaction
in terms of ease of use and of the little it demands from the user,
and it has the further advantage of letting the user interact hands-free,
without necessarily watching a computer screen, as in the scenario
of a car driver. Many commercial solutions have recently come
from major smartphone corporations, mostly specialized for the
smartphone usage scenario: sending messages, handling the
calendar, fixing an appointment, searching for a nearby
restaurant. Home appliances with similar capabilities have also
appeared on the market. Most of the time, these are proprietary
solutions, not readily available to developers, but tightly integrated
into commercial products or offered through a licensed cloud API.
Voice recognition is not the focus of this work:
I take for granted that a hardware device or a
cloud API exists that performs the voice-to-text recognition task. The
focus of this work is to generate a proper textual reply to a
text produced by the user by any means, typing or talking.</p>
      <p>
        I offer a general-purpose tool that can be applied to
any IoT application with limited computational capability, by
using automatically generated AIML chatter bots. Chatter bots are
conversational agents that simulate and sustain a conversation
with a human, mostly in a restricted knowledge domain. Since
ELIZA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], chatter bots based on text pattern recognition have
come a long way [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A.L.I.C.E. is a handcrafted
chatter bot composed of about 50,000 lexical categories edited
by a community of about 500 authors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], aiming to be
unrestricted in its knowledge in an attempt to pass a limited
implementation of the Turing test known as the Loebner
Prize [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A.L.I.C.E.’s lexical knowledge base is described
using the Artificial Intelligence Markup Language, AIML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The lexical categories in AIML are defined by means of
(pattern, template) tuples in an XML-derived syntax:
      </p>
      <preformat>&lt;category&gt;
  &lt;pattern&gt;WHAT IS LINUX&lt;/pattern&gt;
  &lt;template&gt;
    Linux is an open-source
    computer operating system
  &lt;/template&gt;
&lt;/category&gt;</preformat>
      <p>Also, different categories with a common semantic
background can be linked together by means of an SRAI connection:</p>
      <preformat>&lt;category&gt;
  &lt;pattern&gt;WHAT IS GNU LINUX&lt;/pattern&gt;
  &lt;template&gt;
    &lt;srai&gt;WHAT IS LINUX&lt;/srai&gt;
  &lt;/template&gt;
&lt;/category&gt;</preformat>
      <p>In this way a tree of SRAI connections can link all of the
different lexical forms to their lemma. Using wildcards, it is
also possible to filter out common words, isolating keywords.</p>
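      <p>The category and SRAI structures described above can also be produced programmatically. The following is a minimal sketch, not PyGenbot's actual code: the helper name category and the wildcard pattern "* LINUX" are assumptions introduced only for illustration, and the sketch relies solely on Python's standard xml.etree.ElementTree module.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET


def category(pattern, answer=None, srai=None):
    """Build one AIML <category>; pass either a literal answer or an SRAI target.

    Hypothetical helper, introduced for illustration only.
    """
    cat = ET.Element("category")
    ET.SubElement(cat, "pattern").text = pattern.upper()
    template = ET.SubElement(cat, "template")
    if srai is not None:
        # Redirect this lexical form to another pattern (its lemma).
        ET.SubElement(template, "srai").text = srai.upper()
    else:
        template.text = answer
    return cat


aiml = ET.Element("aiml", version="1.0")
# Lemma category: the only one that holds the actual answer.
aiml.append(category("WHAT IS LINUX",
                     answer="Linux is an open-source computer operating system"))
# Alternative lexical form, linked to the lemma through SRAI.
aiml.append(category("WHAT IS GNU LINUX", srai="WHAT IS LINUX"))
# Wildcard form (assumed example): any words followed by the keyword
# are reduced to the lemma, filtering out the common words.
aiml.append(category("* LINUX", srai="WHAT IS LINUX"))

aiml_text = ET.tostring(aiml, encoding="unicode")
print(aiml_text)
```
]]></preformat>
      <p>Only the lemma category carries the answer text, so an answer has to be edited in one place even when many lexical forms point to it.</p>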
      <p>
        The reasoner, i.e. the AIML interpreter, is a LISP program
proposed by Richard Wallace [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], designed to search for the
best text pattern match for the user input, so as to give the
most appropriate answer during the written conversation. The
IBM question answering system (QAS) known as Watson
won a human-versus-machine challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], using
a brute-force search algorithm on an unrestricted knowledge
domain. In this work, however, I concentrate on automating the
generation of stateless restricted chatter bots only, given
that their lexical knowledge can be expressed by means of
a combination of a frequently asked questions/glossary set and
keywords, multiwords and stop words lists. The input data
set can be arbitrary, so the methodology is unrestricted,
although the final products, i.e. the FAQ chatter bots, are not.
Also, the overall method does not depend on the language,
so multilingual IoT applications can be readily designed with
parallel free text corpora for each language. This paper shows how
to apply the PyGenbot Python program [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] designed to
automatically generate Artificial Intelligence Markup Language
(AIML) knowledge bases.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. INPUT DATA SET</title>
      <p>The input text data set is defined as follows:
1) a frequently asked questions (FAQ) file F
2) a glossary file G
3) a keywords list file K
4) a stop words list file S
5) a multiwords list file M</p>
      <p>All files are simple free-text documents with some basic
structure to distinguish the text of the questions from the text
of the answers, or the glossary items from their respective definitions.
The keywords, multiwords and stop words files just contain
one entry per line, or entries separated by commas. The FAQ file F
is completely defined by the chatter bot designer. It contains
questions and answers in the simple form:
Q &lt;question&gt; | {Q &lt;alternative version&gt;}
A &lt;answer&gt; | {A &lt;alternative version&gt;}</p>
      <p>Alternative versions of a question increase the
chances of matching the user input; alternative answers
increase the variability of the answers given by the
chatter bot in response to the user input. The multiwords list
is very important for isolating conceptual entities that span more
than one word, for example “operating system” or “credit
card”.</p>
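      <p>A FAQ file in the Q/A form can be read with a few lines of Python. The sketch below is only an illustration under stated assumptions: it assumes that every question line starts with "Q " and every answer line with "A ", and that consecutive Q lines followed by consecutive A lines form one FAQ entry. The function name parse_faq and the sample text are hypothetical and are not taken from PyGenbot.</p>
      <preformat><![CDATA[
```python
def parse_faq(text):
    """Group consecutive 'Q ...' lines with the 'A ...' lines that follow them.

    Assumed layout (not confirmed by the paper): one Q or A item per line.
    """
    entries = []
    questions, answers = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Q "):
            # A question that follows some answers starts a new FAQ entry.
            if answers:
                entries.append({"questions": questions, "answers": answers})
                questions, answers = [], []
            questions.append(line[2:].strip())
        elif line.startswith("A ") and questions:
            answers.append(line[2:].strip())
    if questions and answers:
        entries.append({"questions": questions, "answers": answers})
    return entries


# Made-up sample data, used only to exercise the parser.
sample = """\
Q What is Linux?
Q What does Linux mean?
A Linux is an open-source computer operating system.
Q What is a credit card?
A A payment card issued by a bank.
A A card that lets you pay on credit.
"""

faq = parse_faq(sample)
for entry in faq:
    print(entry["questions"], "->", entry["answers"])
```
]]></preformat>
      <p>Each entry keeps all the alternative questions and all the alternative answers together, which is the grouping needed to widen the matching and to vary the replies.</p>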
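      <p>The input data set and the information workflow are
summarized by the diagram in Fig. 1.</p>
      <p>[Fig. 1: the input data set (FAQ, glossary, Wiktionary lookup, keywords, stop words) and the information workflow (user input, stop words filter, question, answer).]</p>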
      <p>The glossary item definitions can be enriched using the
free online dictionary Wiktionary<sup>1</sup> or the Python NLTK<sup>2</sup>.
Glossary items should cover the most significant terms of
the restricted knowledge domain in which the final chatter
bot is designed to be expert. Keywords should be selected
from the text of the questions in order to optimize pattern
matching against the user input. The stop words list contains just the most
common words of the language, i.e. elements of the structural and
connective lexicon. The input set therefore determines:
• the language (English, Italian, German, etc.)
• the restricted knowledge domain
• the vocabulary</p>
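      <p>The roles of the keywords, multiwords and stop words lists can be illustrated with a small Python sketch. This is one possible reading of the filtering step, not the published algorithm: the function names normalize and extract_keywords, the upper-casing, and the choice of joining a multiword with underscores are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import re


def normalize(text, multiwords):
    """Upper-case the text, drop punctuation, and join each multiword
    into a single token (the underscore joining is an assumption)."""
    cleaned = re.sub(r"[^\w\s]", " ", text.upper())
    cleaned = " ".join(cleaned.split())
    # Longest multiwords first, so a longer entry wins over its own prefix.
    for mw in sorted(multiwords, key=len, reverse=True):
        phrase = " ".join(mw.upper().split())
        cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b",
                         phrase.replace(" ", "_"), cleaned)
    return cleaned.split()


def extract_keywords(text, keywords, stopwords, multiwords):
    """Filter out the stop words, then keep only known keywords and multiwords."""
    keyset = {k.upper() for k in keywords}
    keyset |= {"_".join(m.upper().split()) for m in multiwords}
    stopset = {s.upper() for s in stopwords}
    tokens = [t for t in normalize(text, multiwords) if t not in stopset]
    return [t for t in tokens if t in keyset]


found = extract_keywords(
    "Which operating system is Linux?",
    keywords=["linux"],
    stopwords=["which", "is", "a", "the"],
    multiwords=["operating system"],
)
print(found)
```
]]></preformat>
      <p>Treating a multiword as one token before the stop words are removed keeps an entity such as “operating system” from being split into two unrelated words.</p>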
      <p>
        The generated chatter bot is considered stateless since
it can only exhibit purely reactive behavior in response to
a textual stimulus. A more sophisticated prototype could
be built by adopting a middle layer based on proactive multi-agent system logic
frameworks such as DALI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or AgentSpeak/Jason [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Nevertheless, a stateless chatter bot is what is needed
in the majority of IoT applications, where an appliance needs to
give a correct answer to a user or to set a working parameter
to accomplish a user-given task.
      </p>
      <p>PyGenbot is a Python program of about 750 lines of
code that takes the set of text files as input and produces a set of
AIML files, ready to be uploaded to any AIML interpreter,
which finally implements the actual stateless chatter bot able
to interact with the user. The usage scenario is analogous to using a
compiler to produce machine language (AIML) from high-level
source files (FAQ, keywords, multiwords, stop words), even though
in this case the underlying natural language is not context-free,
unlike programming languages.</p>
      <p>
        The algorithm has been published in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Fig. 1 shows the reference
idea that inspires the construction of this kind
of restricted chatter bot.
      </p>
      <p>PyGenbot generates three sets of AIML files:
• the FAQ/keywords/multiwords categories
• the glossary categories and “WHAT IS *” question
patterns
• the stop words filtering categories</p>
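      <p>As an illustration of what stop words filtering categories may look like, the sketch below emits, for each stop word, three AIML reduction categories that strip the word from the beginning, the end, or the middle of the input and resubmit the rest through SRAI. This is a common AIML reduction idiom and is shown here as an assumption; the paper does not state that PyGenbot generates exactly these patterns, and the function name stopword_categories is hypothetical.</p>
      <preformat><![CDATA[
```python
CATEGORY = ("<category><pattern>{}</pattern>"
            "<template><srai>{}</srai></template></category>")


def stopword_categories(stopwords):
    """Return AIML reduction categories that drop each stop word.

    Assumed scheme: the '*' wildcard captures the remaining words and
    <star/> feeds them back to the interpreter through <srai>.
    """
    categories = []
    for word in stopwords:
        w = word.upper()
        forms = [
            (w + " *", "<star/>"),                  # stop word at the start
            ("* " + w, "<star/>"),                  # stop word at the end
            ("* " + w + " *",                       # stop word in the middle
             '<star index="1"/> <star index="2"/>'),
        ]
        for pattern, reduced in forms:
            categories.append(CATEGORY.format(pattern, reduced))
    return categories


cats = stopword_categories(["the", "a"])
for cat in cats:
    print(cat)
```
]]></preformat>
      <p>Each reduction shortens the input by one word, so the recursion through SRAI terminates; exact patterns from the FAQ set still match before the wildcard forms are tried.</p>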
      <p>The FAQ/keywords/multiwords AIML set can grow to
several thousand categories, so the generation algorithm
needs to be tuned through the maximum number of categories
that each AIML file can contain, depending on the complexity of the
FAQ/keywords/multiwords text file set and on the
AIML interpreter adopted. The output AIML 1.0 file set is
then ready to be used in an AIML hosting web service, such as
http://pandorabots.com.</p>
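      <p>The limit on the number of categories per file can be sketched as a simple chunking step. The code below is a hedged illustration only: the function name split_categories, the parameter max_per_file and the file naming scheme are assumptions, and the categories are taken to be ready-made AIML strings.</p>
      <preformat><![CDATA[
```python
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<aiml version="1.0">\n'
FOOTER = "\n</aiml>\n"


def split_categories(categories, max_per_file):
    """Return (file_name, aiml_text) pairs, each holding at most
    max_per_file categories (hypothetical tuning parameter)."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be at least 1")
    files = []
    for start in range(0, len(categories), max_per_file):
        chunk = categories[start:start + max_per_file]
        # Assumed naming scheme: faq_001.aiml, faq_002.aiml, ...
        name = "faq_{:03d}.aiml".format(start // max_per_file + 1)
        files.append((name, HEADER + "\n".join(chunk) + FOOTER))
    return files


# Five dummy categories split into files of at most two categories each.
dummy = [
    "<category><pattern>Q{0}</pattern><template>A{0}</template></category>".format(i)
    for i in range(5)
]
files = split_categories(dummy, max_per_file=2)
for name, text in files:
    print(name, text.count("<category>"))
```
]]></preformat>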
      <p><sup>1</sup> http://en.wiktionary.org, last accessed June 2016. <sup>2</sup> http://nltk.org, last accessed June 2016.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. QUALITY ASSESSMENT</title>
      <p>
        A measurable metric of the
correctness of the final chatter bot is needed. As already proposed in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a three-level metric can be adopted:
• Level 0: the resulting chatter bot gives a correct
answer to all the questions included in the FAQ, with
exact text matching
• Level 1: the resulting chatter bot gives at least 50%
correct answers to questions that do not use the exact wording of the
original FAQ questions, but have the same meaning
• Level 2: the resulting chatter bot gives at least
50% correct answers to questions with completely
different wording, but the same meaning as the original FAQ
questions
      </p>
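      <p>The three-level metric can be turned into a small scoring routine. The sketch below rests on assumptions that the paper does not state: the levels are treated as cumulative, a reply counts as correct only when it equals the expected answer string, and the names correct_ratio, quality_level and toy_bot are hypothetical. In practice, judging whether a reply is correct may require a human evaluator.</p>
      <preformat><![CDATA[
```python
def correct_ratio(bot, test_pairs):
    """Fraction of (question, expected_answer) pairs the bot answers correctly."""
    if not test_pairs:
        return 0.0
    hits = sum(1 for question, expected in test_pairs if bot(question) == expected)
    return hits / len(test_pairs)


def quality_level(bot, exact_pairs, reworded_pairs, different_pairs):
    """Return 0, 1 or 2, or None when even Level 0 is not reached.

    Assumption: each level requires the previous one.
    """
    if correct_ratio(bot, exact_pairs) < 1.0:
        return None  # Level 0 needs every original FAQ question answered
    level = 0
    if correct_ratio(bot, reworded_pairs) >= 0.5:
        level = 1
        if correct_ratio(bot, different_pairs) >= 0.5:
            level = 2
    return level


# Toy stand-in for a generated chatter bot, used only to run the metric.
ANSWER = "Linux is an open-source computer operating system"


def toy_bot(question):
    key = question.upper().strip(" ?")
    if key == "WHAT IS LINUX" or "LINUX" in key.split():
        return ANSWER
    return ""


exact = [("What is Linux?", ANSWER)]
reworded = [("Tell me what Linux is", ANSWER), ("Linux-based systems?", ANSWER)]
different = [("Which kernel does my router run?", ANSWER)]
level = quality_level(toy_bot, exact, reworded, different)
print(level)
```
]]></preformat>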
      <p>It has been shown experimentally that all chatter bots
generated with PyGenbot are of at least Level 0 quality, and very
often reach Level 1 quality if the FAQ/keywords/glossary
set is accurately designed and well written. The demonstrator
at the WOA 2016 Workshop aims to confirm
this statement experimentally.</p>
    </sec>
    <sec id="sec-5">
      <title>V. CONCLUSION</title>
      <p>The proposed demonstrator, the PyGenbot program, is
capable of generating lexical knowledge bases for AIML-based
stateless chatter bots. This work illustrated the underlying
engineered, knowledge-unrestricted methodology, and also proposed a
quality assessment procedure that should objectively
demonstrate that restricted chatter bots can be generated starting from
arbitrary text files, independently of the language.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Weizenbaum</surname>
          </string-name>
          , “
          <article-title>ELIZA - a computer program for the study of natural language communication between man and machine</article-title>
          ,”
          <source>Communications of the ACM</source>
          , vol.
          <volume>9</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Roberts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Beber</surname>
          </string-name>
          ,
          <source>Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer</source>
          . New York: Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Wallace</surname>
          </string-name>
          , “
          <article-title>The Anatomy of A.L.I.C.E.</article-title>
          ,” in
          <source>Parsing the Turing Test</source>
          . New York: Springer,
          <year>2008</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mauldin</surname>
          </string-name>
          , “
          <article-title>Chatterbots, TinyMUDs, and the Turing test: Entering the Loebner Prize competition</article-title>
          ,” in
          <source>AAAI '94 Proceedings of the Twelfth National Conference on Artificial Intelligence</source>
          . AAAI Press
          ,
          <year>1994</year>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <source>Final Jeopardy: Man vs. Machine and the Quest to Know Everything</source>
          . New York: Houghton Mifflin Harcourt Publishing Company,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Gasperis</surname>
          </string-name>
          , “
          <article-title>Building an AIML chatter bot knowledge-base starting from a FAQ and a glossary</article-title>
          ,”
          <source>Journal of e-Learning and Knowledge Society-English Version</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>2</issue>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Gasperis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chiari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Florio</surname>
          </string-name>
          , “
          <article-title>AIML knowledge base construction from text corpora</article-title>
          ,” in
          <source>Artificial Intelligence, Evolutionary Computing and Metaheuristics</source>
          . Springer,
          <year>2013</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Gasperis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Costantini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Nazzicone</surname>
          </string-name>
          , “
          <article-title>DALI multi agent systems framework, doi 10.5281/zenodo.11042</article-title>
          ,”
          <source>DALI GitHub Software Repository</source>
          , July
          <year>2014</year>
          , DALI: http://github.com/AAAI-DISIM-UnivAQ/DALI.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Bordini</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Hübner</surname>
          </string-name>
          , “
          <article-title>BDI agent programming in AgentSpeak using Jason (tutorial paper)</article-title>
          ,” in
          <source>Computational Logic in Multi-Agent Systems, 6th International Workshop, CLIMA VI, Revised Selected and Invited Papers</source>
          , ser. Lecture Notes in Computer Science,
          <string-name>
            <given-names>F.</given-names>
            <surname>Toni</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          , Eds., vol.
          <volume>3900</volume>
          . Springer,
          <year>2006</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>