1


    PyGenbot for IoT: a demonstration of how to
  generate any restricted stateless AIML FAQ-chatter
                   bot from text files
   Giovanni De Gasperis, Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica, Università degli
                          Studi dell’Aquila, Italy, email: giovanni.degasperis@univaq.it


   Abstract—Internet of things applications (IoT) are required           with a human, mostly in a restricted knowledge domain. Since
to interact with the user in the best natural possible way;              ELIZA [1], text pattern recognition based chatter bots have
the voice based conversation is the ultimate human-machine               come a long way [2], [3] . A.L.I.C.E. is an handy crafted
interaction in terms of easy to use and requirements from the            chatter bot composed of about 50’000 lexical categories edited
user part, which also has the advantage for the user to interact         by a community of about 500 authors [3], aiming to be
hands free, non necessary watching a computer screen. Chatter
bots are conversational agents that simulate, and capable to
                                                                         unrestricted in its knowledge as a tentative to pass a limited
sustain, a conversation with a human. Technology do exists               implementation of the Turing test known as the Loebner
that allows to create a lexical knowledge base to be used by             Prize [4]. A.L.I.C.E.’s lexical knowledge base is described
a restricted chatter bot, i.e. expert on a specific dominion. This       using the Artificial Intelligence Markup Language, AIML [3].
work shows a methodology of restricted chatbot generation using          The lexical categories in AIML are defined by means of
Python program, called PyGenbot, that is capable to derive an            (pattern, template) tuples in a XML derived syntax:
AIML (Artificial Intelligence Markup Language) knowledge base
starting from a simple textual data set, including: a FAQ, a                 <category>
keywords, a stopwords, a multiwords and a glossary file set.                    <pattern>WHAT IS LINUX</pattern>
Any WOA attendee is welcome to supply arbitrary and simple                         <template>
formatted text files; then using PyGenbot, I will first edit the text               Linux is an open-source
input files needed to generate automatically the corresponding                      computer operating system
AIML knowledge base set that can be used with any standard                         </template>
AIML interpreter to implement the desired chatter bot, which                 </category>
can then be integrated into an IoT application.
                                                                           Also, different categories with a common semantic back-
                                                                         ground can be linked together by means of a SRAI connection:
                    I. INTRODUCTION
                                                                             <category>
   Internet of things applications (IoT) are required to interact               <pattern>WHAT IS GNU LINUX</pattern>
with the user in the best natural possible way; the voice                          <template>
based conversation is the ultimate human-machine interaction                        <srai>
in terms of easy to use and requirements from the user part,                         WHAT IS LINUX
which also has the advantage for the user to interact hands free,                   </srai>
not necessary watching a computer screen such as the scenario                      </template>
of a car driver. Many commercial solutions have come recently                </category>
from major smartphone corporations, mostly specialized on the
smartphone usage scenario: sending messages, handling the                   In this way a tree of SRAI connection can link all of the
calendar, fix an appointment, searching for a restaurant close           different lexical forms to their lemma. Using wildcards it is
by. Also, home appliance with similar capabilities have shown            also possible to filter out common words, isolating keywords.
up in the market. Most of the time, these are proprietary solu-             The reasoner, i.e. AIML interpreter, is a LISP program
tion, not readily available to developers, but strictly integrated       proposed by Richard Wallace [3] designed to search for the
into commercial products, or proposing a licensed cloud API.             best text pattern matching given the user input so to give the
The voice recognition phase is not in the focus of this work.            most appropriate answer during the written conversation. The
So I give it for granted that do exists a hardware device, or            IBM question answering system (QAS), known as Watson,
cloud API that provides the voice-to–text recognition task. The          won a challenge of the kind human-versus-machine [5], using
focus in this work is to generate a proper textual reply to a            brute force search algorithm on an unrestricted knowledge
text generated by the user by any means, typing or talking.              domain. However, in this work I concentrate to automate the
   I offer a general purpose tool that can be applied to                 generation of only stateless restricted chatter bots, given
any IoT application, with limited computational capability by            that their lexical knowledge can be expressed by means of
using AIML automatically generated chatter bots. They are                a combination of a frequently asked question/glossary set,
conversational agents that simulate and sustain a conversation           keywords, multiwords and stopwords lists. The input data


                                                                    14
                                                                                                                                                        2


                                                                                User INPUT     Stop words filter   Keywords match   Question   Answer

                    Stop-words
     FAQ
                          SET                    PyGenBot
                                                                          Fig. 2. Ideal textual data processing from the user input to the right answer.
                                keywords
                                                    Answering
              INPUT DATA                            Chatterbot
                                  SET               Generator
                                                    Algorithm                The glossary item definition can be enriched using the
                                                                          free online dictionary 1 or by using the Python NLTK 2 .
                    Multi-words                                           Glossary items should cover the most significant terms of
   GLOSSARY
                          SET
                                                                          the restricted knowledge domain about which the final chatter
                                                     AIML                 bot is designed to be expert. Keywords should be selected
                                                   Knowledge
                                                     Base
                                                                          from the text of the questions in order to optimize the pattern
      Wictionary                                                          matching the user input. The stopwords list are just the most
       lookup
                                                                          common words of the language, i.e. elements of structural and
                                                                          connective lexicon. The input set does indeed determine:
                                                                             • the language (English , Italian, German, etc..)
Fig. 1. Input data set and information workflow of the AIML generation       • the restricted knowledge domain
process.                                                                     • the vocabulary
                                                                             The generated chatter bot is considered to be stateless since
                                                                          it can only demonstrate a purely reactive behavior, given
set can be arbitrary, so the methodology is unrestricted,                 a textual stimulation. A more sophisticated prototype could
but not so for the final products, i.e. the FAQ chatter bots.             be built upon adopting proactive multi-agent system logic
Also, the overall method does not depend on the language,                 frameworks like DALI [8] or AgenSpeak/Jason [9] middle
so multilingual IoT applications can be readily designed with             layer. By the way, a stateless chatter bot is what is needed
parallel free text corpora for each language. Here is shown how           in the majority of IoT applications were an appliance need to
to apply the PyGenbot Python program [6], [7] designed to au-             give a correct answer to a user or to set up a working parameter
tomatically generate Artificial Intelligence Markup Language              to accomplish a user given task.
(AIML) knowledge bases.
                                                                                       III. THE PYGENBOT PROGRAM
                    II.    INPUT DATA SET                                    PyGenbot is a Python program with about 750 lines of
                                                                          code that takes as input the text files set and produces an
   The set of textual data in input is defined as the following:          AIML file set, ready to be uploaded to any AIML interpreter,
   1) a frequently asked questions (FAQ) file F                           which finally implements the actual stateless chatter bot able
   2) a glossary file G                                                   to interact with the user. The usage scenario is analog to use a
   3) a keywords list file K                                              compiler to produce machine language (AIML) from high level
   4) a stop words list file S                                            source files (FAQ, keywords, multiwords, stop words), even if
   5) a multiwords list file M                                            in this case the underlying natural language is not context free
   All files are simple free text documents, with some basic              as programming languages.
structure in order to distinguish text of the questions from text            The algorithm as been published in [6], [7]. The reference
of the answers, or glossary items and their respective definition.        idea is shown in Fig. 1, by which the construction of this kind
Keywords, multiwords and stop words are just files containing             of restricted chatter bots is inspired.
a word on each line, or separated by comma. The FAQ file F                   PyGenbot generates three set of AIML files:
is completely defined by the chatter bot designer. It contains               • the FAQ/keywords/multiwords categories
questions and answers in the simple form:                                    • the glossary categories and “WHAT IS *” question
                                                                                 patterns
Q <question> | {Q <alternative version>}                                     • the stopwords filtering categories
A <answer > | {A <alternative version>}                                      The FAQ/keywords/multiwords AIML set can grow to sev-
                                                                          eral thousands of categories, so the generation algorithm
   Alternative versions of the question are useful to enlarge the
                                                                          needs to be tuned by the maximum number of categories
possibility to intercept the user input; alternative answers are
                                                                          each AIML file can contain, given the complexity of the
great to increase the variability of the answers given by the
                                                                          FAQ/keywords/multiwords text file set and the final AIML
chatter bot in response to the user input. The multiwords list
                                                                          interpreter tool adopted. The output AIML 1.0 file set is
is very important to isolate conceptual entities that uses more
                                                                          then ready to be used in a AIML hosting web services, like
than a words, as for example “operating system” or “credit
                                                                          http://pandorabots.com .
card”.
   The input data set and the information workflow can be                   1 http://en.wiktionary.org last accessed June 2016

summarized by the diagram in Fig.1.                                         2 http://nltk.org last accessed June 2016


                                                                     15
                                                                                  3


                IV. QUALITY ASSESMENT
   It is necessary to introduce a measurable metric of the
correctness of the final chatter bot. As already proposed in
[7], a three level metric can be adopted:
   • Level 0: the resulting chatter bot does give a correct
       answer for all the questions included in the FAQ, with
       exact text matching
   • Level 1: the resulting chatter bot does give at least 50%
       of correct answers, not using the exact wording of the
       original FAQ questions text, but with the same semantic
   • Level 2: the resulting chatter bot does give at least
       50% of correct answers using questions with completely
       different wording, but same semantic of the original FAQ
       questions
   It has been experimentally proven that all chatter bots gen-
erated with PyGenbot are at least of Level 0 quality, and very
often can reach Level 1 quality if the FAQ/keywords/glossary
set is accurately designed and well written. The demonstrator
at the WOA 2016 Workshop is aimed to confirm experimen-
tally this statement.
                       V. CONCLUSION
   The proposed demonstrator, the PyGenbot program, is ca-
pable of generating lexical knowledge bases for AIML based
stateless chatter bots. This work illustrated the underline engi-
neered knowledge-unrestricted methodology, also proposing a
quality assessment procedure that should objectively demon-
strate that restricted chatter bot can be generated starting from
arbitrary text files, independent from the language.
                              R EFERENCES
[1] J. Weizenbaum, “Eliza a computer program for the study of natural
    language communication between man and machine,” Communications
    of the ACM, vol. 9, no. 1, pp. 36–45, 1966.
[2] R. Epstein, G. Roberts, and G. Beber, Parsing the turing test : philosoph-
    ical and methodological issues in the quest for the thinking computer.
    New York: Springer, 2008.
[3] R. S. Wallace, The Anatomy of A.L.I.C.E, ser. Parsing the Turing Test.
    New York: Springer, 2008, pp. 181–210.
[4] M. Mauldin, Chatterbots, tinymuds, and the turing test: Entering the
    loebner prize competition, ser. AAAI ’94 Proceedings of the twelfth
    national conference on Artificial intelligence. AAAI Press, 1994, vol. 1,
    pp. 16–21.
[5] S. Baker, Final Jeopardy: Man vs. Machine and the Quest to Know
    Everything. New York: Houghton Mifflin Harcourt Publishing Company,
    2011.
[6] G. De Gasperis, “Building an aiml chatter bot knowledge-base starting
    from a faq and a glossary,” Journal of e-Learning and Knowledge
    Society-English Version, vol. 6, no. 2, 2010.
[7] G. De Gasperis, I. Chiari, and N. Florio, “AIML knowledge base
    construction from text corpora,” in Artificial intelligence, evolutionary
    computing and metaheuristics. Springer, 2013, pp. 287–318.
[8] G. De Gasperis, S. Costantini, and G. Nazzicone, “Dali
    multi agent systems framework, doi 10.5281/zenodo.11042,”
    DALI GitHub Software Repository, July 2014, DALI: http:
    //github.com/AAAI-DISIM-UnivAQ/DALI.
[9] R. H. Bordini and J. F. Hübner, “BDI agent programming in agentspeak
    using Jason (tutorial paper),” in Computational Logic in Multi-Agent
    Systems, 6th International Workshop, CLIMA VI, Revised Selected and
    Invited Papers, ser. Lecture Notes in Computer Science, F. Toni and
    P. Torroni, Eds., vol. 3900. Springer, 2006, pp. 143–164.


                                                                             16