A Python-based Assistant Agent able to Interact with Natural Language

Fabio Longo, Corrado Santoro
University of Catania, Department of Mathematics and Informatics
Viale Andrea Doria, 6, 95125 Catania, Italy
E-mail: flongo@policlinico.unict.it, santoro@dmi.unict.it

Abstract—This paper describes the software architecture and functionalities of an assistant agent, developed by the authors, that is able to interact with the user through natural language. The agent is implemented by means of PROFETA, a Python-based BDI engine developed within the authors' research group. The agent is composed of two parts: (i) the speech-to-text and text-to-speech services, and (ii) the reasoning engine. For the former part, the agent exploits cloud services (in particular those provided by Microsoft Bing and IBM Watson); to this aim, a flexible software architecture is designed in order to connect first-class entities of PROFETA (i.e. sensors and actions) to the cloud world. The reasoning engine is instead designed by means of the declarative language provided by PROFETA: utterances said by the user become PROFETA beliefs that can, in turn, trigger reasoning rules. In order to show the effectiveness of the solution, a case study of speech-based interaction to browse Wikipedia is presented.

I. INTRODUCTION

With the advent of smartphones, the technological advances in artificial intelligence have made possible the implementation of speech-based assistant agents, like Siri or the Google Assistant. Predicted in many science fiction movies, speech assistants are a kind of personal agent that represents the natural evolution of the assistant agents introduced in the '90s in application software to help the user in her/his day-to-day activities [7], [18], [17], [6], [16]. But while that kind of assistant was strongly specialised for the activities of the associated application, the aim of speech assistants is more ambitious: to help the user in any kind of day-to-day activity. We indeed expect such entities to be able to interpret utterances like "Play some jazz", "Order coffee pods", "Book me a flight to London", etc., and to execute the associated tasks.

All of the exemplified activities, however, are not simple "commands" but, in general, could require a form of—more or less complex—interaction; as an example, to the first phrase above the assistant could answer something like "I've found some tracks by Chet Baker, are they ok for you?", and the user could reply "No, I prefer Thelonious Monk"; placing an order could imply knowing the price and delivery time of some offers, and the user (always by means of speech-based interaction) could be asked to make a selection. The point is that we, as users, expect that speech-based assistants can establish a meaningful dialogue with us, even if we know that the peer is not a human being but an artificial system with obviously limited capabilities and intelligence.

The technological aspects behind such a form of assistance are multiple: not only is a good Natural Language Processing (NLP) engine needed, but a reasoner tool is also mandatory, one that is able to understand the context of the discussion and interpret the meaning of sentences accordingly. The NLP engine has the task of performing speech-to-text processing and, having obtained the text, applying a syntax analyser in order to extract the meaningful parts of the phrase and classify the lemmas. Once the lemmas have been extracted, they have to be interpreted in order to catch the meaning and then execute the proper actions; this operation is usually performed by using tools that implement forms of—more or less flexible—string pattern matching. But pattern matching alone usually does not suffice for a powerful natural language interaction; indeed the meaning of utterances is also strongly based on the evolution of dialogues, therefore what happened in the previous interactions of the dialogue is very important for a correct interpretation: in other words, pattern matching must be integrated with state or knowledge data. In this sense, what is needed is a tool able to support logic-/knowledge-based programming and, among such tools, the PROFETA [12], [13], [11], [10] programming platform appears an interesting solution since it provides all the cited characteristics.
In this context, this paper presents the software architecture of a personal assistant agent based on the PROFETA programming platform and able to interact with a human by using speech and natural language. The solution is based on a software architecture that, by means of the integration of cloud services with PROFETA, is able to perform speech-to-text and text-to-speech operations. Interpreted phrases then become beliefs and can thus be used to trigger production rules that drive the actions of the agent. As a case study, the paper reports the application of the software architecture to the implementation of an agent able to query Wikipedia on the basis of the questions asked by a human user.

The paper is structured as follows. Section II provides an overview of PROFETA. Section III describes the architecture and functionalities of the speech-to-text/text-to-speech interface. Section IV presents the case study. Section V concludes the paper.

II. OVERVIEW OF PROFETA

PROFETA¹ [12] is a Python platform for programming the behaviour of an autonomous system (agent or robot) based on the Belief-Desire-Intention paradigm [8], [15]. It allows a programmer to implement the behaviour of an agent by using two basic concepts: beliefs and plans.

Beliefs represent the data part of an agent program, i.e. the knowledge of the agent, and are expressed by using logic predicates, i.e. atomic formulae with ground terms; a belief can be generated on the basis of data coming from the agent's reference environment or as a result of a reasoning process.

Plans represent the computational part of an agent program and are expressed as production rules, triggered by a specific event, which can be the assertion/retraction of a belief or the request to achieve a specific goal; the result of a plan is a sequence of actions that represents what the agent has to do as a response to that event. Plans are written using Python statements that are, however, interpreted by the PROFETA engine as declarative production rules. The syntax is inspired by AgentSpeak [9]. A plan includes a head and a body; the head is composed of the triggering event plus a condition, i.e. a predicate on the knowledge that must be met for the plan to be triggered; the body is the list of actions that must be executed when the plan is triggered.

The interaction of a PROFETA agent with the environment is performed by using two types of software entities. Actions, already cited in the context of plans, are responsible for acting on the environment (as the name suggests) and are executed as a result of the triggering of a plan. Sensors are instead software entities with the task of polling or receiving data from the environment or other external entities; a sensor, having gathered a useful piece of information, can generate a belief, thus transforming data into agent knowledge; as a result, a belief generated by a sensor can enrich the agent's knowledge and/or trigger a plan, thus provoking the reaction of the agent.

The PROFETA engine includes a scheduler that runs a loop continuously executing the following activities:
1) Sensor Activation. All defined sensors are activated and their polling code is executed; if a sensor generates a belief, it is processed and placed into an event queue.
2) Event Handling. An event is extracted from the event queue and all applicable plans are searched for.
3) Plan Selection. For each applicable plan, the condition is tested and, if true, the plan is selected for execution.
4) Plan Execution. The actions of the selected plan are executed sequentially.

For more details about PROFETA internals, working scheme, syntax and semantics, the reader can consult the relevant bibliography [12], [13], [11], [10].

¹http://github.com/corradosantoro/profeta
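To give a concrete flavour of the notation, the following is a minimal sketch of a PROFETA agent program. The import paths (profeta.lib, profeta.main) and the start()/run() calls reflect the public PROFETA distribution, but are assumptions that may vary across versions; the rule syntax matches the one used in the listing of Section IV.

    # Minimal sketch of a PROFETA agent program; module paths and the
    # start()/run() calls are assumptions based on the public distribution.
    from profeta.lib import Belief, Action
    from profeta.main import PROFETA

    class doorbell(Belief):
        # a logic predicate, e.g. doorbell("front")
        pass

    class open_door(Action):
        def execute(self):
            # how the bound terms are retrieved here depends on the
            # PROFETA version; side effects on the environment go here
            print("opening the door")

    PROFETA.start()

    # A plan: the head is the triggering event (assertion of doorbell()),
    # the body is the list of actions executed in sequence.
    +doorbell("X") >> [open_door("X")]

    PROFETA.run()

The unary "+" and the ">>" operator are overloaded by PROFETA so that ordinary Python statements are read by the engine as declarative production rules.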
III. THE STT/TTS INTERFACE

Implementing an agent able to interact with natural language implies including and using speech-to-text (STT) and text-to-speech (TTS) services. Although many libraries exist (like Sphinx [1], for example), cloud-based services appear a more interesting solution, mainly because they are continuously improved and upgraded, and made more and more precise. Indeed, as for STT, we made some tests using both the Sphinx library and the Microsoft Bing Speech service and, as a result, we obtained greater accuracy with the latter solution. On the other hand, for TTS, we experimented with the IBM Watson text-to-speech service [3], which gives a more natural feeling of human voice than other Python-based engines like eSpeak [2], SAPI [4], etc.

In order to include STT and TTS services within PROFETA, the proper abstractions provided by the tool must be used: the STT is implemented as a PROFETA sensor that we called hearer, while TTS is encapsulated inside an action called say.

Fig. 1. The "hearer" sensor and "say" action. (The utterance "play some jazz" is turned by the hearer sensor into the belief heard("play some jazz"), while the action say("en","hello","how are you?") produces the spoken sentence "hello, how are you?".)

As Figure 1 shows, the hearer has the task of sampling the microphone, sending the audio data to the STT cloud service, and gathering the response as a string; as a consequence, a heard() belief is asserted whose term is the interpreted string. This assertion can, in turn, trigger the proper rules according to the implemented agent program.

The say action is used to let the agent pronounce a sentence (see Figure 1). The parameters of the action are strings: the first string is the language, while the other parameters represent the various parts of the sentence itself; the action implementation concatenates such parts and sends the resulting string to the TTS cloud service; the reply will be the audio samples of the recited phrase, which are sent to the audio device for playing.

From the implementation point of view, a sensor in PROFETA is simply made of a class that extends the Sensor base class, overriding the sense() method; this method must implement the code for data sampling, returning the relevant belief (or None if no data has been sampled). In a similar way, a PROFETA action is implemented as a subclass of the Action base class, overriding the execute() method with the proper code. Given this, implementing the STT and TTS services appears a straightforward task; however, some aspects must be taken into account. The first aspect regards performance: the interaction with cloud services could introduce latencies that can affect the performance of the overall system; indeed, in PROFETA, the invocation of the sense() and execute() methods is made synchronously within the main interpretation loop of the agent program: if such methods experience delays (which are indeed unavoidable when a network transaction is performed), the overall performance is affected. The second aspect is related to the generality of the solution: even if, in our implementation, we decided to use Microsoft Bing for STT and IBM Watson for TTS, other engines are available on the web; therefore a software architecture that lets the programmer easily change the desired engine is really opportune. Such aspects are dealt with in the next subsections.
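Before detailing the two subsections, the general shape of the two extension points can be sketched as follows. Only Sensor, Action, sense() and execute() are the documented PROFETA hooks; the helper functions are hypothetical placeholders standing for the cloud interactions detailed below.

    # Sketch of the two PROFETA extension points; helper functions
    # (record_and_transcribe, synthesize_and_play) are hypothetical.
    from profeta.lib import Belief, Sensor, Action

    class heard(Belief):
        pass

    class hearer(Sensor):
        def sense(self):
            text = record_and_transcribe()   # hypothetical STT helper
            if text is None:
                return None                  # nothing sampled: no belief
            return heard(text)               # sampled data becomes knowledge

    class say(Action):
        def execute(self):
            # called when a triggered plan contains say(...); parameter
            # retrieval and the actual TTS playback are elided here
            synthesize_and_play("hello")     # hypothetical TTS helper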
A. The STT Interface

The STT interface is made of three basic classes and a specialised class; the relevant UML diagram is reported in Figure 2.

Fig. 2. The STT Interface. (UML class diagram: hearer extends PROFETA's Sensor and holds references to AudioSource, with open(), close() and read(): audiodata, and to the abstract class STT, with connect() and translate(audiodata): string, specialised by BingSTTService.)

The principal entity is hearer, which is a subclass of the basic PROFETA class Sensor. An instance of hearer contains a reference to two other objects: AudioSource and STT; the former has the objective of capturing audio data from the microphone, while the latter implements the network transaction with a cloud STT service. STT is defined as an abstract class: its methods are empty and their implementation is left to a derived class that includes the code for the specific cloud service to be used. In our implementation, we subclassed STT as BingSTT, whose methods implement the interaction with the Microsoft Bing service.

The task of hearer is coded in its sense() method. First it invokes the read() method of AudioSource to retrieve audio data samples; this sampling is performed by entering a loop that listens for incoming sounds from the microphone, properly filtering ambient noise, until some source of sound is perceived; then the audio stream is captured until silence is identified again and the relevant data are returned as the result of read(). Subsequently, the hearer activates the STT by invoking the translate() method; this method receives the audio stream as input and is expected to return the translated string, or None if translation is impossible. Such a result (if not None) is converted to lowercase² and the heard() belief is generated as the result value of the sense() method.

²This is required because, for syntax reasons, in PROFETA string constants must be in lowercase.

For the performance reasons already cited, the task described above must be executed asynchronously with respect to the main PROFETA loop. This objective is achieved by encapsulating the hearer inside an AsyncSensorProxy, a library class provided by PROFETA which has the specific task of making a sensor asynchronous [14].
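A condensed sketch of these classes is the following. The class and method names mirror Figure 2, while the audio capture and the HTTP transaction with the cloud service are reduced to placeholder comments (the paper does not report the authors' actual code).

    # Condensed sketch of the STT interface of Figure 2.
    from abc import ABC, abstractmethod
    from profeta.lib import Belief, Sensor

    class heard(Belief):
        pass

    class STT(ABC):
        # abstract cloud service: a concrete engine provides the code
        @abstractmethod
        def connect(self): ...
        @abstractmethod
        def translate(self, audiodata): ...   # returns a string, or None

    class BingSTT(STT):
        def connect(self):
            pass   # open the session with the Microsoft Bing Speech service
        def translate(self, audiodata):
            pass   # send audiodata to the service; return text, None on failure

    class AudioSource:
        def read(self):
            pass   # wait for sound, record until silence, return the samples

    class hearer(Sensor):
        def __init__(self, audiosource, stt):
            super().__init__()   # assumption: base __init__ takes no arguments
            self.audiosource = audiosource
            self.stt = stt

        def sense(self):
            audiodata = self.audiosource.read()
            text = self.stt.translate(audiodata)
            if text is None:
                return None
            return heard(text.lower())   # PROFETA string constants are lowercase

An instance would then be wrapped, e.g. AsyncSensorProxy(hearer(AudioSource(), BingSTT())), before being registered with the engine; the registration call is omitted since it depends on the PROFETA version.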
B. The TTS Interface

Text-to-speech is performed by a specific PROFETA action. Also in this case, it is preferable to have an asynchronous execution of the cloud interaction with respect to the main PROFETA loop. This is achieved by exploiting the AsyncAction base class provided by the PROFETA library, which is, in turn, derived from Action.

Fig. 3. The TTS Interface. (UML class diagram: say extends PROFETA's AsyncAction, overriding create_service() and execute(), and holds a reference to the abstract class TTS, with connect() and translate(string), specialised by WatsonTTS; TTS holds a reference to AudioPlayer, with open() and play(audiodata).)

The TTS interface, whose class diagram is reported in Figure 3, is composed of the following classes: say, TTS, AudioPlayer and WatsonTTS. Class say is the async action which is directly invoked by the PROFETA program and coordinates all the text-to-speech activities. TTS is an abstract class that represents the cloud service and (in a way similar to STT) must be subclassed with the implementation of the code for the interaction with a specific TTS service; in our case, this is done by the class WatsonTTS, which includes the code for the interaction with IBM Watson. Class TTS also contains a reference to AudioPlayer, which has the task of playing to the audio device the audio stream returned by the TTS service.

The working scheme of the TTS interface is based on a specific usage protocol that the programmer has to respect; in particular, in the say class, two methods must be overridden: create_service() and execute(). The former method is called when the class is instantiated and has the specific task of creating the TTS object and performing, in turn, the initial connection to the cloud service. The latter method is called when the action is explicitly invoked within a PROFETA program and contains the code that retrieves the parameters, composes the string to say and invokes the translate() method of TTS, which concretely executes text-to-speech and plays the resulting audio stream.
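The protocol can be sketched as follows. Class and method names mirror Figure 3; the import of AsyncAction from profeta.lib is an assumption (mirroring Action), and the network and audio details are placeholder comments.

    # Condensed sketch of the TTS interface of Figure 3, following the
    # create_service()/execute() protocol described above.
    from profeta.lib import AsyncAction   # assumption: exported like Action

    class AudioPlayer:
        def play(self, audiodata):
            pass   # send the synthesised samples to the audio device

    class TTS:
        # abstract cloud service; a concrete engine subclasses it
        def __init__(self, audioplayer):
            self.audioplayer = audioplayer
        def connect(self):
            raise NotImplementedError
        def translate(self, text):
            raise NotImplementedError

    class WatsonTTS(TTS):
        def connect(self):
            pass   # open the session with the IBM Watson TTS service
        def translate(self, text):
            audiodata = b""   # placeholder: request the synthesis of `text`
            self.audioplayer.play(audiodata)

    class say(AsyncAction):
        def create_service(self):
            # called once, at instantiation: create the TTS object and connect
            self.tts = WatsonTTS(AudioPlayer())
            self.tts.connect()

        def execute(self):
            # called when a plan invokes say(...): the first parameter is the
            # language, the others are the sentence parts (retrieval elided)
            parts = ["hello", "how are you?"]   # placeholder parameters
            self.tts.translate(" ".join(parts))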
IV. THE WIKIPEDIA AGENT

The agent we developed as a proof of concept for our STT/TTS interface is a simple assistant able to browse Wikipedia by means of natural language. We called the assistant Laura, and this is the name the assistant itself responds to in its behaviour.

Fig. 5. The basic behaviour of Laura. (State machine with states init, wiki and language: "laura" moves from init to wiki, "change language" from wiki to language, "language set" back to wiki, and a 30-second timeout from wiki back to init.)

Laura is implemented as a finite-state machine, sketched in Figure 5, in which each state (which is indeed a macro-state) represents a condition in Laura's behaviour corresponding to certain dialogue abilities. The basic working scheme is the following: in the initial state, init, the agent waits for her name in order to be "woken up"; after that, Laura enters the wiki state, in which she is able to hear the terms to be searched for in Wikipedia. In the wiki state, Laura is able to respond to the following phrases:

• "search terms": the specific terms are sent to Wikipedia and, when the result page is returned, the summary data is recited using text-to-speech;
• "change language": it makes Laura enter the language state, letting the user set a new language for both the Wikipedia search and text-to-speech;
• any other terms: the list of possible options for the said terms is searched for in Wikipedia and such a list is recited by Laura; if the user wants a specific term, s/he can ask for it using the phrase "search terms".

The wiki state is abandoned on the basis of two events: a timeout of 30 seconds of inactivity (in this case the state reached is once again init), or the user saying "change language"; in the latter case, Laura enters the language state, asking the user for the new language desired.

To support the cited activities, the following beliefs are used (a sketch of the processing that generates some of them follows the list):

• heard(terms). The belief already described in Section III, used as the output of the hearer sensor. It is defined as a reactor, i.e. a belief that can only trigger PROFETA plans but does not enter the knowledge base³.
• search(terms). A reactor generated by processing the data heard by the hearer: if the sentence said includes an explicit search request, this reactor is asserted.
• generic_phrase(terms). Like the previous one, it is a reactor generated by processing the data heard by the hearer, but it is generated if the sentence said does not include a specific search request.
• language(lang). A belief that stores, in the knowledge base, the setting of the current language.
• timeout(). A reactor used to signal the 30 seconds of inactivity.

³See [12] for details about the kinds of beliefs supported by PROFETA.
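The paper does not report the code that performs this processing; a plausible sketch based on simple pattern matching could be the following, where the Reactor base class and the self.assert_belief() call are assumptions about the PROFETA API (see [12]).

    # Sketch of an action that classifies the heard sentence and asserts
    # the proper reactor; Reactor and assert_belief() are assumptions.
    from profeta.lib import Action, Reactor

    class search(Reactor): pass
    class generic_phrase(Reactor): pass

    class terms_to_belief(Action):
        def execute(self):
            sentence = "search palermo"   # placeholder: the term bound to "X"
            if sentence.startswith("search "):
                # explicit search request: assert the search() reactor
                self.assert_belief(search(sentence[len("search "):]))
            else:
                # any other sentence becomes a generic phrase
                self.assert_belief(generic_phrase(sentence))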
The complete PROFETA program that controls Laura's behaviour is reported in Figure 4 (the code of the actions is not reported for brevity reasons, but their role is described in the text below).

1  stage("main")
2  +start() >> [ +language("en"), show("starting...") ]
3  +heard("laura") >> [ random_greetings("GreetMessage"), say("en", "GreetMessage"), set_stage("wiki") ]
4
5  stage("wiki")
6  +heard("X") >> [ terms_to_belief("X") ]
7  +generic_phrase("change language") >> [ set_stage("language") ]
8  +generic_phrase("X") / language("L") >> [ wiki_search("L", "X", "Y"),
9                                            say("L", " I have found the following options ", "Y") ]
10 +search("X") / language("L") >> [ wiki_say("L", "X") ]
11 +timeout() >> [ set_stage("main") ]
12
13 stage("language")
14 +start() >> [ say("en", "what language do you desire?") ]
15 +heard("english") >> [ +language("en"), say("en", " I've set english"), set_stage("wiki") ]
16 +heard("italian") >> [ +language("it"), say("en", " I've set italian"), set_stage("wiki") ]
17 +heard("french") >> [ +language("fr"), say("en", " I've set french"), set_stage("wiki") ]
18 +heard("X") >> [ say("it", " I've not understood the language that you desire or I'm not able to support it") ]

Fig. 4. The listing of Laura

The macro-states of the finite-state machine of Figure 5 are clearly identified since they are called stages in PROFETA and are used to specify that certain plans are valid (i.e. triggerable) in that state. In the main stage, the program waits for the assertion of the heard("laura") reactor (this happens when the user pronounces "Laura") and, on the occurrence of such an event (line 3), it enters the wiki stage. In such a reaction, a greeting message is recited: this message is generated by the action random_greetings(), which picks a random welcome string (from a predefined set) and binds it to the variable GreetMessage; the use of a random greeting message is to avoid a repetitive behaviour from Laura that, in the long term, could bore the user.

In the wiki stage, the arrival of a heard() belief causes the plan in line 6 to be triggered: the consequence is the call of the action terms_to_belief(), which has the basic task of interpreting the command according to the cases listed above and depicted in Figure 5. If "search terms" is pronounced (e.g. "search Palermo"), the terms_to_belief() action asserts the search() reactor, using the terms as parameters; this causes the plan in line 10 to be executed: first the current language is retrieved, then the action wiki_say() is executed, which searches for the terms in Wikipedia and then recites the relevant summary text. If "change language" is pronounced, the plan in line 7 is triggered and the agent enters the language stage, asking the user for the new language to switch to; when the new language is successfully selected, the agent returns to the wiki stage (lines 15–17). If other terms are said, the plan in line 8 is executed, which causes a generic search in Wikipedia for all the pages related to the terms themselves: the list of options is then recited by Laura.

V. CONCLUSIONS

This paper has described the software architecture and the working scheme of an assistant agent able to interact with the user with natural language. The basic aspects of the solution are the use of the PROFETA BDI tool as the execution platform and the organisation in a flexible software architecture in order to exploit cloud computing for speech-to-text and text-to-speech services.

The assistant implemented, called Laura, has the objective of helping the user in browsing Wikipedia with speech-based interaction.
It served as a proof of concept to assess the validity of the software architecture and the applicability of PROFETA to such kinds of contexts.

Starting from this experience, we plan, in future work, to improve the understanding abilities of Laura by including a library to parse natural language sentences (like NLTK [5]), also translating the parsed terms into proper beliefs better representing the predicates of a common knowledge; the objective is to have an artificial system which shows a rational behaviour that can also be adopted in all contexts needing a form of specific user assistance.

REFERENCES

[1] CMUSphinx: Open-source speech recognition toolkit. [Online]. Available: http://cmusphinx.github.io/
[2] eSpeak speech synthesizer. [Online]. Available: http://espeak.sourceforge.net/
[3] IBM Watson text-to-speech service. [Online]. Available: http://www.ibm.com/watson/services/text-to-speech/
[4] Microsoft Speech Application Program Interface. [Online]. Available: http://en.wikipedia.org/wiki/Microsoft_Speech_API
[5] Natural Language Toolkit. [Online]. Available: http://www.nltk.org/
[6] M. Bombara, D. Calì, and C. Santoro, "KORE: A multi-agent system to assist museum visitors," in WOA 2003: Dagli Oggetti agli Agenti. 4th AI*IA/TABOO Joint Workshop "From Objects to Agents": Intelligent Systems and Pervasive Computing, 10-11 September 2003, Villasimius, CA, Italy, 2003, pp. 175-178.
[7] J. M. Bradshaw, Ed., Software Agents. AAAI Press/The MIT Press, 1997.
[8] M. E. Bratman, Intentions, Plans and Practical Reason. Harvard University Press, 1987.
[9] M. d'Inverno and M. Luck, "Engineering AgentSpeak(L): A formal computational model," Journal of Logic and Computation, vol. 8, no. 3, pp. 233-260, 1998. [Online]. Available: http://eprints.ecs.soton.ac.uk/3846/
[10] L. Fichera, D. Marletta, V. Nicosia, and C. Santoro, "Flexible robot strategy design using belief-desire-intention model," in Research and Education in Robotics - EUROBOT 2010 - International Conference, Rapperswil-Jona, Switzerland, May 27-30, 2010, Revised Selected Papers, 2010, pp. 57-71.
[11] ——, "A methodology to extend imperative languages with AgentSpeak declarative constructs," in Proceedings of the 11th WOA 2010 Workshop, Dagli Oggetti Agli Agenti, Rimini, Italy, September 5-7, 2010, 2010.
[12] L. Fichera, F. Messina, G. Pappalardo, and C. Santoro, "A Python framework for programming autonomous robots using a declarative approach," Sci. Comput. Program., vol. 139, pp. 36-55, 2017.
[13] G. Fortino, W. Russo, and C. Santoro, "Translating statecharts-based into BDI agents: The DSC/PROFETA case," in Multiagent System Technologies - 11th German Conference, MATES 2013, Koblenz, Germany, September 16-20, 2013, Proceedings, 2013, pp. 264-277.
[14] F. Messina, G. Pappalardo, and C. Santoro, "Integrating cloud services in behaviour programming for autonomous robots," in Algorithms and Architectures for Parallel Processing - 13th International Conference, ICA3PP 2013, Vietri sul Mare, Italy, December 18-20, 2013, Proceedings, Part II, pp. 295-302.
[15] A. Rao and M. Georgeff, "BDI agents: From theory to practice," in Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), San Francisco, CA, 1995, pp. 312-319.
[16] A. D. Stefano, G. Pappalardo, C. Santoro, and E. Tramontana, "A multi-agent reflective architecture for user assistance and its application to e-commerce," in Cooperative Information Agents VI, 6th International Workshop, CIA 2002, Madrid, Spain, September 18-20, 2002, Proceedings, 2002, pp. 90-103.
[17] A. D. Stefano and C. Santoro, "NetChaser: Agent support for personal mobility," IEEE Internet Computing, vol. 4, no. 2, pp. 74-79, 2000.
[18] G. Weiss, Ed., Multiagent Systems. The MIT Press, April 1999.