=Paper=
{{Paper
|id=Vol-2215/paper22
|storemode=property
|title=A Python-based Assistant Agent able to Interact with Natural Language
|pdfUrl=https://ceur-ws.org/Vol-2215/paper_22.pdf
|volume=Vol-2215
|authors=Fabio Longo ,Corrado Santoro
|dblpUrl=https://dblp.org/rec/conf/woa/LongoS18
}}
==A Python-based Assistant Agent able to Interact with Natural Language ==
A Python-based Assistant Agent able to Interact with Natural Language

Fabio Longo, Corrado Santoro
University of Catania
Department of Mathematics and Informatics
Viale Andrea Doria, 6
95125 - Catania, ITALY
Email: flongo@policlinico.unict.it, santoro@dmi.unict.it
Abstract—This paper describes the software architecture and the functionalities of an assistant agent, developed by the authors, able to interact with the user through natural language. The agent is implemented by means of PROFETA, a Python-based BDI engine developed within the authors' research group. The agent is composed of two parts: (i) the speech-to-text and text-to-speech services, and (ii) the reasoning engine. As for the former part, the agent exploits Cloud services (in particular those provided by Microsoft Bing and IBM Watson); to this aim, a flexible software architecture is designed in order to connect first-class entities of PROFETA (i.e. sensors and actions) to the cloud world. The reasoning engine is instead designed by means of the declarative language provided by PROFETA: utterances said by the user become PROFETA beliefs that can, in turn, trigger reasoning rules. In order to show the effectiveness of the solution, a case study of speech-based interaction to browse Wikipedia is presented.

I. INTRODUCTION

With the advent of smartphones, the technological advances in artificial intelligence have made possible the implementation of speech-based assistant agents, like Siri or the Google Assistant. Predicted in many science fiction movies, speech assistants are a kind of personal agent that represents the natural evolution of the assistant agents introduced in the '90s in application software to help the user in her/his day-to-day activities [7], [18], [17], [6], [16]. But while such assistants were strongly specialised for the activities proper of the associated application, the aim of speech assistants is more ambitious: to help the user in any kind of day-to-day activity. We indeed expect such entities to be able to interpret utterances like "Play some jazz", "Order coffee pods", "Book me a flight to London", etc., and to execute the associated tasks.

All of the exemplified activities, however, are not simple "commands" but, in general, could require a form of more or less complex interaction; as an example, to the first phrase above the assistant could answer something like "I've found some tracks by Chet Baker, are they ok for you?", and the user could reply "No, I prefer Thelonious Monk"; placing an order could imply knowing the price and delivery time of some offers, and the user (always by means of speech-based interaction) could be asked to make a selection. The point is that we, as users, expect speech-based assistants to establish a meaningful dialogue with us, even if we know that the peer is not a human being but an artificial system with obviously limited capabilities and intelligence.

The technological aspects behind such a form of assistance are multiple: not only is a good Natural Language Processing (NLP) engine needed, but a reasoner tool is also mandatory, one able to understand the context of the discussion and to interpret the meaning of sentences accordingly. The NLP engine has the task of performing speech-to-text processing and, having obtained the text, applying a syntax analyser in order to extract the meaningful parts of the phrase and classify the lemmas. Once the lemmas have been extracted, they have to be interpreted in order to catch the meaning and then execute the proper actions; this operation is usually performed by using tools that implement forms of more or less flexible string pattern matching. But pattern matching alone usually does not suffice for powerful natural language interaction; indeed, the meaning of utterances is also strongly based on the evolution of the dialogue, therefore what happened in the previous interactions is strongly important for a correct interpretation: in other words, pattern matching must be integrated with state or knowledge data. In this sense, what is needed is a tool able to support logic-/knowledge-based programming and, among such tools, the PROFETA [12], [13], [11], [10] programming platform appears an interesting solution, since it is able to provide all the cited characteristics.

In this context, this paper presents the software architecture of a personal assistant agent based on the PROFETA programming platform and able to interact with a human by using speech and natural language. The solution is based on a software architecture that, by means of the integration of cloud services with PROFETA, is able to perform speech-to-text and text-to-speech operations. Interpreted phrases then become beliefs and can thus be used to trigger production rules that drive the actions of the agent. As a case study, the paper reports the application of the software architecture to the implementation of an agent able to query Wikipedia on the basis of the questions posed by a human user.

The paper is structured as follows. Section II provides an overview of PROFETA. Section III describes the architecture and functionalities of the speech-to-text/text-to-speech interface. Section IV presents the case study. Section V concludes the paper.
II. OVERVIEW OF PROFETA

PROFETA1 [12] is a Python platform for programming the behaviour of an autonomous system (agent or robot) based on the Belief-Desire-Intention paradigm [8], [15]. It allows a programmer to implement the behaviour of an agent by using two basic concepts: beliefs and plans.

Beliefs represent the data part of an agent program, i.e. the knowledge of the agent, and are expressed by using logic predicates, i.e. atomic formulae with ground terms; a belief can be generated on the basis of data coming from the agent's reference environment or as the result of a reasoning process. Plans represent the computational part of an agent program and are expressed as production rules, triggered by a specific event, which can be the assertion/retraction of a belief or the request to achieve a specific goal; the result of a plan is a sequence of actions that represents what the agent has to do in response to that event. Plans are written using Python statements that are, however, interpreted by the PROFETA engine as declarative production rules. The syntax is inspired by AgentSpeak [9]. A plan includes a head and a body; the head is composed of the triggering event plus a condition, i.e. a predicate on the knowledge that must be met for the plan to be triggered; the body is the list of actions that must be executed when the plan is triggered.

The interaction of a PROFETA agent with the environment is performed by using two types of software entities. Actions, already cited in the context of plans, are responsible for acting on the environment (as the name suggests) and are executed as a result of the triggering of a plan. Sensors are instead software entities with the task of polling or receiving data from the environment or other external entities; a sensor, having gathered a useful piece of information, can generate a belief, thus transforming data into agent knowledge; as a result, a belief generated by a sensor can enrich the agent's knowledge and/or trigger a plan, thus provoking the reaction of the agent.

The PROFETA engine includes a scheduler that runs a loop continuously executing the following activities:
1) Sensor Activation. All defined sensors are activated and their polling code is executed; if a sensor generates a belief, it is processed and placed into an event queue.
2) Event Handling. An event is extracted from the event queue and all applicable plans are searched for.
3) Plan Selection. For each applicable plan, the condition is tested and, if true, the plan is selected for execution.
4) Plan Execution. The actions of the selected plan are executed sequentially.

For more details about PROFETA internals, its working scheme, syntax and semantics, the reader can consult the relevant bibliography [12], [13], [11], [10].

1 http://github.com/corradosantoro/profeta
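To give a concrete feeling of these concepts, the following is a minimal sketch of a PROFETA plan, written in the same syntax as the listing of Figure 4 (the show() action appears in that listing; the obstacle/moving beliefs and the stop_motors() action are purely illustrative, not part of the agent presented in this paper):

# head: triggering event +obstacle("front") plus a condition (/ moving());
# body: the list of actions executed when the plan is triggered
+obstacle("front") / moving() >> [ stop_motors(), show("obstacle detected") ]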
III. THE STT/TTS INTERFACE

Implementing an agent able to interact with natural language implies including and using speech-to-text (STT) and text-to-speech (TTS) services. Although many libraries exist (like Sphinx [1], for example), Cloud-based services appear a more interesting solution, mainly because they are continuously improved and upgraded, becoming ever more precise. Indeed, as for STT, we made some tests using both the Sphinx library and the Microsoft Bing Speech service and, as a result, we obtained a greater accuracy with the latter solution. On the other hand, for TTS, we experimented with the IBM Watson text-to-speech service [3], which gives a more natural feeling of a human voice than other Python-based engines like eSpeak [2], SAPI [4], etc.

In order to include STT and TTS services within PROFETA, the proper abstractions provided by the tool must be used: STT is implemented as a PROFETA sensor that we called hearer, while TTS is encapsulated inside an action called say.

Fig. 1. The "hearer" sensor and "say" action (the hearer sensor turns a spoken phrase such as "play some jazz" into the belief heard("play some jazz"); the say action turns the invocation say("en", "hello", "how are you?") into the spoken sentence "hello, how are you?")

As Figure 1 shows, the hearer has the task of sampling the microphone, sending the audio data to the STT cloud service and gathering the response as a string; as a consequence, a heard() belief is asserted whose term is the interpreted string. This assertion can, in turn, trigger the proper rules according to the agent program implemented.

The say action is used to let the agent pronounce a sentence (see Figure 1). The parameters of the action are strings: the first string is the language, while the other parameters represent the various parts of the sentence itself; the action implementation concatenates such parts and sends the resulting string to the TTS cloud service; the reply is the audio samples of the recited phrase, which are sent to the audio device for playing.

From the implementation point of view, a sensor in PROFETA is simply made of a class that extends the Sensor base class, overriding the sense() method; this method must implement the code for data sampling, returning the relevant belief (or None if no data has been sampled). In a similar way, a PROFETA action is implemented as a subclass of the Action base class, overriding the execute() method with the proper code. Given this, implementing the STT and TTS services appears a straightforward task; however, some aspects must be taken into account. The first aspect regards performance: the interaction with cloud services could introduce latencies that affect the performance of the overall system; indeed, in PROFETA, the invocation of the sense() and execute() methods is made synchronously within the main interpretation loop of the agent program: if such methods experience delays (which are indeed unavoidable when a network transaction is performed), the overall performance suffers. The second aspect is related to the generality of the solution: even if, in our implementation, we decided to use Microsoft Bing for STT and IBM Watson for TTS, other engines are available on the web, therefore a software architecture that lets the programmer easily change the desired engine is really opportune. Such aspects are dealt with in the next subsections.
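As a minimal sketch of these two abstractions (the base classes and the overridden methods are those cited above; the import path, the belief pressed() and the helper functions are assumptions made only for illustration):

from profeta.lib import Sensor, Action   # import path is an assumption

class button_sensor(Sensor):
    # hypothetical sensor: polls a push-button and generates a belief
    def sense(self):
        if not button_pressed():          # hypothetical sampling helper
            return None                   # no data sampled: no belief generated
        return pressed("button1")         # the belief returned to the engine

class led_on(Action):
    # hypothetical action: acts on the environment when a plan fires
    def execute(self):
        switch_led(True)                  # hypothetical acting helper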
A. The STT Interface

The STT interface is made of three basic classes and a specialised class; the relevant UML diagram is reported in Figure 2.

Fig. 2. The STT Interface (class hearer, a subclass of PROFETA.Sensor, with the references audiosource and stt and the method sense(): Belief; class AudioSource, with methods open(), close() and read(): audiodata; abstract class STT, with methods connect() and translate(audiodata): string; class BingSTTService, a subclass of STT implementing connect() and translate(audiodata): string)

The principal entity is hearer, which is a subclass of the basic PROFETA class Sensor. An instance of hearer contains references to two other objects: AudioSource and STT; the former has the objective of capturing audio data from the microphone, while the latter implements the network transaction with a cloud STT service. STT is defined as an abstract class: its methods are empty and their implementation is left to a derived class that includes the code for the specific cloud service to be used. In our implementation, we subclassed STT as BingSTT, whose methods implement the interaction with the Microsoft Bing service.
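A skeletal sketch of this hierarchy is reported below; the class and method names follow Figure 2, while the method bodies are only indicative, since the concrete details of the Bing transaction are not reported in this paper:

class STT:
    # abstract class: the concrete behaviour is provided by a subclass
    def connect(self):
        raise NotImplementedError
    def translate(self, audiodata):
        # audio samples in, recognised string (or None) out
        raise NotImplementedError

class BingSTT(STT):
    def connect(self):
        # hypothetical: authenticate against the Microsoft Bing Speech service
        ...
    def translate(self, audiodata):
        # hypothetical: send the audio stream to the cloud endpoint and
        # return the recognised text, or None if recognition fails
        ...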
The task of hearer is coded in its sense() method. First, it invokes the read() method of AudioSource to retrieve the audio data samples; this sampling is performed by entering a loop that listens for incoming sounds from the microphone, properly filtering ambient noise, until some source of sound is perceived; then, the audio stream is captured until silence is detected again and the relevant data are returned as the result of read(). Subsequently, the hearer activates the STT by invoking the translate() method; this method receives the audio stream as input and is expected to return the translated string, or None if translation is impossible. Such a result (if not None) is converted to lowercase2 and the heard() belief is generated as the return value of the sense() method.

For the performance reasons already cited, the task described above must be executed asynchronously with respect to the main PROFETA loop. This objective is achieved by encapsulating the hearer inside an AsyncSensorProxy, a library class provided by PROFETA which has the specific task of making a sensor asynchronous [14].

2 This is required because, for syntax reasons, in PROFETA string constants must be in lowercase.
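The sense() logic just described could be sketched as follows (the collaborating classes are those of Figure 2 and the heard() belief is the one cited above; the constructor and the way the proxy-wrapped sensor is handed to the engine are assumptions, since they are not detailed here):

class hearer(Sensor):
    def __init__(self, audiosource, stt):
        self.audiosource = audiosource        # microphone wrapper (Figure 2)
        self.stt = stt                        # concrete STT service, e.g. BingSTT

    def sense(self):
        audiodata = self.audiosource.read()   # blocks until a phrase is captured
        text = self.stt.translate(audiodata)  # network transaction with the cloud
        if text is None:
            return None                       # translation impossible: no belief
        return heard(text.lower())            # lowercase, as footnote 2 requires

# wrapping the sensor so that cloud latency does not stall the main loop
async_hearer = AsyncSensorProxy(hearer(AudioSource(), BingSTT()))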
B. The TTS Interface

Text-to-speech is performed by a specific PROFETA action. Also in this case, it is preferable to have an asynchronous execution of the cloud interaction with respect to the main PROFETA loop. This is achieved by exploiting the AsyncAction base class provided by the PROFETA library, which is, in turn, derived from Action.

Fig. 3. The TTS Interface (class say, a subclass of PROFETA.AsyncAction, with methods create_service() and execute() and the references service: TTS and audioplayer; abstract class TTS, with methods connect() and translate(string); class AudioPlayer, with methods open() and play(audiodata); class WatsonTTS, a subclass of TTS implementing connect() and translate(string))

The TTS interface, whose class diagram is reported in Figure 3, is composed of the following classes: say, TTS, AudioPlayer and WatsonTTS. Class say is the async-action which is directly invoked by the PROFETA program and coordinates all the text-to-speech activities. TTS is an abstract class that represents the cloud service and (in a way similar to STT) must be subclassed with the implementation of the code for the interaction with a specific TTS service; in our case, this is done by the class WatsonTTS, which includes the code for the interaction with IBM Watson. Class TTS also contains a reference to AudioPlayer, which has the task of playing, on the audio device, the audio stream returned by the TTS service.

The working scheme of the TTS interface is based on a specific usage protocol that the programmer has to respect; in particular, in the say class, two methods must be overridden: create_service() and execute(). The former method is called when the class is instantiated and has the specific task of creating the TTS object and performing, in turn, the initial connection to the cloud service. The latter method is called when the action is explicitly invoked within a PROFETA program and contains the code that retrieves the parameters, composes the string to say and invokes the translate() method of TTS, which concretely executes the text-to-speech and plays the resulting audio stream.
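Following this protocol, the say action could be sketched as below (create_service() and execute() are the methods prescribed above, and service is the reference shown in Figure 3; the parameter-passing convention and the way the language parameter reaches the service are assumptions):

class say(AsyncAction):
    def create_service(self):
        self.service = WatsonTTS()        # concrete TTS of Figure 3
        self.service.connect()            # initial connection to the cloud

    def execute(self, *params):
        lang = params[0]                  # first parameter: the language
        sentence = " ".join(params[1:])   # concatenate the sentence parts
        # translate() performs the text-to-speech and plays the audio stream;
        # how the language is used by the service is not detailed in the paper
        self.service.translate(sentence)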
1 stage("main")
2 +start() >> [ +language("en"), show("starting...") ]
3 +heard("laura") >> [ random_greetings("GreetMessage"), say("en", "GreetMessage"), set_stage("wiki") ]
4
5 stage("wiki")
6 +heard("X") >> [ terms_to_belief("X") ]
7 +generic_phrase("change language") >> [ set_stage("language") ]
8 +generic_phrase("X") / language("L") >> [ wiki_search("L", "X", "Y"),
9 say("L", " I have found the following options ", "Y") ]
10 +search("X") / language("L") >> [ wiki_say("L", "X") ]
11 +timeout() >> [ set_stage("main") ]
12
13 stage("language")
14 +start() >> [ say("en", "what language do you desire?") ]
15 +heard("english") >> [ +language("en"), say("en", " I've set english"), set_stage("wiki") ]
16 +heard("italian") >> [ +language("it"), say("en", " I've set italian"), set_stage("wiki") ]
17 +heard("french") >> [ +language("fr"), say("en", " I've set french"), set_stage("wiki") ]
18 +heard("X") >> [ say("it", " I've not understood the language that you desire or I'm not able to support it") ]
Fig. 4. The listing of Laura
IV. THE WIKIPEDIA AGENT

The agent we developed as a proof-of-concept for our STT/TTS interface is a simple assistant able to browse Wikipedia by means of natural language. We called the assistant Laura, and this is the name the assistant itself responds to in its behaviour.

Fig. 5. The basic behaviour of Laura (a finite-state machine with states init, wiki and language; saying "laura" moves the agent from init to wiki, "change language" moves it from wiki to language, a successful language setting returns it to wiki, and a 30-second timeout returns it from wiki to init)
Laura is implemented as a finite-state machine, sketched in Figure 5, in which each state (which is indeed a macro-state) represents a condition, in Laura's behaviour, corresponding to certain dialogue abilities. The basic working scheme is the following: in the initial state, init, the agent waits for her name in order to be "woken up"; after that, Laura enters the wiki state, in which she is able to hear the terms to be searched for in Wikipedia. In the wiki state, Laura is able to respond to the following phrases:
• "search terms": the specified terms are sent to Wikipedia and, when the result page is returned, its summary data is recited using text-to-speech;
• "change language": it makes Laura enter the language state, letting the user set a new language for both the Wikipedia search and the text-to-speech;
• any other terms: the list of possible options for the said terms is searched for in Wikipedia and such a list is recited by Laura; if the user wants a specific term, s/he can ask for it using the phrase "search terms".
The wiki state is abandoned on the basis of two events: after a timeout of 30 seconds of inactivity (in this case the state reached is once again init), or when the user says "change language"; in the latter case, Laura enters the language state, asking the user for the new language desired.

To support the cited activities, the following beliefs are used:
• heard(terms). The belief already described in Section III, used as the output of the hearer sensor. It is defined as a reactor, i.e. a belief that can only trigger PROFETA program plans but does not enter the knowledge base3.
• search(terms). A reactor generated by processing the data heard by the hearer: if the sentence said includes an explicit search request, this reactor is asserted.
• generic_phrase(terms). Like the previous one, it is a reactor generated by processing the data heard by the hearer, but it is generated if the sentence said does not include a specific search request.
• language(lang). A belief that stores, in the knowledge base, the settings relevant to the current language.
• timeout(). A reactor used to signal 30 seconds of inactivity.

3 See [12] for details about the kinds of beliefs supported by PROFETA.

The complete PROFETA program that controls Laura's behaviour is reported in Figure 4 (the code of the actions is not reported for brevity reasons, but their role is described in the text below).

The macro-states of the finite-state machine of Figure 5 are clearly identified, since they are called stages in PROFETA and are used to specify that certain plans are valid (i.e. triggerable) in that state.

In stage main, the program waits for the assertion of the heard("laura") reactor (this happens if the user pronounces "Laura") and, on the occurrence of such an event (line 3), it enters the wiki stage. In this reaction, a greeting message is recited: this message is generated by the action random_greetings(), which picks a random welcome string (from a predefined set) and binds it to the variable GreetMessage; the use of a random greeting message avoids a repetitive behaviour, from Laura, that, in the long term, could bore the user.

In the wiki stage, the arrival of a heard() belief causes the plan in line 6 to be triggered: the consequence is the call of the action terms_to_belief(), which has the basic task of interpreting
the command according to the cases listed above and depicted in Figure 5. If "search terms" is pronounced (e.g. "search Palermo"), the terms_to_belief() action asserts the search() reactor, using the terms as parameters; this causes the plan in line 10 to be executed: first the current language is retrieved, then the action wiki_say() is executed, which searches the terms in Wikipedia and then recites the relevant summary text.
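The paper does not name the library used to query Wikipedia; just as an illustration, the logic behind wiki_say() (and the option search of wiki_search() in line 8 of the listing) could be sketched with the community "wikipedia" Python package, which is an assumption, not necessarily the authors' choice:

import wikipedia   # assumed third-party package, not named in the paper

def wiki_summary(lang, terms):
    # fetch the summary text to be recited through the say action
    wikipedia.set_lang(lang)              # same language as the TTS
    return wikipedia.summary(terms)

def wiki_options(lang, terms):
    # fetch the list of candidate pages for a generic phrase
    wikipedia.set_lang(lang)
    return ", ".join(wikipedia.search(terms))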
If "change language" is pronounced, the plan in line 7 is triggered and the agent enters the language stage, asking the user for the new language to switch to; when the new language is successfully selected, the agent returns to the wiki stage (lines 15–17). If other terms are said, the plan in line 8 is executed, which causes a generic search in Wikipedia for all the pages related to the terms themselves: the list of options is then recited by Laura.
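Putting the cases together, the dispatch performed by terms_to_belief() could be sketched as follows (only the two reactors and the "search terms" convention come from the text above; the Action subclassing and the assert_belief() call are assumptions about how a PROFETA action asserts a reactor):

class terms_to_belief(Action):
    def execute(self, sentence):
        if sentence.startswith("search "):
            # explicit search request: assert search(terms),
            # firing the plan in line 10
            self.assert_belief(search(sentence[len("search "):]))
        else:
            # any other utterance, including "change language":
            # assert generic_phrase(terms), firing the plans in lines 7 and 8
            self.assert_belief(generic_phrase(sentence))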
V. CONCLUSIONS

This paper has described the software architecture and the working scheme of an assistant agent able to interact with the user through natural language. The basic aspects of the solution are the use of the PROFETA BDI tool as the execution platform and the organisation into a flexible software architecture in order to exploit cloud computing for the speech-to-text and text-to-speech services.

The assistant implemented, called Laura, has the objective of helping the user browse Wikipedia with speech-based interaction. It served as a proof-of-concept to assess the validity of the software architecture and the applicability of PROFETA to such kinds of contexts.

Starting from this experience, we plan, in future work, to improve the understanding abilities of Laura, including a library to parse natural language sentences (like NLTK [5]) and translating the parsed terms into proper beliefs better representing the predicates of a common knowledge; the objective is to have an artificial system that can show a rational behaviour and that can also be adopted in all contexts needing a form of specific user assistance.
REFERENCES

[1] CMUSphinx: Open-source speech recognition toolkit. [Online]. Available: http://cmusphinx.github.io/
[2] eSpeak speech synthesizer. [Online]. Available: http://espeak.sourceforge.net/
[3] IBM Watson text-to-speech service. [Online]. Available: http://www.ibm.com/watson/services/text-to-speech/
[4] Microsoft Speech Application Program Interface. [Online]. Available: http://en.wikipedia.org/wiki/Microsoft_Speech_API
[5] Natural Language Toolkit. [Online]. Available: http://www.nltk.org/
[6] M. Bombara, D. Calì, and C. Santoro, "KORE: A multi-agent system to assist museum visitors," in WOA 2003: Dagli Oggetti agli Agenti. 4th AI*IA/TABOO Joint Workshop "From Objects to Agents": Intelligent Systems and Pervasive Computing, 10-11 September 2003, Villasimius, CA, Italy, 2003, pp. 175–178.
[7] J. M. Bradshaw, Ed., Software Agents. AAAI Press/The MIT Press, 1997.
[8] M. E. Bratman, Intentions, Plans and Practical Reason. Harvard University Press, 1987.
[9] M. d'Inverno and M. Luck, "Engineering AgentSpeak(L): A formal computational model," Journal of Logic and Computation, vol. 8, no. 3, pp. 233–260, 1998. [Online]. Available: http://eprints.ecs.soton.ac.uk/3846/
[10] L. Fichera, D. Marletta, V. Nicosia, and C. Santoro, "Flexible robot strategy design using belief-desire-intention model," in Research and Education in Robotics - EUROBOT 2010 - International Conference, Rapperswil-Jona, Switzerland, May 27-30, 2010, Revised Selected Papers, 2010, pp. 57–71.
[11] L. Fichera, D. Marletta, V. Nicosia, and C. Santoro, "A methodology to extend imperative languages with AgentSpeak declarative constructs," in Proceedings of the 11th WOA 2010 Workshop, Dagli Oggetti Agli Agenti, Rimini, Italy, September 5-7, 2010.
[12] L. Fichera, F. Messina, G. Pappalardo, and C. Santoro, "A Python framework for programming autonomous robots using a declarative approach," Sci. Comput. Program., vol. 139, pp. 36–55, 2017.
[13] G. Fortino, W. Russo, and C. Santoro, "Translating statecharts-based into BDI agents: The DSC/PROFETA case," in Multiagent System Technologies - 11th German Conference, MATES 2013, Koblenz, Germany, September 16-20, 2013, Proceedings, 2013, pp. 264–277.
[14] F. Messina, G. Pappalardo, and C. Santoro, "Integrating cloud services in behaviour programming for autonomous robots," in Algorithms and Architectures for Parallel Processing - 13th International Conference, ICA3PP 2013, Vietri sul Mare, Italy, December 18-20, 2013, Proceedings, Part II, pp. 295–302.
[15] A. Rao and M. Georgeff, "BDI agents: From theory to practice," in Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), San Francisco, CA, 1995, pp. 312–319.
[16] A. D. Stefano, G. Pappalardo, C. Santoro, and E. Tramontana, "A multi-agent reflective architecture for user assistance and its application to e-commerce," in Cooperative Information Agents VI, 6th International Workshop, CIA 2002, Madrid, Spain, September 18-20, 2002, Proceedings, 2002, pp. 90–103.
[17] A. D. Stefano and C. Santoro, "NetChaser: Agent support for personal mobility," IEEE Internet Computing, vol. 4, no. 2, pp. 74–79, 2000.
[18] G. Weiss, Ed., Multiagent Systems. The MIT Press, April 1999.