A Python-based Assistant Agent able to Interact with Natural Language

Fabio Longo, Corrado Santoro
University of Catania, Department of Mathematics and Informatics
Viale Andrea Doria, 6, 95125 Catania, Italy
E-mail: flongo@policlinico.unict.it, santoro@dmi.unict.it

Abstract—This paper describes the software architecture and functionalities of an assistant agent, developed by the authors, that is able to interact with the user through natural language. The agent is implemented by means of PROFETA, a Python-based BDI engine developed within the authors' research group. The agent is composed of two parts: (i) the speech-to-text and text-to-speech services, and (ii) the reasoning engine. For the former part, the agent exploits cloud services (in particular those provided by Microsoft Bing and IBM Watson); to this aim, a flexible software architecture is designed in order to connect first-class entities of PROFETA (i.e. sensors and actions) to the cloud world. The reasoning engine is instead designed by means of the declarative language provided by PROFETA: utterances said by the user become PROFETA beliefs that can, in turn, trigger reasoning rules. In order to show the effectiveness of the solution, a case study of speech-based interaction to browse Wikipedia is presented.

I. INTRODUCTION

With the advent of smartphones, the technological advances in artificial intelligence have made possible the implementation of speech-based assistant agents, like Siri or the Google Assistant. Predicted in many science fiction movies, speech assistants are a kind of personal agent that represents the natural evolution of the assistant agents introduced in the '90s in application software to help the user in her/his day-to-day activities [7], [18], [17], [6], [16]. But while that kind of assistant was strongly specialised for the activities of the associated application, the aim of speech assistants is more ambitious: to help the user in any kind of day-to-day activity. We indeed expect such entities to be able to interpret utterances like "Play some jazz", "Order coffee pods", "Book me a flight to London", etc., and to execute the associated tasks.

All of the exemplified activities, however, are not simple "commands" but, in general, could require a form of—more or less complex—interaction; as an example, to the first phrase above the assistant could answer something like "I've found some tracks by Chet Baker, are they ok for you?", and the user could reply "No, I prefer Thelonious Monk"; placing an order could imply knowing the price and delivery time of some offers, and the user (always by means of speech-based interaction) could be asked to make a selection. The point is that we, as users, expect that speech-based assistants can establish a meaningful dialogue with us, even if we know that the peer is not a human being but an artificial system with obviously limited capabilities and intelligence.

The technological aspects behind such a form of assistance are multiple: not only is a good Natural Language Processing (NLP) engine needed, but a reasoner tool is also mandatory, one that is able to understand the context of the discussion and interpret the meaning of sentences accordingly. The NLP engine has the task of performing speech-to-text processing and, having obtained the text, applying a syntax analyser in order to extract the meaningful parts of the phrase and classify the lemmas. Once the lemmas have been extracted, they have to be interpreted in order to catch the meaning and then execute the proper actions; this operation is usually performed by using tools that implement forms of—more or less flexible—string pattern matching. But pattern matching alone usually does not suffice for a powerful natural language interaction; indeed the meaning of utterances is also strongly based on the evolution of dialogues, therefore what happened in the previous interactions of the dialogue is very important for a correct interpretation: in other words, pattern matching must be integrated with state or knowledge data. In this sense, what is needed is a tool able to support logic-/knowledge-based programming and, among such tools, the PROFETA [12], [13], [11], [10] programming platform appears an interesting solution since it provides all the cited characteristics.
In this context, this paper presents the software architecture of a personal assistant agent based on the PROFETA programming platform and able to interact with a human by using speech and natural language. The solution is based on a software architecture that, by means of the integration of cloud services with PROFETA, is able to perform speech-to-text and text-to-speech operations. Interpreted phrases then become beliefs and can thus be used to trigger production rules that drive the actions of the agent. As a case study, the paper reports the application of the software architecture to the implementation of an agent able to query Wikipedia on the basis of the questions asked by a human user.

The paper is structured as follows. Section II provides an overview of PROFETA. Section III describes the architecture and functionalities of the speech-to-text/text-to-speech interface. Section IV presents the case study. Section V concludes the paper.

II. OVERVIEW OF PROFETA

PROFETA¹ [12] is a Python platform for programming the behaviour of an autonomous system (agent or robot) based on the Belief-Desire-Intention paradigm [8], [15]. It allows a programmer to implement the behaviour of an agent by using two basic concepts: beliefs and plans.

Beliefs represent the data part of an agent program, i.e. the knowledge of the agent, and are expressed by using logic predicates, i.e. atomic formulae with ground terms; a belief can be generated on the basis of data coming from the agent's reference environment or as a result of a reasoning process.

Plans represent the computational part of an agent program and are expressed as production rules, triggered by a specific event, which can be the assertion/retraction of a belief or the request to achieve a specific goal; the result of a plan is a sequence of actions that represents what the agent has to do as a response to that event. Plans are written using Python statements that are, however, interpreted by the PROFETA engine as declarative production rules. The syntax is inspired by AgentSpeak [9]. A plan includes a head and a body; the head is composed of the triggering event plus a condition, i.e. a predicate on the knowledge that must be met for the plan to be triggered; the body is the list of actions that must be executed when the plan is triggered.

The interaction of a PROFETA agent with the environment is performed by using two types of software entities. Actions, already cited in the context of plans, are responsible for acting on the environment (as the name suggests) and are executed as a result of the triggering of a plan. Sensors are instead software entities with the task of polling or receiving data from the environment or other external entities; a sensor, having gathered a useful piece of information, can generate a belief, thus transforming data into agent knowledge; as a result, a belief generated by a sensor can enrich the agent's knowledge and/or trigger a plan, thus provoking the reaction of the agent.

The PROFETA engine includes a scheduler that runs a loop continuously executing the following activities:
1) Sensor Activation. All defined sensors are activated and their polling code is executed; if a sensor generates a belief, it is processed and placed into an event queue.
2) Event Handling. An event is extracted from the event queue and all applicable plans are searched for.
3) Plan Selection. For each applicable plan, the condition is tested and, if true, the plan is selected for execution.
4) Plan Execution. The actions of the selected plan are executed sequentially.

For more details about PROFETA internals, working scheme, syntax and semantics, the reader can consult the relevant bibliography [12], [13], [11], [10].

¹http://github.com/corradosantoro/profeta
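To give a concrete flavour of the notation, the following is a minimal sketch of a PROFETA agent program. The import paths (profeta.lib, profeta.main) and the start()/run() calls reflect the public PROFETA distribution, but are assumptions that may vary across versions; the rule syntax matches the one used in the listing of Section IV.

    # Minimal sketch of a PROFETA agent program; module paths and the
    # start()/run() calls are assumptions based on the public distribution.
    from profeta.lib import Belief, Action
    from profeta.main import PROFETA

    class doorbell(Belief):
        # a logic predicate, e.g. doorbell("front")
        pass

    class open_door(Action):
        def execute(self):
            # how the bound terms are retrieved here depends on the
            # PROFETA version; side effects on the environment go here
            print("opening the door")

    PROFETA.start()

    # A plan: the head is the triggering event (assertion of doorbell()),
    # the body is the list of actions executed in sequence.
    +doorbell("X") >> [open_door("X")]

    PROFETA.run()

The unary "+" and the ">>" operator are overloaded by PROFETA so that ordinary Python statements are read by the engine as declarative production rules.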
III. THE STT/TTS INTERFACE

Implementing an agent able to interact with natural language implies including and using speech-to-text (STT) and text-to-speech (TTS) services. Although many libraries exist (like Sphinx [1], for example), cloud-based services appear a more interesting solution, mainly because they are continuously improved and upgraded, and made more and more precise. Indeed, as for STT, we made some tests using both the Sphinx library and the Microsoft Bing Speech service and, as a result, we obtained greater accuracy with the latter solution. On the other hand, for TTS, we experimented with the IBM Watson text-to-speech service [3], which gives a more natural feeling of human voice than other Python-based engines like eSpeak [2], SAPI [4], etc.

In order to include STT and TTS services within PROFETA, the proper abstractions provided by the tool must be used: the STT is implemented as a PROFETA sensor that we called hearer, while TTS is encapsulated inside an action called say.

Fig. 1. The "hearer" sensor and "say" action. (The utterance "play some jazz" is turned by the hearer sensor into the belief heard("play some jazz"), while the action say("en","hello","how are you?") produces the spoken sentence "hello, how are you?".)

As Figure 1 shows, the hearer has the task of sampling the microphone, sending the audio data to the STT cloud service, and gathering the response as a string; as a consequence, a heard() belief is asserted whose term is the interpreted string. This assertion can, in turn, trigger the proper rules according to the implemented agent program.

The say action is used to let the agent pronounce a sentence (see Figure 1). The parameters of the action are strings: the first string is the language, while the other parameters represent the various parts of the sentence itself; the action implementation concatenates such parts and sends the resulting string to the TTS cloud service; the reply will be the audio samples of the recited phrase, which are sent to the audio device for playing.

From the implementation point of view, a sensor in PROFETA is simply made of a class that extends the Sensor base class, overriding the sense() method; this method must implement the code for data sampling, returning the relevant belief (or None if no data has been sampled). In a similar way, a PROFETA action is implemented as a subclass of the Action base class, overriding the execute() method with the proper code. Given this, implementing the STT and TTS services appears a straightforward task; however, some aspects must be taken into account. The first aspect regards performance: the interaction with cloud services could introduce latencies that can affect the performance of the overall system; indeed, in PROFETA, the invocation of the sense() and execute() methods is made synchronously within the main interpretation loop of the agent program: if such methods experience delays (which are indeed unavoidable when a network transaction is performed), the overall performance is affected. The second aspect is related to the generality of the solution: even if, in our implementation, we decided to use Microsoft Bing for STT and IBM Watson for TTS, other engines are available on the web; therefore a software architecture that lets the programmer easily change the desired engine is really opportune. Such aspects are dealt with in the next subsections.
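Before detailing the two subsections, the general shape of the two extension points can be sketched as follows. Only Sensor, Action, sense() and execute() are the documented PROFETA hooks; the helper functions are hypothetical placeholders standing for the cloud interactions detailed below.

    # Sketch of the two PROFETA extension points; helper functions
    # (record_and_transcribe, synthesize_and_play) are hypothetical.
    from profeta.lib import Belief, Sensor, Action

    class heard(Belief):
        pass

    class hearer(Sensor):
        def sense(self):
            text = record_and_transcribe()   # hypothetical STT helper
            if text is None:
                return None                  # nothing sampled: no belief
            return heard(text)               # sampled data becomes knowledge

    class say(Action):
        def execute(self):
            # called when a triggered plan contains say(...); parameter
            # retrieval and the actual TTS playback are elided here
            synthesize_and_play("hello")     # hypothetical TTS helper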
A. The STT Interface

The STT interface is made of three basic classes and a specialised class; the relevant UML diagram is reported in Figure 2.

Fig. 2. The STT Interface. (UML class diagram: hearer extends PROFETA's Sensor and holds references to AudioSource, with open(), close() and read(): audiodata, and to the abstract class STT, with connect() and translate(audiodata): string, specialised by BingSTTService.)

The principal entity is hearer, which is a subclass of the basic PROFETA class Sensor. An instance of hearer contains a reference to two other objects: AudioSource and STT; the former has the objective of capturing audio data from the microphone, while the latter implements the network transaction with a cloud STT service. STT is defined as an abstract class: its methods are empty and their implementation is left to a derived class that includes the code for the specific cloud service to be used. In our implementation, we subclassed STT as BingSTT, whose methods implement the interaction with the Microsoft Bing service.

The task of hearer is coded in its sense() method. First it invokes the read() method of AudioSource to retrieve audio data samples; this sampling is performed by entering a loop that listens for incoming sounds from the microphone, properly filtering ambient noise, until some source of sound is perceived; then the audio stream is captured until silence is identified again and the relevant data are returned as the result of read(). Subsequently, the hearer activates the STT by invoking the translate() method; this method receives the audio stream as input and is expected to return the translated string, or None if translation is impossible. Such a result (if not None) is converted to lowercase² and the heard() belief is generated as the result value of the sense() method.

²This is required because, for syntax reasons, in PROFETA string constants must be in lowercase.

For the performance reasons already cited, the task described above must be executed asynchronously with respect to the main PROFETA loop. This objective is achieved by encapsulating the hearer inside an AsyncSensorProxy, a library class provided by PROFETA which has the specific task of making a sensor asynchronous [14].
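A condensed sketch of these classes is the following. The class and method names mirror Figure 2, while the audio capture and the HTTP transaction with the cloud service are reduced to placeholder comments (the paper does not report the authors' actual code).

    # Condensed sketch of the STT interface of Figure 2.
    from abc import ABC, abstractmethod
    from profeta.lib import Belief, Sensor

    class heard(Belief):
        pass

    class STT(ABC):
        # abstract cloud service: a concrete engine provides the code
        @abstractmethod
        def connect(self): ...
        @abstractmethod
        def translate(self, audiodata): ...   # returns a string, or None

    class BingSTT(STT):
        def connect(self):
            pass   # open the session with the Microsoft Bing Speech service
        def translate(self, audiodata):
            pass   # send audiodata to the service; return text, None on failure

    class AudioSource:
        def read(self):
            pass   # wait for sound, record until silence, return the samples

    class hearer(Sensor):
        def __init__(self, audiosource, stt):
            super().__init__()   # assumption: base __init__ takes no arguments
            self.audiosource = audiosource
            self.stt = stt

        def sense(self):
            audiodata = self.audiosource.read()
            text = self.stt.translate(audiodata)
            if text is None:
                return None
            return heard(text.lower())   # PROFETA string constants are lowercase

An instance would then be wrapped, e.g. AsyncSensorProxy(hearer(AudioSource(), BingSTT())), before being registered with the engine; the registration call is omitted since it depends on the PROFETA version.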
B. The TTS Interface

Text-to-speech is performed by a specific PROFETA action. Also in this case, it is preferable to have an asynchronous execution of the cloud interaction with respect to the main PROFETA loop. This is achieved by exploiting the AsyncAction base class provided by the PROFETA library, which is, in turn, derived from Action.

Fig. 3. The TTS Interface. (UML class diagram: say extends PROFETA's AsyncAction, overriding create_service() and execute(), and holds a reference to the abstract class TTS, with connect() and translate(string), specialised by WatsonTTS; TTS holds a reference to AudioPlayer, with open() and play(audiodata).)

The TTS interface, whose class diagram is reported in Figure 3, is composed of the following classes: say, TTS, AudioPlayer and WatsonTTS. Class say is the async action which is directly invoked by the PROFETA program and coordinates all the text-to-speech activities. TTS is an abstract class that represents the cloud service and (in a way similar to STT) must be subclassed with the implementation of the code for the interaction with a specific TTS service; in our case, this is done by the class WatsonTTS, which includes the code for the interaction with IBM Watson. Class TTS also contains a reference to AudioPlayer, which has the task of playing to the audio device the audio stream returned by the TTS service.

The working scheme of the TTS interface is based on a specific usage protocol that the programmer has to respect; in particular, in the say class, two methods must be overridden: create_service() and execute(). The former method is called when the class is instantiated and has the specific task of creating the TTS object and performing, in turn, the initial connection to the cloud service. The latter method is called when the action is explicitly invoked within a PROFETA program and contains the code that retrieves the parameters, composes the string to say and invokes the translate() method of TTS, which concretely executes text-to-speech and plays the resulting audio stream.
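The protocol can be sketched as follows. Class and method names mirror Figure 3; the import of AsyncAction from profeta.lib is an assumption (mirroring Action), and the network and audio details are placeholder comments.

    # Condensed sketch of the TTS interface of Figure 3, following the
    # create_service()/execute() protocol described above.
    from profeta.lib import AsyncAction   # assumption: exported like Action

    class AudioPlayer:
        def play(self, audiodata):
            pass   # send the synthesised samples to the audio device

    class TTS:
        # abstract cloud service; a concrete engine subclasses it
        def __init__(self, audioplayer):
            self.audioplayer = audioplayer
        def connect(self):
            raise NotImplementedError
        def translate(self, text):
            raise NotImplementedError

    class WatsonTTS(TTS):
        def connect(self):
            pass   # open the session with the IBM Watson TTS service
        def translate(self, text):
            audiodata = b""   # placeholder: request the synthesis of `text`
            self.audioplayer.play(audiodata)

    class say(AsyncAction):
        def create_service(self):
            # called once, at instantiation: create the TTS object and connect
            self.tts = WatsonTTS(AudioPlayer())
            self.tts.connect()

        def execute(self):
            # called when a plan invokes say(...): the first parameter is the
            # language, the others are the sentence parts (retrieval elided)
            parts = ["hello", "how are you?"]   # placeholder parameters
            self.tts.translate(" ".join(parts))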
IV. THE WIKIPEDIA AGENT

The agent we developed as a proof of concept for our STT/TTS interface is a simple assistant able to browse Wikipedia by means of natural language. We called the assistant Laura, and this is the name the assistant itself responds to in its behaviour.

Fig. 5. The basic behaviour of Laura. (State machine with states init, wiki and language: "laura" moves from init to wiki, "change language" from wiki to language, "language set" back to wiki, and a 30-second timeout from wiki back to init.)

Laura is implemented as a finite-state machine, sketched in Figure 5, in which each state (which is indeed a macro-state) represents a condition in Laura's behaviour corresponding to certain dialogue abilities. The basic working scheme is the following: in the initial state, init, the agent waits for her name in order to be "woken up"; after that, Laura enters the wiki state, in which she is able to hear the terms to be searched for in Wikipedia. In the wiki state, Laura is able to respond to the following phrases:

• "search terms": the specific terms are sent to Wikipedia and, when the result page is returned, the summary data is recited using text-to-speech;
• "change language": it makes Laura enter the language state, letting the user set a new language for both the Wikipedia search and text-to-speech;
• any other terms: the list of possible options for the said terms is searched for in Wikipedia and such a list is recited by Laura; if the user wants a specific term, s/he can ask for it using the phrase "search terms".

The wiki state is abandoned on the basis of two events: a timeout of 30 seconds of inactivity (in this case the state reached is once again init), or the user saying "change language"; in the latter case, Laura enters the language state, asking the user for the new language desired.

To support the cited activities, the following beliefs are used (a sketch of the processing that generates some of them follows the list):

• heard(terms). The belief already described in Section III, used as the output of the hearer sensor. It is defined as a reactor, i.e. a belief that can only trigger PROFETA plans but does not enter the knowledge base³.
• search(terms). A reactor generated by processing the data heard by the hearer: if the sentence said includes an explicit search request, this reactor is asserted.
• generic_phrase(terms). Like the previous one, it is a reactor generated by processing the data heard by the hearer, but it is generated if the sentence said does not include a specific search request.
• language(lang). A belief that stores, in the knowledge base, the setting of the current language.
• timeout(). A reactor used to signal the 30 seconds of inactivity.

³See [12] for details about the kinds of beliefs supported by PROFETA.
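The paper does not report the code that performs this processing; a plausible sketch based on simple pattern matching could be the following, where the Reactor base class and the self.assert_belief() call are assumptions about the PROFETA API (see [12]).

    # Sketch of an action that classifies the heard sentence and asserts
    # the proper reactor; Reactor and assert_belief() are assumptions.
    from profeta.lib import Action, Reactor

    class search(Reactor): pass
    class generic_phrase(Reactor): pass

    class terms_to_belief(Action):
        def execute(self):
            sentence = "search palermo"   # placeholder: the term bound to "X"
            if sentence.startswith("search "):
                # explicit search request: assert the search() reactor
                self.assert_belief(search(sentence[len("search "):]))
            else:
                # any other sentence becomes a generic phrase
                self.assert_belief(generic_phrase(sentence))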
The complete PROFETA program that controls Laura's behaviour is reported in Figure 4 (the code of the actions is not reported for brevity reasons, but their role is described in the text below).

1  stage("main")
2  +start() >> [ +language("en"), show("starting...") ]
3  +heard("laura") >> [ random_greetings("GreetMessage"), say("en", "GreetMessage"), set_stage("wiki") ]
4
5  stage("wiki")
6  +heard("X") >> [ terms_to_belief("X") ]
7  +generic_phrase("change language") >> [ set_stage("language") ]
8  +generic_phrase("X") / language("L") >> [ wiki_search("L", "X", "Y"),
9                                            say("L", " I have found the following options ", "Y") ]
10 +search("X") / language("L") >> [ wiki_say("L", "X") ]
11 +timeout() >> [ set_stage("main") ]
12
13 stage("language")
14 +start() >> [ say("en", "what language do you desire?") ]
15 +heard("english") >> [ +language("en"), say("en", " I've set english"), set_stage("wiki") ]
16 +heard("italian") >> [ +language("it"), say("en", " I've set italian"), set_stage("wiki") ]
17 +heard("french") >> [ +language("fr"), say("en", " I've set french"), set_stage("wiki") ]
18 +heard("X") >> [ say("it", " I've not understood the language that you desire or I'm not able to support it") ]

Fig. 4. The listing of Laura

The macro-states of the finite-state machine of Figure 5 are clearly identified since they are called stages in PROFETA and are used to specify that certain plans are valid (i.e. triggerable) in that state. In the main stage, the program waits for the assertion of the heard("laura") reactor (this happens when the user pronounces "Laura") and, on the occurrence of such an event (line 3), it enters the wiki stage. In such a reaction, a greeting message is recited: this message is generated by the action random_greetings(), which picks a random welcome string (from a predefined set) and binds it to the variable GreetMessage; the use of a random greeting message is to avoid a repetitive behaviour from Laura that, in the long term, could bore the user.

In the wiki stage, the arrival of a heard() belief causes the plan in line 6 to be triggered: the consequence is the call of the action terms_to_belief(), which has the basic task of interpreting the command according to the cases listed above and depicted in Figure 5. If "search terms" is pronounced (e.g. "search Palermo"), the terms_to_belief() action asserts the search() reactor, using the terms as parameters; this causes the plan in line 10 to be executed: first the current language is retrieved, then the action wiki_say() is executed, which searches for the terms in Wikipedia and then recites the relevant summary text. If "change language" is pronounced, the plan in line 7 is triggered and the agent enters the language stage, asking the user for the new language to switch to; when the new language is successfully selected, the agent returns to the wiki stage (lines 15–17). If other terms are said, the plan in line 8 is executed, which causes a generic search in Wikipedia for all the pages related to the terms themselves: the list of options is then recited by Laura.

V. CONCLUSIONS

This paper has described the software architecture and the working scheme of an assistant agent able to interact with the user with natural language. The basic aspects of the solution are the use of the PROFETA BDI tool as the execution platform and the organisation in a flexible software architecture in order to exploit cloud computing for speech-to-text and text-to-speech services.

The assistant implemented, called Laura, has the objective of helping the user in browsing Wikipedia with speech-based interaction.
It served as a proof of concept to assess the validity of the software architecture and the applicability of PROFETA to such kinds of contexts.

Starting from this experience, we plan, in future work, to improve the understanding abilities of Laura by including a library to parse natural language sentences (like NLTK [5]), also translating the parsed terms into proper beliefs better representing the predicates of a common knowledge; the objective is to have an artificial system which shows a rational behaviour that can also be adopted in all contexts needing a form of specific user assistance.

REFERENCES

[1] CMUSphinx: Open-source speech recognition toolkit. [Online]. Available: http://cmusphinx.github.io/
[2] eSpeak speech synthesizer. [Online]. Available: http://espeak.sourceforge.net/
[3] IBM Watson text-to-speech service. [Online]. Available: http://www.ibm.com/watson/services/text-to-speech/
[4] Microsoft Speech Application Program Interface. [Online]. Available: http://en.wikipedia.org/wiki/Microsoft_Speech_API
[5] Natural Language Toolkit. [Online]. Available: http://www.nltk.org/
[6] M. Bombara, D. Calì, and C. Santoro, "KORE: A multi-agent system to assist museum visitors," in WOA 2003: Dagli Oggetti agli Agenti. 4th AI*IA/TABOO Joint Workshop "From Objects to Agents": Intelligent Systems and Pervasive Computing, 10-11 September 2003, Villasimius, CA, Italy, 2003, pp. 175-178.
[7] J. M. Bradshaw, Ed., Software Agents. AAAI Press/The MIT Press, 1997.
[8] M. E. Bratman, Intentions, Plans and Practical Reason. Harvard University Press, 1987.
[9] M. d'Inverno and M. Luck, "Engineering AgentSpeak(L): A formal computational model," Journal of Logic and Computation, vol. 8, no. 3, pp. 233-260, 1998. [Online]. Available: http://eprints.ecs.soton.ac.uk/3846/
[10] L. Fichera, D. Marletta, V. Nicosia, and C. Santoro, "Flexible robot strategy design using belief-desire-intention model," in Research and Education in Robotics - EUROBOT 2010 - International Conference, Rapperswil-Jona, Switzerland, May 27-30, 2010, Revised Selected Papers, 2010, pp. 57-71.
[11] ——, "A methodology to extend imperative languages with AgentSpeak declarative constructs," in Proceedings of the 11th WOA 2010 Workshop, Dagli Oggetti Agli Agenti, Rimini, Italy, September 5-7, 2010, 2010.
[12] L. Fichera, F. Messina, G. Pappalardo, and C. Santoro, "A Python framework for programming autonomous robots using a declarative approach," Sci. Comput. Program., vol. 139, pp. 36-55, 2017.
[13] G. Fortino, W. Russo, and C. Santoro, "Translating statecharts-based into BDI agents: The DSC/PROFETA case," in Multiagent System Technologies - 11th German Conference, MATES 2013, Koblenz, Germany, September 16-20, 2013, Proceedings, 2013, pp. 264-277.
[14] F. Messina, G. Pappalardo, and C. Santoro, "Integrating cloud services in behaviour programming for autonomous robots," in Algorithms and Architectures for Parallel Processing - 13th International Conference, ICA3PP 2013, Vietri sul Mare, Italy, December 18-20, 2013, Proceedings, Part II, pp. 295-302.
[15] A. Rao and M. Georgeff, "BDI agents: From theory to practice," in Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), San Francisco, CA, 1995, pp. 312-319.
[16] A. D. Stefano, G. Pappalardo, C. Santoro, and E. Tramontana, "A multi-agent reflective architecture for user assistance and its application to e-commerce," in Cooperative Information Agents VI, 6th International Workshop, CIA 2002, Madrid, Spain, September 18-20, 2002, Proceedings, 2002, pp. 90-103.
[17] A. D. Stefano and C. Santoro, "NetChaser: Agent support for personal mobility," IEEE Internet Computing, vol. 4, no. 2, pp. 74-79, 2000.
[18] G. Weiss, Ed., Multiagent Systems. The MIT Press, April 1999.