=Paper=
{{Paper
|id=Vol-2970/gdepaper5
|storemode=property
|title=Natural Language Question Answering with Goal-directed Answer Set Programming
|pdfUrl=https://ceur-ws.org/Vol-2970/gdepaper5.pdf
|volume=Vol-2970
|authors=Kinjal Basu,Gopal Gupta
|dblpUrl=https://dblp.org/rec/conf/iclp/0002G21
}}
==Natural Language Question Answering with Goal-directed Answer Set Programming==
Kinjal Basu, Gopal Gupta
The University of Texas at Dallas, Richardson, Texas, USA

Abstract: Understanding the meaning of a text is a fundamental challenge of natural language understanding (NLU) research. An ideal NLU system should process a language in a way that is not exclusive to a single task or a dataset. To this end, a knowledge-driven, generalized semantic representation of English text is of utmost importance for any NLU application. Ideally, for any realistic (human-like) NLU system, commonsense reasoning must be an integral part, and goal-directed answer set programming (ASP) is indispensable for performing commonsense reasoning. Keeping all of this in mind, we have developed various NLU applications, ranging from visual question answering to a conversational agent. In contrast to existing purely machine learning based methods for the same tasks, we show that our applications not only maintain high accuracy but also provide an explanation for the answers they compute.

Keywords: Answer Set Programming, Natural Language Understanding, Question Answering, Conversational Agent

1. Introduction

The long term goal of natural language understanding (NLU) research is to build applications, e.g., chatbots and visual/textual question answering (QA) systems, that act exactly like a human assistant. A human assistant will understand the user's intent and fulfill the task. The task can be answering questions about a story or an image, giving directions to a place, or reserving a table in a restaurant knowing the user's preferences. Human level understanding of natural language is needed for an NLU application that aspires to act exactly like a human. To understand the meaning of a natural language sentence, humans first process the syntactic structure of the sentence and then infer its meaning. Also, humans use commonsense knowledge to understand the often complex and ambiguous meaning of natural language sentences. Humans interpret a passage as a sequence of sentences and will normally process the events in the story in the same order as the sentences. Once humans understand the meaning of a passage, they can answer questions posed about it, along with an explanation for the answer. Similarly, for visual question answering, an image is first represented in the human's mind, which then makes it possible to answer natural language questions about it by understanding the intent. Moreover, by using commonsense, a human assistant understands the user's intended task and asks the user questions about the information required to successfully carry out the task. Also, to hold a goal-oriented conversation, a human remembers all the details given in the past and most of the time performs non-monotonic reasoning to accomplish the assigned task. We believe that an automated QA system or a goal-oriented, closed-domain chatbot should work in a similar way.

If we want to build AI systems that emulate humans, then understanding natural language sentences is the foremost priority for any NLU application. In an ideal scenario, an NLU application should map a sentence to the knowledge (semantics) it represents, augment it with commonsense knowledge related to the concepts involved (just as humans do), and then use the combined knowledge to do the required reasoning. In this paper, we introduce one of our algorithms [1] for automatically generating the semantics corresponding to each English sentence using the comprehensive verb lexicon for English verbs, VerbNet [2]. For each English verb, VerbNet gives the syntactic and semantic patterns. The algorithm employs partial syntactic matching between the parse tree of a sentence and a verb's frame syntax from VerbNet to obtain the meaning of the sentence in terms of VerbNet's primitive predicates.
This matching is motivated by the denotational semantics of programming languages and can be thought of as mapping parse trees of sentences to knowledge that is constructed out of the semantics provided by VerbNet. The VerbNet semantics is expressed using a set of primitive predicates that can be thought of as the semantic algebra of the denotational semantics.

Answering questions about a given picture, or Visual Question Answering (VQA), can be processed similarly to textual QA. To answer questions about a picture, humans generally first recognize the objects in the picture, then they reason over the questions asked using their commonsense knowledge. To be effective, we believe a VQA system should work in a similar way. Thus, to perceive a picture, ideally, a system should have intuitive abilities such as object and attribute recognition and an understanding of spatial relationships. To answer questions, it must use reasoning. Natural language questions are complex and ambiguous by nature, and also require commonsense knowledge for their interpretation. Most importantly, reasoning skills such as counting, inference, comparison, etc., are needed to answer these questions. Here, we present our VQA work, AQuA (ASP-based Visual Question Answering), which closely simulates the above-described behavior of an ideal VQA system [3].

Figure 1: VerbNet frame instance for the verb class grab. Example: "She grabbed the rail"; Syntax: NP V NP (Agent V Theme); Semantics: Continue(E,Theme), Cause(Agent,E), Contact(During(E),Agent,Theme).

2. Background

Answer Set Programming (ASP): An answer set program is a collection of rules of the form

l0 ← l1, ..., lm, not lm+1, ..., not ln.

where each li is a literal of classical logic [4]. In an ASP rule, the left-hand side is called the head and the right-hand side is the body. Constraints are ASP rules without a head, whereas facts are rules without a body. Variables start with an uppercase letter, while predicates and constants begin with a lowercase letter. We will follow this convention throughout the paper.
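As a small illustrative example of these conventions (ours, not taken from the paper), the program below contains a fact, a rule whose body uses negation as failure, and a headless constraint:

% fact: a rule with an empty body
bird(tweety).
% rule: the head holds if every body literal holds; "not" is negation as failure
likes_seeds(X) :- bird(X), not carnivore(X).
% constraint: a rule with no head, ruling out models in which its body holds
:- bird(X), reptile(X).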
The semantics of ASP is based on the stable model semantics of logic programming [5]. ASP supports negation as failure [4], allowing it to elegantly model commonsense reasoning, default rules with exceptions, etc., and it serves as the secret sauce for AQuA's sophistication.

s(CASP) System: s(CASP) [6] is a query-driven, goal-directed implementation of ASP that includes constraint solving over reals. Goal-directed execution of s(CASP) is indispensable for automating commonsense reasoning, as traditional grounding- and SAT-solver-based implementations of ASP may not be scalable. There are three major advantages of using the s(CASP) system: (i) s(CASP) does not ground the program, which makes our framework scalable, (ii) it only explores the parts of the knowledge base that are needed to answer a query, and (iii) it provides a natural language justification (proof tree) for an answer [7].

Denotational Semantics: In programming language research, denotational semantics is a widely used approach to formalize the meaning of a programming language in terms of mathematical objects (called domains, such as integers, truth values, tuples of values, and mathematical functions) [8]. The denotational semantics of a programming language has three components [8]:
1. Syntax: specified as abstract syntax trees.
2. Semantic Algebra: the basic domains along with their associated operations; the meaning of a program is expressed in terms of these basic operations applied to elements of the domains.
3. Valuation Function: mappings from abstract syntax trees (and possibly the semantic algebra) to values in the semantic algebra.
Given a program P written in language L, P's denotation (meaning), expressed in terms of the semantic algebra, is obtained by applying the valuation function of L to P's syntax tree. Details can be found elsewhere [8].

VerbNet: Inspired by Beth Levin's classification of verbs and their syntactic alternations [9], VerbNet [2] is the largest online network of English verbs. A verb class in VerbNet is mainly expressed by syntactic frames, thematic roles, and a semantic representation. The VerbNet lexicon identifies the thematic roles and syntactic patterns of each verb class and infers the common syntactic structure and semantic relations for all member verbs. Figure 1 shows an example of a VerbNet frame of the verb class grab.

3. Commonsense Reasoning with Default Theories

As mentioned earlier, a realistic socialbot should be able to understand and reason like a human. In human-to-human conversations, we do not always spell out every detail; we expect the listener to fill the gaps through commonsense knowledge and commonsense reasoning. Thus, to obtain a conversational bot, we need to automate commonsense reasoning, i.e., automate the human thought process. The human thought process is flexible and non-monotonic in nature, which means "what we believe today may become false in the future with new knowledge". We can model commonsense reasoning with (i) default rules, (ii) exceptions to defaults, (iii) preferences over multiple defaults [5], and (iv) modeling of multiple worlds [4, 10].

Much of human knowledge consists of default rules, for example, the rule: Normally, birds fly. However, there are exceptions to defaults, for example, penguins are exceptional birds that do not fly. Reasoning with default rules is non-monotonic, as a conclusion drawn using a default rule may have to be withdrawn if more knowledge becomes available and the exceptional case applies. For example, if we are told that Tweety is a bird, we will conclude that it flies. Later, on learning that Tweety is a penguin, we withdraw our earlier conclusion.
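This default-plus-exception pattern has a direct, well-known ASP encoding; the sketch below uses our own predicate names rather than anything given in the paper:

% default: normally, birds fly
flies(X) :- bird(X), not abnormal_bird(X).
% exception: penguins are abnormal with respect to flying
abnormal_bird(X) :- penguin(X).
% every penguin is a bird
bird(X) :- penguin(X).
bird(tweety).
% ?- flies(tweety). succeeds; after adding the fact penguin(tweety).
% the same query fails, i.e., the earlier conclusion is withdrawn.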
Humans often make inferences in the absence of complete information. Such an inference may be revised later as more information becomes available. This human-style reasoning is elegantly captured by default rules and exceptions. Preferences are needed when there are multiple default rules, in which case additional information gleaned from the context is used to resolve which rule is applicable. One could argue that expert knowledge amounts to learning the defaults, exceptions, and preferences of the field in which a person is an expert.

Also, humans can naturally deal with multiple worlds. These worlds may be consistent with each other in some parts, but inconsistent in other parts. For example, animals don't talk like humans in the real world; in the cartoon world, however, animals do talk like humans. So a fish called Nemo may be able to swim in both the real world and the cartoon world, but can only talk in the cartoon world. Humans have no trouble separating the cartoon world from the real world and switching between the two as the situation demands. Default reasoning augmented with the ability to operate in multiple worlds allows one to closely represent the human thought process. Default rules with exceptions and preferences, together with multiple worlds, can be elegantly realized with answer set programming [4, 10] and the s(CASP) system [6].
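One simple way to keep such worlds apart, shown here purely as an illustration (the paper does not give this encoding), is to index each conclusion with the world in which it holds:

world(real). world(cartoon).
fish(nemo).
% a fish can swim in every world
can(F, swim, W) :- fish(F), world(W).
% in the cartoon world, animals can also talk like humans
can(F, talk, cartoon) :- fish(F).
% ?- can(nemo, swim, real). and ?- can(nemo, talk, cartoon). succeed,
% while ?- can(nemo, talk, real). fails.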
4. Visual Question Answering

Our work, AQuA (ASP-based Question Answering), is an Answer Set Programming (ASP) based visual question answering framework that truly "understands" an input picture and answers natural language questions about that picture [3]. This framework achieves 93.7% accuracy on the CLEVR dataset, which exceeds human baseline performance. What is significant is that AQuA translates a question into an ASP query without requiring any training. AQuA replicates a human's VQA behavior by incorporating commonsense knowledge and using ASP for reasoning. VQA in the AQuA framework employs the following sources of knowledge: (i) knowledge about objects extracted using the YOLO algorithm [11], (ii) semantic relations extracted from the question, (iii) the query generated from the question, and (iv) commonsense knowledge. AQuA runs on the query-driven, scalable s(CASP) [6] answer set programming system, which can provide a proof tree as a justification for the query being processed.

AQuA processes and reasons over raw textual questions and does not need any annotation or generation of functional units such as those employed by several approaches proposed for the CLEVR dataset [12, 13, 14]. Also, instead of predicting an answer, AQuA augments the parsed question with commonsense knowledge to truly understand it and to compute the correct answer (e.g., it understands that a block means a cube, or that a shiny object is a metal object).

4.1. Technical Approach

AQuA represents knowledge using the ASP paradigm and is made up of five modules that perform the following tasks: (i) object detection and feature extraction using the YOLO algorithm [11], (ii) preprocessing of the natural language question, (iii) semantic relation extraction from the question, (iv) query generation based on semantic analysis, and (v) commonsense knowledge representation. Figure 2 shows AQuA's architecture; the five modules are labeled, respectively, YOLO, Preprocessor, Semantic Relation Extractor (SRE), Query Generator, and Commonsense Knowledge.

Figure 2: AQuA System Architecture.

The Preprocessor module extracts information from the question by using the Stanford CoreNLP part-of-speech (POS) tagger and dependency graph generator. The output of the Preprocessor module is consumed by the Query Generator and the Semantic Relation Extractor (SRE) modules. AQuA transforms natural language questions into a logical representation before feeding them to the ASP engine. The logical representation is inspired by the Neo-Davidsonian formalism [15], where every event is recognized by a unique identifier. Next, semantic relation labeling is the process of assigning relationship labels to two different phrases in a sentence based on the context. To understand the CLEVR dataset questions, AQuA requires two types of semantic relations (quantification and property) to be extracted from the questions, if they exist. Based on the knowledge extracted from a question, AQuA generates a list of ASP clauses along with the query, which runs on the s(CASP) engine to find the answer. In general, questions with a one-word answer are categorized into: (i) yes/no questions, and (ii) attribute/value questions. Similar to a human, AQuA requires commonsense knowledge to correctly compute answers to questions. For the CLEVR dataset questions, AQuA needs commonsense knowledge about different properties (e.g., color, size, material), directions (e.g., left, front), and shapes (e.g., cube, sphere). AQuA will not be able to understand question phrases such as '... red metal cube ...' unless it knows that red is a color, metal is a material, and cube is a shape. Finally, the ASP engine is the brain of our system: all the knowledge (image representation, commonsense knowledge, semantic relations) and the query, expressed in ASP syntax, are executed using the query-driven s(CASP) system.
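The paper does not show AQuA's exact predicates for this lexicon-level knowledge, but it can be pictured as a handful of ASP facts plus synonym normalization; the predicate names below are purely illustrative:

% illustrative commonsense facts (names are ours, not AQuA's actual predicates)
color(red).      color(blue).
material(metal). material(rubber).
shape(cube).     shape(sphere).
% synonym normalization: "block" is understood as cube, "shiny" as metal
same_as(block, cube).
same_as(shiny, metal).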
4.2. Experiments and Results

We tested our AQuA framework on the CLEVR dataset [16] and obtained an accuracy of 93.7%, with 42,314 correct answers out of 45,157 questions. This performance is beyond average human accuracy. Quantitative results for each question type are summarized in Table 1.

Table 1: AQuA performance results by question type (accuracy in %).
Exist: 96
Count: 91.7
Compare Value (92.89 overall): Shape 87.42, Color 94.32, Size 92.17, Material 96.14
Compare Integer (98.05 overall): Less Than 97.7, Greater Than 98.6, Equal NA*
Query Attribute (94.39 overall): Shape 94.01, Color 94.87, Size 93.82, Material 94.75
(* Equality questions are minuscule in number and are currently ignored.)

We have extensively studied the 2,843 questions that produced erroneous results. Our manual analysis showed that the mismatch happens mostly because of errors caused by the YOLO module: failing to detect a partially visible object, wrongly detecting a shadow as an object, wrongly detecting two overlapping objects as one, etc. Other reasons for wrong answers are wrong parsing or oversimplified spatial reasoning.

5. Textual Question Answering

Unlike programming languages, the denotation of a natural language can be quite ambiguous. English is no exception, and the meaning of a word or sentence may depend on the context. The generation of correct knowledge from a sentence is, hence, quite hard. We have developed a VerbNet-based algorithm for semantic generation from English text. In this section, we present a novel approach to automatically map parse trees of simple English sentences to their denotations, i.e., the knowledge they represent [17]. We applied this approach to construct two NLU applications that we present here: SQuARE (Semantic-based Question Answering and Reasoning Engine) and StaCACK (Stateful Conversational Agent using Commonsense Knowledge).

5.1. Semantics-driven ASP Code Generation

Similar to the denotational approach for representing the meaning of a programming language, an ideal NLU system should use denotational semantics to compositionally map text syntax to its meaning. Knowledge primitives should be represented using a semantic algebra [8] of well understood concepts. Then the semantics, along with commonsense knowledge represented using the same semantic algebra, can be used to construct different NLU applications, such as QA systems, chatbots, information extraction systems, text summarization, etc. The ambiguous nature of natural language is the main hurdle in treating it as a programming language: the meaning of an English word or sentence may depend on the context. The algorithm we present takes the syntactic parse tree of an English sentence and uses VerbNet to automatically map the parse tree to its denotation, i.e., the knowledge it represents.

An English sentence that contains an action verb (i.e., not a be verb) always describes an event. The verb also constrains the relations among the event participants. VerbNet encapsulates all of this information using verb classes, each of which represents a set of verbs with similar meanings, so each verb is a member of one or more classes. For each class, VerbNet provides the skeletal parse tree (frame syntax) for different usages of the verb class and the respective semantics (frame semantics). The semantic definition of each frame uses pre-defined predicates of VerbNet that have thematic roles (AGENT, THEME, etc.) as arguments. Thus, we can view VerbNet as a very large valuation (semantic) function that maps syntax tree patterns to their respective meanings. As we use ASP to represent the knowledge, the algorithm generates the sentence's semantic definition in ASP. Our goal is to find a partial match between the sentence parse tree and the VerbNet frame syntax and to ground the thematic-role variables, so that we can obtain the semantics of the sentence from the frame semantics and represent it in ASP.
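For instance, matching the grab frame of Figure 1 against the parse tree of "She grabbed the rail" binds Agent to she and Theme to the_rail, so the frame semantics instantiates, roughly in the predicate style of the generated output shown later in Figure 3, to:

continue(event(grab), theme(the_rail)).
cause(agent(she), event(grab)).
contact(during(grab), agent(she), theme(the_rail)).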
The process of semantic knowledge generation from a sentence is illustrated in Figure 3. We use Stanford's CoreNLP parser [18] to generate the parse tree, pt, of an English sentence. The semantic generator component consists of the valuation function that maps pt to its meaning. To accomplish this, we have introduced the Semantic Knowledge Generation algorithm (Algorithm 1). First, the algorithm collects the list of verbs mentioned in the sentence, and for each verb it accumulates all the syntactic information (frame syntax) and the corresponding semantic information (thematic roles and predicates) from VerbNet using the verb's class. The algorithm then finds the grounded thematic-role variables by performing a partial tree matching (described in Algorithm 2) between each gathered frame syntax and pt. Starting from the verb node of pt, the partial tree matching algorithm performs a bottom-up search and, at each level, tries to match the skeletal parse tree of the frame syntax through a depth-first traversal. If the algorithm finds an exact or a partial match (obtained by skipping words, e.g., prepositions), it returns the thematic roles to the calling Algorithm 1. Finally, Algorithm 1 grounds the pre-defined predicates with the values of the thematic roles and generates the ASP code.

Figure 3: English to ASP translation process. The sentence "John grabbed the apple there" is parsed by the Stanford CoreNLP parser; the semantic generator (valuation function) consults the VerbNet frames of the verb grab and emits the sentence semantics represented in ASP, e.g.: contact(during(grab),agent(john),theme(the_apple)). continue(event(grab),theme(the_apple)). transfer(during(grab),theme(the_apple)). cause(agent(john),event(grab)).

Algorithm 1: Semantic Knowledge Generation
Input: pt: constituency parse tree of a sentence
Output: semantics: sentence semantics
procedure GetSentenceSemantics(pt)
    verbs ← getVerbs(pt)                ◁ list of verbs present in the sentence
    semantics ← {}                      ◁ initialization
    for each v ∈ verbs do
        classes ← getVNClasses(v)       ◁ get the VerbNet classes of the verb
        for each c ∈ classes do
            frames ← getVNFrames(c)     ◁ get the VerbNet frames of the class
            for each f ∈ frames do
                thematicRoles ← getThematicRoles(pt, f.syntax, v)   ◁ see Algorithm 2
                semantics ← semantics ∪ getSemantics(thematicRoles, f.semantics)
                                        ◁ map the thematic roles into the frame semantics
            end for
        end for
    end for
    return semantics
end procedure

Algorithm 2: Partial Tree Matching
Input: pt: constituency parse tree of a sentence; s: frame syntax; v: verb
Output: tr: thematic role set, or the empty set {}
procedure GetThematicRoles(pt, s, v)
    root ← getSubTree(node(v), pt)      ◁ the sub-tree rooted at the parent of the verb node
    while root do
        tr ← getMatching(root, s)       ◁ if s matches the tree, return its thematic roles, else {}
        if tr ≠ {} then return tr
        end if
        root ← getSubTree(root, pt)     ◁ returns false if root equals pt
    end while
    return {}
end procedure

The ASP code generated by the above approach represents the meaning of a sentence built around an action verb. Since VerbNet does not cover the semantics of the 'be' verbs (i.e., am, is, are, have, etc.), for sentences containing 'be' verbs the semantic generator uses a pre-defined, handcrafted mapping of the parsed information (i.e., syntactic parse tree, dependency graph, etc.) to its semantics; this semantics is also represented as ASP code. The generated ASP code can then be used in various applications, such as natural language QA, summarization, information extraction, conversational agents (CAs), etc.
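As an illustration of what such a handcrafted mapping could produce (the paper only shows the property/4 relation used in the query of Section 5.2; the exact mapping rules are not given), a copular sentence such as "Mary is in the kitchen" might be translated from its parse tree and dependency graph into a fact like:

% hypothetical output of the handcrafted 'be'-verb mapping
property(location, t1, mary, kitchen).   % at time t1, mary's location is the kitchen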
5.2. SQuARE

Question answering for reading comprehension is a challenging task for the NLU research community. In recent times, with the advancement of ML applied to NLU, researchers have created more advanced QA systems that show outstanding performance on reading-comprehension QA tasks. However, for these high-performing neural-network-based agents, the question arises whether they really "understand" the text or not. These systems are outstanding at learning data patterns and then predicting answers that require shallow or no reasoning capabilities. Moreover, for some QA tasks, if a system claims to perform as well as or better than a human in terms of accuracy, then the system must also show human-level intelligence in explaining its answers. Taking all this into account, we have created the SQuARE QA system, which uses an ML-based parser to generate the syntax tree and uses Algorithm 1 to translate a sentence into its knowledge in ASP. By using the ASP-coded knowledge along with pre-defined generic commonsense knowledge, SQuARE outperforms other ML-based systems, achieving 100% accuracy in 18 tasks (99.9% accuracy over all 20 tasks) of the bAbI QA dataset; the remaining inaccuracy is due to a flaw in the dataset, not in our system. SQuARE is also capable of generating an English justification for its answers.

SQuARE is composed of two main subsystems: the semantic generator and the ASP query generator. Both subsystems in the SQuARE architecture (illustrated in Figure 4) share the common valuation function.

Figure 4: SQuARE framework. The text and the question are parsed by the natural language processor (CoreNLP and spaCy) into syntactic parse trees; the semantic valuation function produces semantic knowledge in ASP, the ASP query generator produces the query, and the s(CASP) ASP query engine combines them with commonsense knowledge to compute the answer.

Example: To demonstrate the power of the SQuARE system, we next discuss a full-fledged example showing the data flow and the intermediate results.

Story: A customized segment of a story from the bAbI QA dataset about counting objects (Task 7) is taken:
1 John moved to the bedroom.
2 John got the football there.
3 John grabbed the apple there.
4 John picked up the milk there.
5 John gave the apple to Mary.
6 John left the football.

Parsed Output: The CoreNLP and spaCy parsers parse each sentence of the story and pass the parsed information to the semantic generator. Details are omitted due to lack of space; parsing can easily be tried at https://corenlp.run/.

Semantics: From the parsed information, the semantic generator generates the semantic knowledge in ASP. Due to space constraints, we only give a snippet of the knowledge generated from the third sentence of the story (the VerbNet details of the verb grab are given in Figure 1):

contact(t3,during(grab),agent(john),theme(the_apple)).
cause(t3,agent(john),event(grab)).
transfer(t3,during(grab),theme(the_apple)).

Question and ASP Query: For the question "How many objects is John carrying?", the ASP query generator generates a generic query rule and the specific ASP query (using the process template for counting):

count_object(T,Per,Count) :-
    findall(O,property(possession,T,Per,O),Os),
    set(Os,Objects),list_length(Objects,Count).
?- count_object(t6,john,Count).

Answer: The s(CASP) system finds the correct answer: 1.

Justification: The justification generated by the s(CASP) system for this answer is shown in Figure 5.

Figure 5: Natural language justification generated by s(CASP):
The total count of all the objects that john is possessing at time t6 is 1, because
  [the_milk] is the list of all the objects that are possessed by john at time t6, because
    the_milk is possessed by john at time t6, because
      time t6 comes after time t5, and
      the_milk is possessed by john at time t5, because
        time t5 comes after time t4, and
        the_milk is possessed by john at time t4, and
        there is no evidence that the_milk is not possessed by john at time t5.
      there is no evidence that the_milk is not possessed by john at time t6.
  The list [the_milk] is generated after removing duplicates from the list [the_milk], because
    The list [] is generated after removing duplicates from the list [].
  1 is the length of the list [the_milk], because
    0 is the length of the list [].
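The persistence reasoning visible in this justification ("there is no evidence that the_milk is not possessed ...") corresponds to a commonsense law of inertia written as a default. A simplified sketch of such a rule, using our own auxiliary names alongside the property/4 relation above, is:

% an object remains in a person's possession at the next time step
% unless there is evidence to the contrary;
% '-' denotes strong (classical) negation, 'not' negation as failure
property(possession, T1, Per, O) :-
    time_after(T1, T),
    property(possession, T, Per, O),
    not -property(possession, T1, Per, O).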
5.3. StaCACK

Conversational AI has been an active area of research, starting from rule-based systems such as ELIZA [19] and PARRY [20] to recent open-domain, data-driven CAs like Amazon's Alexa, Google Assistant, or Apple's Siri. Early rule-based bots were based on syntax analysis alone, while the main challenge for modern ML-based chatbots is the lack of "understanding" of the conversation. A realistic socialbot should be able to understand and reason like a human. In human-to-human conversations, we do not always spell out every detail; we expect the listener to fill the gaps through commonsense knowledge. Also, our thinking process is flexible and non-monotonic in nature, which means "what we believe today may become false in the future with new knowledge". We can model this human thinking process with (i) default rules, (ii) exceptions to defaults, and (iii) preferences over multiple defaults [4].

Following the discussion above, we have created StaCACK, a general closed-domain chatbot framework. StaCACK is a stateful framework that maintains state by remembering every past dialog between the user and itself. The main difference between StaCACK and other stateful or stateless chatbot models is the use of commonsense knowledge for understanding user utterances and generating responses. Moreover, it is capable of non-monotonic reasoning by using defaults with exceptions and preferences in ASP.

Task-specific CAs follow a certain scheme in their inquiry that can be modeled as a finite state machine (FSM). The FSM is illustrated in Figure 6. However, the tasks in each state transition are not simple, as every level requires different types of (commonsense) reasoning.

Figure 6: FSM for the StaCACK framework. Its states include: understand user intent, ask preferences based on the intent, verify and update the query, execute the query, ask for other details, provide result(s), and complete the task and give details.

StaCACK achieves 100% accuracy on the Facebook bAbI dialog dataset suite [21] (including the OOV, out-of-vocabulary, datasets) of five tasks created for a restaurant reservation dialog system. In addition, StaCACK can answer questions that ML chatbots cannot answer without proper training (details are given in the following sections). We focus on agents that are designed for a specific task (e.g., restaurant reservation).

Example: StaCACK is able to hold a conversation in a more natural way by using commonsense knowledge, which may not be possible for a rule-based system built on a monotonic logic. The following example shows how StaCACK understands the cuisine preference of a user just by reasoning over commonsense information about cuisines (namely, that curry is predominant in Indian and Thai cuisine).

User: Good morning.
StaCACK: Hello, what can I help you with today?
User: Can you make a restaurant reservation in London in a cheap price range?
StaCACK: I'm on it.
StaCACK: Any preference on a type of cuisine?
User: Anything, except Lebanese food.
StaCACK: Do you want to have Chinese food?
User: I want to have curry.
StaCACK: Do you like Indian or Thai?
User: Thai.
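The cuisine inference in this dialog can be pictured as a small piece of commonsense knowledge plus a default over the user's stated constraints; the encoding below is our own illustrative sketch, not StaCACK's actual rules:

% commonsense: curry is predominant in indian and thai cuisine
serves(indian, curry).
serves(thai, curry).
serves(lebanese, shawarma).
% facts gathered from the dialog
wants(user, curry).
excluded(user, lebanese).
% a cuisine is a candidate if it serves what the user wants
% and the user has not excluded it
candidate_cuisine(C) :- wants(user, D), serves(C, D), not excluded(user, C).
% ?- candidate_cuisine(C). yields C = indian and C = thai,
% prompting the follow-up question "Do you like Indian or Thai?"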
The datasets are designed in Basic Induction 99 93.6 100 such a way that it becomes easy for human to reason Positional Reasoning 60 100 100 and reach an answer with proper justification whereas Size Reasoning 95 100 100 difficult for machines due to the lack of understanding Path Finding 35 100 100 about the language. In the SQuARE system, the accuracy Agent’s Motivations 100 100 100 has been calculated by matching the generated answer MEAN ACCURACY 94 100 100 with the actual answer given in the bAbI QA dataset. Table 2 Whereas, StaCACK’s accuracy is calculated on the basis SQuARE accuracy (%) comparison of per-response as well as per-dialog. Table 2 and table 3 compares our results in terms of accuracy with the ex- Mem2Seq BossNet StaCACK believe that intelligent systems that emulate human abil- Task 1 100 (100) 100 (100) 100 (100) ity should follow this approach, especially, if we desire Task 2 100 (100) 100 (100) 100 (100) true understanding and explainability. Task 3 94.7 (62.1) 95.2 (63.8) 100 (100) CASPR’s conversation planning is centered around a Task 4 100 (100) 100 (100) 100 (100) loop in which it moves from topic to topic, and within a Task 5 97.9 (69.6) 97.3 (65.6) 100 (100) topic, it moves from one attribute of that topic to another. Task 1 (OOV) 94.0 (62.2) 100 (100) 100 (100) Thus, CASPR has an outer conversation loop to hold the Task 2 conversation at the topmost level and an inner loop in (OOV) 86.5 (12.4) 100 (100) 100 (100) which it moves from attribute to attribute of a topic. The Task 3 logic of these loops is slightly involved, as a user may (OOV) 90.3 (38.7) 95.7 (66.6) 100 (100) return to a topic or an attribute at any time, and CASPR Task 4 (OOV) 100 (100) 100 (100) 100 (100) must remember where the user left off in that topic or Task 5 attribute. For the inner loops, CASPR uses a template, (OOV) 84.5 (2.3) 91.7 (18.5) 100 (100) called conversational knowledge template (CKT), that can be used to automatically generate code that loops Table 3 over the attributes of a topic, or loops through various StaCACK accuracy per response (per dialog) in %. dialogs (mini-CKT) that need to be spoken by CASPR for a given topic. isting state-of-the-art results for SQuARE and StaCACK system respectively. 7. Conclusion and Future Work 6. Social-Bot In this paper, we discussed about our ASP based ap- proaches to overcome the challenges of NLU. In the pro- Using the similar technology of the StaCACK system, We cess of that we presented a visual question answering have designed and developed the CASPR system, a social- framework — AQuA. In the textual QA domain, we in- bot designed to compete in the Amazon Alexa Socialbot troduced to our novel semantics-driven English text to Challenge 4. CASPR’s distinguishing characteristic is answer set program generator. Also, we showed how that it will use automated commonsense reasoning to commonsense reasoning coded in ASP can be leveraged truly “understand” dialogs, allowing it to converse like a to develop advanced NLU applications, such as SQuARE human. Three main requirements of a socialbot are that it and StaCACK. We make use of the s(CASP) engine, a should be able to “understand” users’ utterances, possess query-driven implementation of ASP, to perform reason- a strategy for holding a conversation, and be able to learn ing while generating a natural language explanation for new knowledge. We developed techniques such as con- any computed answer. 
7. Conclusion and Future Work

In this paper, we discussed our ASP-based approaches to overcoming the challenges of NLU. In the process, we presented a visual question answering framework, AQuA. In the textual QA domain, we introduced our novel semantics-driven English text to answer set program generator. Also, we showed how commonsense reasoning coded in ASP can be leveraged to develop advanced NLU applications such as SQuARE and StaCACK. We make use of the s(CASP) engine, a query-driven implementation of ASP, to perform reasoning while generating a natural language explanation for any computed answer. Finally, we discussed the design philosophy behind our socialbot CASPR and how we qualified to participate in the Amazon Alexa Socialbot Challenge 4. As part of future work, we plan to extend the SQuARE system to handle more complex sentences and eventually handle complex stories. Our goal is also to develop an open-domain conversational AI chatbot based on automated commonsense reasoning that can "converse" with a human based on "truly understanding" that person's utterances.

References

[1] K. Basu, S. C. Varanasi, F. Shakerin, G. Gupta, SQuARE: Semantics-based question answering and reasoning engine, arXiv preprint arXiv:2009.10239 (2020).
[2] K. Kipper, A. Korhonen, N. Ryant, M. Palmer, A large-scale classification of English verbs, Language Resources and Evaluation 42 (2008) 21–40. doi:10.1007/s10579-007-9048-2.
[3] K. Basu, F. Shakerin, G. Gupta, AQuA: ASP-based visual question answering, in: International Symposium on Practical Aspects of Declarative Languages, Springer, 2020, pp. 57–72.
[4] M. Gelfond, Y. Kahl, Knowledge representation, reasoning, and the design of intelligent agents: The answer-set programming approach, Cambridge University Press, 2014.
[5] M. Gelfond, V. Lifschitz, The stable model semantics for logic programming, in: ICLP/SLP, volume 88, 1988, pp. 1070–1080.
[6] J. Arias, M. Carro, E. Salazar, K. Marple, G. Gupta, Constraint answer set programming without grounding, TPLP 18 (2018) 337–354. doi:10.1017/S1471068418000285.
[7] J. Arias, M. Carro, Z. Chen, G. Gupta, Justifications for goal-directed constraint answer set programming, arXiv preprint arXiv:2009.10238 (2020).
[8] D. A. Schmidt, Denotational semantics: a methodology for language development, William C. Brown Publishers, Dubuque, IA, USA, 1986.
[9] B. Levin, English verb classes and alternations: A preliminary investigation, U. Chicago Press, 1993. doi:10.1075/fol.2.1.16noe.
[10] C. Baral, Knowledge representation, reasoning and declarative problem solving, Cambridge University Press, 2003.
[11] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[12] J. Johnson, et al., Inferring and executing programs for visual reasoning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2989–2998.
[13] J. Suarez, J. Johnson, F.-F. Li, DDRprog: A CLEVR differentiable dynamic reasoning programmer, arXiv preprint arXiv:1803.11361 (2018).
[14] K. Yi, et al., Neural-symbolic VQA: Disentangling reasoning from vision and language understanding, in: NIPS'18, 2018, pp. 1031–1042.
[15] D. Davidson, Inquiries into truth and interpretation: Philosophical essays, volume 2, Oxford University Press, 2001.
[16] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, in: IEEE CVPR'17, 2017, pp. 2901–2910.
[17] K. Basu, S. Varanasi, F. Shakerin, J. Arias, G. Gupta, Knowledge-driven natural language understanding of English text and its applications, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 12554–12563.
[18] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: ACL System Demonstrations, 2014, pp. 55–60. doi:10.3115/v1/P14-5010.
[19] J. Weizenbaum, ELIZA—a computer program for the study of natural language communication between man and machine, CACM 9 (1966) 36–45.
[20] K. M. Colby, S. Weber, F. D. Hilf, Artificial paranoia, Artificial Intelligence 2 (1971) 1–25.
[21] A. Bordes, Y.-L. Boureau, J. Weston, Learning end-to-end goal-oriented dialog, arXiv preprint arXiv:1605.07683 (2016).
[22] J. Weston, et al., Towards AI-complete question answering: A set of prerequisite toy tasks, arXiv preprint arXiv:1502.05698 (2015).
[23] D. Kahneman, Thinking, fast and slow, Macmillan, 2011.