Natural Language Question Answering with Goal-directed
Answer Set Programming
Kinjal Basu, Gopal Gupta
The University of Texas at Dallas, Richardson, Texas, USA


Abstract
Understanding the meaning of a text is a fundamental challenge of natural language understanding (NLU) research. An ideal NLU system should process a language in a way that is not exclusive to a single task or dataset. For that, a knowledge-driven, generalized semantic representation of English text is of utmost importance for any NLU application. Ideally, for any realistic (human-like) NLU system, commonsense reasoning must be an integral part, and goal-directed answer set programming (ASP) is indispensable for performing commonsense reasoning. Keeping all of this in mind, we have developed various NLU applications, ranging from visual question answering to a conversational agent. In contrast to existing purely machine learning-based methods for the same tasks, we have shown that our applications not only maintain high accuracy but also provide explanations for the answers they compute.

Keywords
Answer Set Programming, Natural Language Understanding, Question Answering, Conversational Agent



ICLP'21: International Conference on Logic Programming, September 2021
Kinjal.Basu@utdallas.edu (K. Basu); gupta@utdallas.edu (G. Gupta)

1. Introduction

The long-term goal of natural language understanding (NLU) research is to make applications, e.g., chatbots and visual/textual question answering (QA) systems, that act exactly like a human assistant. A human assistant will understand the user's intent and fulfill the task. The task can be answering questions about a story or an image, giving directions to a place, or reserving a table in a restaurant by knowing the user's preferences. Human-level understanding of natural language is needed for an NLU application that aspires to act exactly like a human. To understand the meaning of a natural language sentence, humans first process the syntactic structure of the sentence and then infer its meaning. Also, humans use commonsense knowledge to understand the often complex and ambiguous meaning of natural language sentences. Humans interpret a passage as a sequence of sentences and will normally process the events in the story in the same order as the sentences. Once humans understand the meaning of a passage, they can answer questions posed, along with an explanation for the answer. Similarly, for visual question answering, a human first forms a representation of the image in the mind and is then able to answer natural language questions about it by understanding their intent. Moreover, by using commonsense, a human assistant understands the user's intended task and asks the user questions about the information required to successfully carry out the task. Also, to hold a goal-oriented conversation, a human remembers all the details given in the past and most of the time performs non-monotonic reasoning to accomplish the assigned task. We believe that an automated QA system or a goal-oriented closed-domain chatbot should work in a similar way.

If we want to build AI systems that emulate humans, then understanding natural language sentences is the foremost priority for any NLU application. In an ideal scenario, an NLU application should map the sentence to the knowledge (semantics) it represents, augment it with commonsense knowledge related to the concepts involved, just as humans do, and then use the combined knowledge to do the required reasoning. In this paper, we introduce our algorithm [1] for automatically generating the semantics corresponding to each English sentence using the comprehensive lexicon of English verbs, VerbNet [2]. For each English verb, VerbNet gives the syntactic and semantic patterns. The algorithm employs partial syntactic matching between the parse tree of a sentence and a verb's frame syntax from VerbNet to obtain the meaning of the sentence in terms of VerbNet's primitive predicates. This matching is motivated by the denotational semantics of programming languages and can be thought of as mapping parse trees of sentences to knowledge that is constructed out of the semantics provided by VerbNet. The VerbNet semantics is expressed using a set of primitive predicates that can be thought of as the semantic algebra of the denotational semantics.

Answering questions about a given picture, or Visual Question Answering (VQA), can be processed similarly to textual QA. To answer questions about a picture, humans generally first recognize the objects in the picture, then they reason about the questions asked using their commonsense knowledge. To be effective, we believe a VQA system should work in a similar way. Thus, to perceive a picture, ideally, a system should have intuitive abilities like object and attribute recognition and understanding of spatial relationships. To answer questions, it must use reasoning. Natural language questions are complex and ambiguous by nature, and also require commonsense knowledge for their interpretation. Most importantly, reasoning skills such as counting, inference, comparison, etc., are needed to answer these questions. Here, we present our VQA work, AQuA (ASP-based Visual Question Answering), which closely simulates the above-described behavior of an ideal VQA system [3].
2. Background

Answer Set Programming (ASP): An answer set program is a collection of rules of the form:

    𝑙0 ← 𝑙1, ..., 𝑙𝑚, not 𝑙𝑚+1, ..., not 𝑙𝑛.

where each 𝑙𝑖 is a literal in the sense of classical logic [4]. In an ASP rule, the left-hand side is called the head and the right-hand side is the body. Constraints are ASP rules without a head, whereas facts are rules without a body. Variables start with an uppercase letter, while predicates and constants begin with a lowercase one. We will follow this convention throughout the paper. The semantics of ASP is based on the stable model semantics of logic programming [5]. ASP supports negation as failure [4], allowing it to elegantly model commonsense reasoning, default rules with exceptions, etc., and serves as the secret sauce for AQuA's sophistication.

s(CASP) System: s(CASP) [6] is a query-driven, goal-directed implementation of ASP that includes constraint solving over reals. Goal-directed execution of s(CASP) is indispensable for automating commonsense reasoning, as traditional grounding and SAT-solver based implementations of ASP may not be scalable. There are three major advantages of using the s(CASP) system: (i) s(CASP) does not ground the program, which makes our framework scalable, (ii) it only explores the parts of the knowledge base that are needed to answer a query, and (iii) it provides a natural language justification (proof tree) for an answer [7].
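To make this syntax concrete, the small program below shows a fact, a rule that uses negation as failure, and a constraint. It is a sketch of our own; the predicates are illustrative and do not come from any of the systems described in this paper.

    % fact: a rule with an empty body
    person(john).

    % rule: the head holds if the body holds;
    % 'not' denotes negation as failure
    happy(X) :- person(X), not sad(X).

    % constraint: a rule with an empty head rules out any
    % model in which its body holds
    :- happy(X), sad(X).

Under the stable model semantics, the query ?- happy(john). succeeds, since sad(john) cannot be established. A goal-directed system such as s(CASP) evaluates this query top-down, without grounding the program first.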
Denotational Semantics: In programming language research, denotational semantics is a widely used approach to formalize the meaning of a programming language in terms of mathematical objects (called domains, such as integers, truth values, tuples of values, and mathematical functions) [8]. The denotational semantics of a programming language has three components [8]:

1. Syntax: specified as abstract syntax trees.
2. Semantic Algebra: these are the basic domains along with the associated operations; the meaning of a program is expressed in terms of these basic operations applied to elements of the domains.
3. Valuation Function: these are mappings from abstract syntax trees (and possibly the semantic algebra) to values in the semantic algebra.

Given a program P written in language L, P's denotation (meaning), expressed in terms of the semantic algebra, is obtained by applying the valuation function of L to program P's syntax tree. Details can be found elsewhere [8].

VerbNet: Inspired by Beth Levin's classification of verbs and their syntactic alternations [9], VerbNet [2] is the largest online network of English verbs. A verb class in VerbNet is mainly expressed by syntactic frames, thematic roles, and semantic representation. The VerbNet lexicon identifies thematic roles and syntactic patterns of each verb class and infers the common syntactic structure and semantic relations for all the member verbs. Figure 1 shows an example of a VerbNet frame of the verb class grab.

              NP V NP
  Example     "She grabbed the rail"
  Syntax      Agent V Theme
  Semantics   Continue(E,Theme), Cause(Agent,E),
              Contact(During(E),Agent,Theme)

Figure 1: VerbNet frame instance for the verb class grab

3. Commonsense Reasoning with Default Theories

As mentioned earlier, a realistic socialbot should be able to understand and reason like a human. In human-to-human conversations, we do not always tell every detail; we expect the listener to fill the gaps through their commonsense knowledge and commonsense reasoning. Thus, to obtain a conversational bot, we need to automate commonsense reasoning, i.e., automate the human thought process. The human thought process is flexible and non-monotonic in nature, which means "what we believe today may become false in the future with new knowledge". We can model commonsense reasoning with (i) default rules, (ii) exceptions to defaults, (iii) preferences over multiple defaults [5], and (iv) modeling multiple worlds [4, 10].
Much of human knowledge consists of default rules, for example, the rule: Normally, birds fly. However, there are exceptions to defaults, for example, penguins are exceptional birds that do not fly. Reasoning with default rules is non-monotonic, as a conclusion drawn using a default rule may have to be withdrawn if more knowledge becomes available and the exceptional case applies. For example, if we are told that Tweety is a bird, we will conclude that it flies. Later, knowing that Tweety is a penguin will cause us to withdraw our earlier conclusion.

Humans often make inferences in the absence of complete information. Such an inference may be revised later as more information becomes available. This human-style reasoning is elegantly captured by default rules and exceptions. Preferences are needed when there are multiple default rules, in which case additional information gleaned from the context is used to resolve which rule is applicable. One could argue that expert knowledge amounts to learning defaults, exceptions, and preferences in the field that a person is an expert in.

Also, humans can naturally deal with multiple worlds. These worlds may be consistent with each other in some parts, but inconsistent in other parts. For example, animals don't talk like humans in the real world; however, in the cartoon world, animals do talk like humans. So a fish called Nemo may be able to swim in both the real world and the cartoon world, but can only talk in the cartoon world. Humans have no trouble separating the cartoon world from the real world and switching between the two as the situation demands. Default reasoning augmented with the ability to operate in multiple worlds allows one to closely represent the human thought process. Default rules with exceptions and preferences and multiple worlds can be elegantly realized with answer set programming [4, 10] and the s(CASP) system [6].
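As an illustration, the default and exception discussed above, together with a two-world encoding of the Nemo example, can be sketched in ASP as follows (the predicate names are our own choices):

    % default: birds normally fly
    flies(X) :- bird(X), not abnormal_bird(X).
    % exception: penguins are abnormal birds that do not fly
    abnormal_bird(X) :- penguin(X).

    bird(tweety).
    % ?- flies(tweety). succeeds; adding the fact
    % penguin(tweety). later withdraws this conclusion

    % multiple worlds: an even loop over negation as failure
    % yields two stable models, one per world
    world(cartoon) :- not world(real).
    world(real) :- not world(cartoon).

    fish(nemo).
    swims(nemo).                    % holds in both worlds
    talks(nemo) :- world(cartoon).  % holds only in the cartoon world

Preferences among multiple defaults can be encoded in the same style, by guarding a lower-priority default with the negation of a higher-priority one.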
                                                               questions to a logical representation before feeding it
                                                               to the ASP engine. The logical representation module
4. Visual Question Answering                                   is inspired by Neo-Davidsonian formalism [15], where
                                                               every event is recognized with a unique identifier. Next,
Our work — AQuA (ASP-based Question Answering) is              the semantic relation labeling is the process of assigning
an Answer Set Programming (ASP) based visual question          relationship labels to two different phrases in a sentence
answering framework that truly “understands” an input
picture and answers natural language questions about
that picture [3]. This framework achieves 93.7% accu-
racy on CLEVR dataset, which exceeds human baseline
performance. What is significant is that AQuA trans-
lates a question into an ASP query without requiring
any training. AQuA replicates a human’s VQA behavior
by incorporating commonsense knowledge and using
ASP for reasoning. VQA in the AQuA framework em-
ploys the following sources of knowledge: (i) knowledge
about objects extracted using the YOLO algorithm [11],
(ii) semantic relations extracted from the question, (iii)
query generated from the question, and (iv) common-
sense knowledge. AQuA runs on the query-driven, scal-
able s(CASP) [6] answer set programming system that
can provide a proof tree as a justification for the query
being processed.                                           Figure 2: AQuA System Architecture
    AQuA processes and reasons over raw textual ques-
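For instance, in a Neo-Davidsonian-style encoding, the event described by the sentence "John grabbed the apple" is reified with its own identifier, and the participants are attached to that identifier. The snippet below is an illustration of the idea, not necessarily AQuA's exact internal format:

    % the grabbing event receives the unique identifier e1
    event(e1, grab).
    agent(e1, john).       % who performed the event
    theme(e1, the_apple).  % what the event acted upon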
Next, semantic relation labeling is the process of assigning relationship labels to two different phrases in a sentence based on the context. To understand the CLEVR dataset questions, AQuA requires two types of semantic relations (i.e., quantification and property) to be extracted (if they exist) from the questions. Based on the knowledge from a question, AQuA generates a list of ASP clauses with the query, which runs on the s(CASP) engine to find the answer. In general, questions with one-word answers are categorized into: (i) yes/no questions, and (ii) attribute/value questions. Similar to a human, AQuA requires commonsense knowledge to correctly compute answers to questions. For the CLEVR dataset questions, AQuA needs to have commonsense knowledge about properties (e.g., color, size, material), directions (e.g., left, front), and shapes (e.g., cube, sphere). AQuA will not be able to understand question phrases such as '... red metal cube ...' unless it knows that red is a color, metal is a material, and cube is a shape. Finally, the ASP engine is the brain of our system. All the knowledge (image representation, commonsense knowledge, semantic relations) and the query in ASP syntax are executed using the query-driven s(CASP) system.
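A handful of ASP facts suffice to capture the commonsense knowledge just mentioned. The following sketch uses predicate names of our own choosing, not AQuA's exact encoding:

    % property classes: red is a color, metal is a material,
    % cube is a shape
    is_color(red).
    is_material(metal).
    is_shape(cube).

    % lexical commonsense mentioned earlier: block means cube,
    % shiny object means metal object
    means(block, cube).
    means(shiny, metal).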
4.2. Experiments and Results

We tested our AQuA framework on the CLEVR dataset [16] and obtained an accuracy of 93.7%, with 42,314 correct answers out of 45,157 questions. This performance is beyond the average human accuracy. Quantitative results for each question type are summarized in Table 1.

Question Type           Accuracy (%)
Exist                       96
Count                       91.7
Compare Value               92.89
  Shape                     87.42
  Color                     94.32
  Size                      92.17
  Material                  96.14
Compare Integer             98.05
  Less Than                 97.7
  Greater Than              98.6
  Equal                     NA*
Query Attribute             94.39
  Shape                     94.01
  Color                     94.87
  Size                      93.82
  Material                  94.75

Table 1: AQuA Performance Results
* Equality questions are minuscule in number, so they are currently ignored.

We have extensively studied the 2,843 questions that produced erroneous results. Our manual analysis showed that mismatches happen mostly because of errors caused by the YOLO module: failing to detect a partially visible object, wrongly detecting a shadow as an object, wrongly detecting two overlapping objects as one, etc. Other reasons for wrong answers are wrong parsing or oversimplified spatial reasoning.

5. Textual Question Answering

Unlike programming languages, the denotation of a natural language can be quite ambiguous. English is no exception, and the meaning of a word or sentence may depend on the context. The generation of correct knowledge from a sentence, hence, is quite hard. We have developed a VerbNet-based algorithm for semantic generation of English text. In this section, we present a novel approach to automatically map parse trees of simple English sentences to their denotations, i.e., the knowledge they represent [17]. We applied this approach to construct two NLU applications that we present here: SQuARE (Semantic-based Question Answering and Reasoning Engine) and StaCACK (Stateful Conversational Agent using Commonsense Knowledge).

5.1. Semantics-driven ASP Code Generation

Similar to the denotational approach for meaning representation of a programming language, an ideal NLU system should use denotational semantics to compositionally map text syntax to its meaning. Knowledge primitives should be represented using the semantic algebra [8] of well-understood concepts. Then the semantics, along with the commonsense knowledge represented using the same semantic algebra, can be used to construct different NLU applications, such as a QA system, chatbot, information extraction system, text summarization, etc. The ambiguous nature of natural language is the main hurdle in treating it as a programming language. English is no exception, and the meaning of an English word or sentence may depend on the context. The algorithm we present takes the syntactic parse tree of an English sentence and uses VerbNet to automatically map the parse tree to its denotation, i.e., the knowledge it represents.

An English sentence that contains an action verb (i.e., not a be verb) always describes an event. The verb also constrains the relation among the event participants. VerbNet encapsulates all of this information using verb classes, where each class represents a set of verbs with similar meanings. So each verb is a part of one or more classes. For each class, VerbNet provides the skeletal parse tree (frame syntax) for the different usages of the verb class and the respective semantics (frame semantics). The semantic definition of each frame uses pre-defined predicates of VerbNet that have thematic roles (AGENT, THEME, etc.) as arguments. Thus, we can imagine VerbNet as a very large valuation (semantic) function that maps syntax tree patterns to their respective meanings. As we use ASP to represent the knowledge, the algorithm generates the sentence's semantic definition in ASP.
Our goal is to find the partial matching between the sentence parse tree and the VerbNet frame syntax and ground the thematic-role variables, so that we can get the semantics of the sentence from the frame semantics and represent it in ASP.

The process of semantic knowledge generation from a sentence is illustrated in Figure 3. We have used Stanford's CoreNLP parser [18] to generate the parse tree, pt, of an English sentence. The semantic generator component consists of the valuation function that maps pt to its meaning. To accomplish this, we have introduced the Semantic Knowledge Generation algorithm (Algorithm 1). First, the algorithm collects the list of verbs mentioned in the sentence, and for each verb it accumulates all the syntactic (frame syntax) and corresponding semantic information (thematic roles and predicates) from VerbNet using the verb class of the verb. The algorithm finds the grounded thematic-role variables by doing a partial tree matching (described in Algorithm 2) between each gathered frame syntax and pt. From the verb node of pt, the partial tree matching algorithm performs a bottom-up search and, at each level, through a depth-first traversal, it tries to match the skeletal parse tree of the frame syntax. If the algorithm finds an exact or a partial match (by skipping words, e.g., prepositions), it returns the thematic roles to the parent Algorithm 1. Finally, Algorithm 1 grounds the pre-defined predicates with the values of the thematic roles and generates ASP code.

Figure 3: English to ASP translation process. The Stanford CoreNLP parser converts the sentence "John grabbed the apple there" into a parse tree; the semantic generator (valuation function), using the VerbNet frames of the verb grab, maps the parse tree to the sentence semantics represented in ASP:

    contact(during(grab),agent(john),theme(the_apple)).
    continue(event(grab),theme(the_apple)).
    transfer(during(grab),theme(the_apple)).
    cause(agent(john),event(grab)).

Algorithm 1 Semantic Knowledge Generation
  Input: pt: constituency parse tree of a sentence
  Output: semantics: sentence semantics
   1: procedure GetSentenceSemantics(pt)
   2:   verbs ← getVerbs(pt)          ◁ returns the list of verbs present in the sentence
   3:   semantics ← {}                ◁ initialization
   4:   for each v ∈ verbs do
   5:     classes ← getVNClasses(v)   ◁ get the VerbNet classes of the verb
   6:     for each c ∈ classes do
   7:       frames ← getVNFrames(c)   ◁ get the VerbNet frames of the class
   8:       for each f ∈ frames do
   9:         thematicRoles ← getThematicRoles(pt, f.syntax, v)   ◁ see Algorithm 2
  10:         semantics ← semantics ∪ getSemantics(thematicRoles, f.semantics)
                                      ◁ map the thematic roles into the frame semantics
  11:       end for
  12:     end for
  13:   end for
  14:   return semantics
  15: end procedure

Algorithm 2 Partial Tree Matching
  Input: pt: constituency parse tree of a sentence; s: frame syntax; v: verb
  Output: tr: thematic role set, or the empty set {}
   1: procedure GetThematicRoles(pt, s, v)
   2:   root ← getSubTree(node(v), pt)   ◁ returns the sub-tree rooted at the parent of the verb node
   3:   while root do
   4:     tr ← getMatching(root, s)      ◁ if s matches the tree, return the thematic roles, else {}
   5:     if tr ≠ {} then return tr
   6:     end if
   7:     root ← getSubTree(root, pt)    ◁ returns false if root equals pt
   8:   end while
   9:   return {}
  10: end procedure

The ASP code generated by the above-mentioned approach represents the meaning of a sentence containing an action verb. Since VerbNet does not cover the semantics of the be verbs (i.e., am, is, are, have, etc.), for sentences containing be verbs the semantic generator uses a pre-defined, handcrafted mapping of the parsed information (i.e., syntactic parse tree, dependency graph, etc.) to its semantics. This semantics is also represented as ASP code. The generated ASP code can now be used in various applications, such as natural language QA, summarization, information extraction, Conversational Agents (CA), etc.
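For example, a copular sentence can be mapped directly to a property-style fact. The encoding below is our illustration of such a handcrafted mapping and is not necessarily the generator's exact output:

    % "The football is in the garden" at time t1
    property(location, t1, the_football, garden).
    % "Mary is hungry" at time t1
    property(state, t1, mary, hungry).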
5.2. SQuARE

Question answering for reading comprehension is a challenging task for the NLU research community. In recent times, with the advancement of ML applied to NLU, researchers have created more advanced QA systems that show outstanding performance on QA for reading-comprehension tasks. However, for these high-performing neural-network based agents, the question arises whether they really "understand" the text or not. These systems are outstanding at learning data patterns and then predicting answers that require shallow or no reasoning capabilities. Moreover, for some QA tasks, if a system claims that it performs equal to or better than a human in terms of accuracy, then the system must also show human-level intelligence in explaining its answers. Taking all this into account, we have created our SQuARE QA system, which uses an ML-based parser to generate the syntax tree and uses Algorithm 1 to translate a sentence into its knowledge in ASP. By using the ASP-coded knowledge along with pre-defined generic commonsense knowledge, SQuARE outperforms other ML-based systems by achieving 100% accuracy in 18 tasks (99.9% accuracy over all 20 tasks) of the bAbI QA dataset (note that the remaining 0.1% inaccuracy is due to a flaw in the dataset, not in our system). SQuARE is also capable of generating English justifications for its answers.

SQuARE is composed of two main subsystems: the semantic generator and the ASP query generator. Both subsystems inside the SQuARE architecture (illustrated in Figure 4) share the common valuation function.

Figure 4: SQuARE Framework. The natural language processor (CoreNLP & spaCy) produces the syntactic parse trees of the text and the question; the semantic generator and the ASP query generator, which share the valuation function and the commonsense knowledge, produce the semantic knowledge in ASP and the ASP query; both are executed by the s(CASP) engine to compute the answer.

Example: To demonstrate the power of the SQuARE system, we next discuss a full-fledged example showing the data flow and the intermediate results.
Story: A customized segment of a story from the bAbI QA dataset about counting objects (Task 7) is taken.

    1 John moved to the bedroom.
    2 John got the football there.
    3 John grabbed the apple there.
    4 John picked up the milk there.
    5 John gave the apple to Mary.
    6 John left the football.

Parsed Output: The CoreNLP and spaCy parsers parse each sentence of the story and pass the parsed information to the semantic generator. Details are omitted due to lack of space; however, parsing can be easily done at https://corenlp.run/.
Semantics: From the parsed information, the semantic generator generates the semantic knowledge in ASP. We only give a snippet of the knowledge (due to space constraints) generated from the third sentence of the story (the VerbNet details of the verb grab are given in Figure 1).

    1 contact(t3,during(grab),agent(john),theme(the_apple)).
    2 cause(t3,agent(john),event(grab)).
    3 transfer(t3,during(grab),theme(the_apple)).

Question and ASP Query: For the question "How many objects is John carrying?", the ASP query generator generates a generic query rule and the specific ASP query (it uses the process template for counting).

    count_object(T,Per,Count) :-
      findall(O,property(possession,T,Per,O),Os),
      set(Os,Objects),list_length(Objects,Count).
    ?- count_object(t6,john,Count).

Answer: The s(CASP) system finds the correct answer: 1.
Justification: The justification generated by the s(CASP) system for this answer is shown in Figure 5.

Figure 5: Natural language justification

  The total count of all the objects that john is possessing at time t6 is 1, because
    [the_milk] is the list of all the objects that are possessed by john at time t6, because
      the_milk is possessed by john at time t6, because
        time t6 comes after time t5, and
        the_milk is possessed by john at time t5, because
          time t5 comes after time t4, and
          the_milk is possessed by john at time t4, and
          there is no evidence that the_milk is not possessed by john at time t5.
        there is no evidence that the_milk is not possessed by john at time t6.
    The list [the_milk] is generated after removing duplicates from the list [the_milk], because
      The list [] is generated after removing duplicates from the list [].
    1 is the length of the list [the_milk], because
      0 is the length of the list [].
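The justification in Figure 5 reflects a commonsense law of inertia: an object remains in a person's possession over time unless there is evidence to the contrary. A minimal reconstruction of such a rule is sketched below; it is our reading of the justification, and SQuARE's actual encoding may differ:

    % possession persists to the next time step unless there is
    % evidence to the contrary; '-' denotes classical negation
    property(possession, T1, Per, O) :-
        next_time(T, T1),
        property(possession, T, Per, O),
        not -property(possession, T1, Per, O).

    next_time(t4, t5).   % illustrative timeline facts
    next_time(t5, t6).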
5.3. StaCACK

Conversational AI has been an active area of research, starting from rule-based systems, such as ELIZA [19] and PARRY [20], to the recent open-domain, data-driven CAs like Amazon's Alexa, Google Assistant, or Apple's Siri. Early rule-based bots were based on just syntax analysis, while the main challenge for modern ML-based chatbots is the lack of "understanding" of the conversation. A realistic socialbot should be able to understand and reason like a human. In human-to-human conversations, we do not always tell every detail; we expect the listener to fill the gaps through their commonsense knowledge. Also, our thinking process is flexible and non-monotonic in nature, which means "what we believe today may become false in the future with new knowledge". We can model this human thinking process with (i) default rules, (ii) exceptions to defaults, and (iii) preferences over multiple defaults [4].
Following the discussion above, we have created StaCACK, a general closed-domain chatbot framework. StaCACK is a stateful framework that maintains state by remembering every past dialog between the user and itself. The main difference between StaCACK and other stateful or stateless chatbot models is the use of commonsense knowledge for understanding user utterances and generating responses. Moreover, it is capable of doing non-monotonic reasoning by using defaults with exceptions and preferences in ASP.

Task-specific CAs follow a certain scheme in their inquiry that can be modeled as a finite state machine (FSM). The FSM is illustrated in Figure 6. However, the tasks in each state transition are not simple, as every level requires different types of (commonsense) reasoning.

Figure 6: FSM for the StaCACK framework (states include: understand user intent; ask preferences based on the intent; verify and update the query; execute the query; provide result(s); ask for other details; complete the task and give details)

StaCACK achieves 100% accuracy on the Facebook bAbI dialog dataset suite [21] (including the OOV, i.e., out-of-vocabulary, datasets) of five tasks created for a restaurant reservation dialog system. In addition, StaCACK can answer questions that ML chatbots cannot answer without proper training (details are given in the following sections). We focus on agents that are designed for specific tasks (e.g., restaurant reservation).
Example: StaCACK is able to hold a conversation in a more natural way by using commonsense knowledge, which may not be possible with a rule-based system based on a monotonic logic. The following example shows how StaCACK can understand the cuisine preference of a user, just by reasoning over commonsense information about cuisines (namely, that curry is predominant in Indian and Thai cuisine).

    User:    Good morning.
    StaCACK: Hello what can i help you with today?
    User:    Can you make a restaurant reservation in London in a cheap price range?
    StaCACK: I'm on it.
    User:    <SILENCE>
    StaCACK: Any preference on a type of cuisine?
    User:    Anything, except Lebanese food.
    StaCACK: Do you want to have Chinese food?
    User:    I want to have curry.
    StaCACK: Do you like Indian or Thai?
    User:    Thai.
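The commonsense reasoning behind this exchange can be sketched in ASP as follows (our illustration; StaCACK's actual rules may differ):

    % commonsense: curry is predominant in Indian and Thai cuisine
    serves(indian, curry).
    serves(thai, curry).
    serves(lebanese, shawarma).   % illustrative fact

    % extracted from the dialog
    wants_dish(curry).            % "I want to have curry."
    excluded(lebanese).           % "Anything, except Lebanese food."

    % a cuisine is a candidate if it serves a desired dish and
    % has not been ruled out by the user
    candidate_cuisine(C) :-
        wants_dish(D),
        serves(C, D),
        not excluded(C).

The query ?- candidate_cuisine(C). yields C = indian and C = thai, which is exactly why StaCACK asks "Do you like Indian or Thai?".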
5.4. Experiments and Results

The SQuARE and StaCACK systems have been tested on the bAbI QA dataset [22] and the bAbI dialog dataset [21], respectively. With the aim of improving NLU research, Facebook researchers created the bAbI dataset suite, comprising different NLU application-oriented, simple task-based datasets. The datasets are designed in such a way that it is easy for a human to reason and reach an answer with a proper justification, whereas it is difficult for machines due to their lack of understanding of the language. In the SQuARE system, the accuracy has been calculated by matching the generated answer with the actual answer given in the bAbI QA dataset, whereas StaCACK's accuracy is calculated on a per-response as well as a per-dialog basis. Tables 2 and 3 compare our results in terms of accuracy with the existing state-of-the-art results for the SQuARE and StaCACK systems, respectively.

                          MemNN
Task                    (AM+NG+NL)   Mitra et al.   SQuARE
Single Supporting Fact      100          100          100
Two Supporting Facts         98          100          100
Three Supporting Facts       95          100          100
Two Arg. Relation           100          100          100
Three Arg. Relation          99          100          99.8
Yes/No Questions            100          100          100
Counting                     97          100          100
Lists/Sets                   97          100          100
Simple Negation             100          100          100
Indefinite Knowledge         98          100          98.2
Basic Coreference           100          100          100
Conjunction                 100          100          100
Compound Coreference        100          100          100
Time Reasoning              100          100          100
Basic Deduction             100          100          100
Basic Induction              99          93.6         100
Positional Reasoning         60          100          100
Size Reasoning               95          100          100
Path Finding                 35          100          100
Agent's Motivations         100          100          100
MEAN ACCURACY                94          100          100

Table 2: SQuARE accuracy (%) comparison
              Mem2Seq       BossNet       StaCACK
Task 1        100 (100)     100 (100)     100 (100)
Task 2        100 (100)     100 (100)     100 (100)
Task 3        94.7 (62.1)   95.2 (63.8)   100 (100)
Task 4        100 (100)     100 (100)     100 (100)
Task 5        97.9 (69.6)   97.3 (65.6)   100 (100)
Task 1 (OOV)  94.0 (62.2)   100 (100)     100 (100)
Task 2 (OOV)  86.5 (12.4)   100 (100)     100 (100)
Task 3 (OOV)  90.3 (38.7)   95.7 (66.6)   100 (100)
Task 4 (OOV)  100 (100)     100 (100)     100 (100)
Task 5 (OOV)  84.5 (2.3)    91.7 (18.5)   100 (100)

Table 3: StaCACK accuracy per response (per dialog) in %
6. Social-Bot

Using technology similar to that of the StaCACK system, we have designed and developed the CASPR system, a socialbot designed to compete in the Amazon Alexa Socialbot Challenge 4. CASPR's distinguishing characteristic is that it uses automated commonsense reasoning to truly "understand" dialogs, allowing it to converse like a human. Three main requirements of a socialbot are that it should be able to "understand" users' utterances, possess a strategy for holding a conversation, and be able to learn new knowledge. We developed techniques such as the conversational knowledge template (CKT) to approximate the commonsense reasoning needed to hold a conversation on specific topics.

Our philosophy is to design a socialbot that emulates, as much as possible, the way humans conduct social conversations. Humans employ both learned pattern matching (e.g., recognizing user sentiments) and commonsense reasoning (e.g., if a user starts talking about having seen the Eiffel Tower, we infer that they must have traveled to France in the past) during a conversation. Thus, ideally, a socialbot should make use of both machine learning and commonsense reasoning technologies. Our goal is to use the appropriate technology for each task, i.e., use machine learning and commonsense reasoning for the respective tasks that they are good at. Machine learning is good for tasks such as parsing, topic modeling, and sentiment detection, while commonsense reasoning is good for tasks such as generating a response to an utterance. In a nutshell, we should use machine learning for modeling System 1 thinking and commonsense reasoning for modeling System 2 thinking [23]. We strongly believe that intelligent systems that emulate human ability should follow this approach, especially if we desire true understanding and explainability.

CASPR's conversation planning is centered around a loop in which it moves from topic to topic, and within a topic, it moves from one attribute of that topic to another. Thus, CASPR has an outer conversation loop to hold the conversation at the topmost level and an inner loop in which it moves from attribute to attribute of a topic. The logic of these loops is slightly involved, as a user may return to a topic or an attribute at any time, and CASPR must remember where the user left off in that topic or attribute. For the inner loops, CASPR uses a template, called the conversational knowledge template (CKT), that can be used to automatically generate code that loops over the attributes of a topic, or loops through various dialogs (mini-CKTs) that need to be spoken by CASPR for a given topic.

7. Conclusion and Future Work

In this paper, we discussed our ASP-based approaches to overcoming the challenges of NLU. In the process, we presented a visual question answering framework, AQuA. In the textual QA domain, we introduced our novel semantics-driven generator of answer set programs from English text. Also, we showed how commonsense reasoning coded in ASP can be leveraged to develop advanced NLU applications, such as SQuARE and StaCACK. We make use of the s(CASP) engine, a query-driven implementation of ASP, to perform reasoning while generating a natural language explanation for any computed answer. Finally, we discussed the design philosophy behind our socialbot CASPR and how we qualified to participate in the Amazon Alexa Socialbot Challenge 4. As part of future work, we plan to extend the SQuARE system to handle more complex sentences and eventually handle complex stories. Our goal is also to develop an open-domain conversational AI chatbot based on automated commonsense reasoning that can "converse" with a human based on "truly understanding" that person's utterances.

References

 [1] K. Basu, S. C. Varanasi, F. Shakerin, G. Gupta, Square: Semantics-based question answering and reasoning engine, arXiv preprint arXiv:2009.10239 (2020).
 [2] K. Kipper, A. Korhonen, N. Ryant, M. Palmer, A large-scale classification of english verbs, Language Resources and Evaluation 42 (2008) 21–40. doi:10.1007/s10579-007-9048-2.
 [3] K. Basu, F. Shakerin, G. Gupta, Aqua: Asp-based visual question answering, in: International Symposium on Practical Aspects of Declarative Languages, Springer, 2020, pp. 57–72.
 [4] M. Gelfond, Y. Kahl, Knowledge representation, reasoning, and the design of intelligent agents: The answer-set programming approach, Cambridge University Press, 2014.
 [5] M. Gelfond, V. Lifschitz, The stable model semantics for logic programming, in: ICLP/SLP, volume 88, 1988, pp. 1070–1080.
 [6] J. Arias, M. Carro, E. Salazar, K. Marple, G. Gupta, Constraint answer set programming without grounding, TPLP 18 (2018) 337–354. doi:10.1017/S1471068418000285.
 [7] J. Arias, M. Carro, Z. Chen, G. Gupta, Justifications for goal-directed constraint answer set programming, arXiv preprint arXiv:2009.10238 (2020).
 [8] D. A. Schmidt, Denotational semantics: a methodology for language development, William C. Brown Publishers, Dubuque, IA, USA, 1986.
 [9] B. Levin, English verb classes and alternations: A preliminary investigation, U. Chicago Press, 1993. doi:10.1075/fol.2.1.16noe.
[10] C. Baral, Knowledge representation, reasoning and declarative problem solving, Cambridge Uni. Press, 2003.
[11] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[12] J. Johnson, et al., Inferring and executing programs for visual reasoning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2989–2998.
[13] J. Suarez, J. Johnson, F.-F. Li, Ddrprog: A clevr differentiable dynamic reasoning programmer, arXiv preprint arXiv:1803.11361 (2018).
[14] K. Yi, et al., Neural-symbolic VQA: Disentangling reasoning from vision and language understanding, in: NIPS'18, 2018, pp. 1031–1042.
[15] D. Davidson, Inquiries into truth and interpretation: Philosophical essays, volume 2, Oxford University Press, 2001.
[16] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, in: IEEE CVPR'17, 2017, pp. 2901–2910.
[17] K. Basu, S. Varanasi, F. Shakerin, J. Arias, G. Gupta, Knowledge-driven natural language understanding of english text and its applications, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 12554–12563.
[18] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP NLP toolkit, in: ACL System Demonstrations, 2014, pp. 55–60. doi:10.3115/v1/P14-5010.
[19] J. Weizenbaum, ELIZA—a computer program for the study of natural language communication between man and machine, CACM 9 (1966) 36–45.
[20] K. M. Colby, S. Weber, F. D. Hilf, Artificial paranoia, Artificial Intelligence 2 (1971) 1–25.
[21] A. Bordes, Y.-L. Boureau, J. Weston, Learning end-to-end goal-oriented dialog, arXiv preprint arXiv:1605.07683 (2016).
[22] J. Weston, et al., Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks, arXiv preprint arXiv:1502.05698 (2015).
[23] D. Kahneman, Thinking, fast and slow, Macmillan, 2011.