<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Natural Language Understanding in Large Language Models by Symbolic Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bingqian Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baiyang Song</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Zhou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ShanghaiTech University</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science and Technology of China</institution>
          ,
          <addr-line>Hefei</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the Symbolically Enhanced Neural Inference Framework (SENIF), which enhances the natural language understanding (NLU) capabilities of large language models (LLMs) such as GPT-4 by combining them with symbolic representations. The proposed method aims to improve the performance of LLMs by enabling them to infer over formalized statements. The framework employs Assertional Logic (AL) as its foundational representation. Initially, the framework translates natural language utterances into logical expressions after developing a Concept-Operator (CO) diagram for the domain. We propose a zero-shot parser that enables smaller language models to yield high-quality parsing results for a given CO diagram. We then design a Chain-of-Thought (CoT) prompt that takes both the original text and the parsing results from the preceding step as inputs. Experimental results show that LLMs like GPT-4 can greatly benefit from these high-quality parsing results. Our framework substantially improves GPT-4's performance, elevating the most challenging measure, C@90, by 46.67% (40% → 86.67%). We have also verified its feasibility for modeling in different fields and for medium-sized language models. This research provides a promising direction for enhancing the inference capabilities of large language models.</p>
      </abstract>
      <kwd-group>
<kwd>Domain Knowledge</kwd>
        <kwd>Semantic Parsing</kwd>
        <kwd>Symbolic Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Natural Language Understanding (NLU) is a challenging
task, even for the most advanced and powerful language
models. This task entails a comprehensive understanding,
often requiring not only the syntactic structure of the language
but also semantic meanings, contextual cues, and pragmatic
factors. This intricate nature of language comprehension
presents a formidable challenge even for large models such
as ChatGPT or GPT-4.</p>
      <p>
        Human comprehension of the world is a synthesis of
perception and cognition, indicating that our understanding is
not purely based on data-driven processes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Rather, it
involves a combination of learned knowledge, experiences,
and symbolic reasoning. Therefore, it stands to reason that
incorporating symbolic representations into large language
models may enhance their language understanding
capabilities [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. By integrating symbolic representations,
models may be able to better encode and utilize abstract,
high-level concepts and relationships inherent in language.
      </p>
      <p>
        Both formal reasoning and language models exhibit
imperfections in language understanding. Formal reasoning,
despite its proficiency in concept comprehension and
inference, is often hindered by generalization issues, impeding
its practical application. In contrast, large language models,
despite their expansive coverage, often fail to accurately
capture complex reasoning processes, limiting their reliability.
We could even say that the accuracy of language models in
machine reading comprehension tasks relies more on
suitable QA pairs than on a genuine understanding of the
question. This point is emphasized and robustly tested by
the ZEST benchmark, which is why we have chosen to focus
our efforts on this dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
<p>In light of these challenges, we first use the CO diagram,
based on assertional logic, to achieve a symbolic
representation of domain prior knowledge; we then use a CoT
prompt-based approach to incorporate it into the neural network.
This method integrates the generalization and fuzzy-matching
capabilities of language models with the precision of
formal representations, and it significantly
improves model performance on tasks related to language
understanding.</p>
      <p>
        Moreover, to efficiently obtain formal representations in
an open domain, we present a semantic parser for assertional
logic [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This algorithm confers several advantages,
including swift cross-domain migration, ease of improvement,
and independence from annotated data. Addressing these
core challenges in the field of semantic parsing is of utmost
importance.
      </p>
      <p>To validate our claims, we apply our proposed
methodology to approximately 200 examples extracted from the ZEST
benchmark. We further annotate about 400 assertions in
assertional logic to evaluate the performance of our zero-shot
parser. Meanwhile, we used a subset of ZEST for rapid
automatic modeling, and fine-tuned LLaMA3 on the data
parsed from this CO diagram. Our experiments show two
key insights: 1) formal reasoning is an essential complement
to neural inference (40.00% → 73.33%); 2) high-quality
parsing results are key to benefiting the language model
(40.00% → 86.67%). Our approach is effective for quick-and-dirty
domain modeling and also for fine-tuning
moderate-sized models. However, if the parsing and reasoning
processes are suboptimal, they may significantly decrease
performance in Machine Reading Comprehension (MRC)
(30.00% → 6.67% for turbo).</p>
      <p>In conclusion, our contributions are as follows:
1. We introduce the Symbolically Enhanced Neural
Inference Framework (SENIF), which mimics the way
humans process semantics and cleverly combines the
powerful capabilities of language models with
symbolic representations. This innovative blend
leverages the former’s generalization and
fuzzy-matching capabilities, along with the precision
of the latter, to markedly improve model performance
on NLU tasks.
2. A semantic parser for assertional logic is proposed to
facilitate the efficient translation of natural language
into formal representations in an open domain. It
achieves state-of-the-art performance on a semantic
parsing dataset annotated with assertional logic.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Concept-Operator Diagram</title>
        <p>The Concept-Operator Diagram (CO diagram) is a graphical
representation of a knowledge representation model that is
based on assertional logic. In this logic, knowledge is
represented in the form of "a = b", where a and b are either
atomic individuals or compound individuals. There are three
components of its syntax: individual, concept, and operator.
Concepts are represented as rectangles in the diagram, while
operators are represented as diamonds. Since individuals
only represent specific instances of concepts, they are not
typically included in a CO diagram.</p>
        <p>Figure 1 is an illustration of the CO diagram. The concept
is represented by a rectangle and the operator by a diamond,
and we capitalize concept names for the sake of
distinguishing between concepts and operators, especially when written
as logical expressions. In this figure, ’NUMBER’ refers to
the set of numbers in mathematics, such as 1, 5.201, and 13,
while ’addition’ represents a logical operation, a logical
relation, or a map from LHS to RHS. The logical
expression corresponding to Figure 1 is addition (NUMBER,
NUMBER) = NUMBER. The semantics is that the sum of
two numbers equals another number. An example of this
operator is 2 + 3 = 5.
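        <p>For concreteness, the assertion in Figure 1 can be written down as plain data. The following sketch is illustrative only; the class names (Concept, Operator) and their fields are our assumptions, not the paper's implementation.</p>

```python
# Minimal sketch of an assertional-logic assertion: an operator applied
# to concept-typed arguments equals a concept-typed result.
# Names (Concept, Operator) are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str            # e.g. "NUMBER"; drawn as a rectangle

@dataclass(frozen=True)
class Operator:
    name: str            # e.g. "addition"; drawn as a diamond
    arg_concepts: tuple  # ordered inputs; order matches the numbered arrows
    result: Concept

NUMBER = Concept("NUMBER")
addition = Operator("addition", (NUMBER, NUMBER), NUMBER)

# An individual-level instance of the assertion: addition(2, 3) = 5.
instance = {"operator": addition.name, "args": (2, 3), "equals": 5}
```

        <p>The ordered arg_concepts tuple plays the role of the numbered arrows in the diagram, fixing which concept is the first input and which the second.</p>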
        <p>Concepts and operators can be nested and considered as
individuals as well. Additionally, the CO diagram serves
assertional logic, which possesses at least higher-order
expressiveness. This allows for representing complex
relationships and rules, like the Pythagorean theorem, that are
challenging for tuple-based KBs.</p>
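        <p>Because operators can be nested and compound terms are themselves individuals, a rule such as the Pythagorean theorem reduces to a single equation over nested terms. A minimal sketch, assuming a tuple encoding ("op", args...) of compound individuals:</p>

```python
# Sketch: compound individuals as nested tuples ("op", args...).
# Pythagorean theorem as addition(square(a), square(b)) = square(c).
def square(x):
    return ("square", x)

def addition(x, y):
    return ("addition", x, y)

def evaluate(term):
    """Evaluate a nested term bottom-up; numbers evaluate to themselves."""
    if isinstance(term, tuple):
        op, args = term[0], [evaluate(t) for t in term[1:]]
        if op == "square":
            return args[0] * args[0]
        if op == "addition":
            return args[0] + args[1]
    return term

lhs = addition(square(3), square(4))   # a compound individual
rhs = square(5)
```

        <p>Both sides evaluate to the same number for the 3-4-5 triple, which is exactly the kind of equation a flat tuple store cannot express.</p>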
        <p>Compared to traditional entity-relationship (E-R) models,
the CO model has several advantages. The E-R model can
only describe existing data, while the CO model is capable
of expressing logical relationships between concepts, such
as that classic example 2 + 3 = 5. This logical expression
is difficult to represent in an E-R diagram but can be easily
represented in a CO diagram. The numbers on the arrows
in the CO diagram indicate the order of the concepts in the
operator, with "2" being the first input and "3" being the
second input in the example of 2 + 3 = 5.</p>
<p>The CO model is an expressive model that enhances
traditional data models by enabling reasoning and inference
capabilities, overcoming the traditional model's inability
to perform inference. This
enables the CO model to be used for modeling various types
of concepts and their relationships to describe wide-ranging
knowledge.</p>
        <p>The CO diagram is a powerful tool for representing knowledge in a way that is both intuitive and expressive. It allows logical relationships to be expressed clearly and concisely.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Pipeline</title>
        <p>The whole steps of the proposed SENIF are shown in Figure
2. To enhance the performance of the traditional method
that leverages language models for NLU tasks, our research
introduces symbolic representations and simple reasoning
into the existing framework. The central hypothesis is that
by infusing these two elements, the model can handle the
higher-level, abstract thoughts that often elude pre-trained language
models, thereby improving overall performance.
• Domain-specific CO diagram We construct a
domain-specific CO diagram based on the collected
domain information text, which contains the
necessary meta-knowledge for a domain.
• Parsing based on CO diagram Our parsing
procedure is conducted against a predefined
domain-specific CO diagram, as shown in Figure 2a and
Figure 3.</p>
        <p>To allow for generalization, we have designed a
zero-shot parser to handle it (Figure 2b). We treat the
parsing task as a combination of Named Entity
Recognition (NER) and MRC tasks.
• Integrating symbolic representation and
reasoning We incorporate an additional
semantic parsing dimension into the existing inputs of
question and context, and we designed a
chain-of-thought prompt that effectively integrates these
three inputs (question, context, and semantic parsing
results) for further analysis, as illustrated in Figure
2c.</p>
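        <p>The three steps above can be sketched as a composition of stages. Every function body below is a stand-in stub for illustration; none of the return values or the prompt wording comes from the paper's implementation.</p>

```python
# SENIF pipeline sketch: diagram construction, zero-shot parsing, CoT prompt.
# All three stages are illustrative stubs.
def build_co_diagram(domain_text):
    # Stage 1: abstract concepts/operators from domain text (stubbed).
    return {"concepts": ["PERSON", "DEGREE"], "operators": ["degree_obtained"]}

def zero_shot_parse(diagram, context):
    # Stage 2: NER + MRC against the diagram (stubbed).
    return ["degree_obtained (Trump, 1968) = bachelor"]

def build_cot_prompt(question, context, parses):
    # Stage 3: merge question, context, and parses into one CoT prompt.
    facts = "\n".join(parses)
    return f"Question: {question}\nContext: {context}\nParsed facts:\n{facts}\nReason step by step, then answer."

diagram = build_co_diagram("domain corpus text")
parses = zero_shot_parse(diagram, "Trump received a bachelor's degree in 1968.")
prompt = build_cot_prompt("What academic credentials does this president hold?",
                          "Trump received a bachelor's degree in 1968.", parses)
```

        <p>The point of the sketch is the data flow: the diagram constrains the parser, and the parser's output becomes a third input alongside question and context.</p>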
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Domain-specific CO Diagram</title>
        <p>To begin with, we need to build a corpus from https://www.
whitehouse.gov/about-the-white-house/presidents/ to model
the presidential domain of the ZEST benchmark, which
contains concise but essential information about the presidents.
The information gathered from the website can be used to
abstract the core concepts and extract the relationships, called
operators. This information is in natural language format
and does not require any annotation or processing. The
operators help algorithms understand how the different concepts
relate to each other and integrate
domain-specific knowledge.
        <p>Based on this corpus, we use both manual processing
and large language model automatic processing to abstract
concepts and operators from natural language, and expand
outward with different conceptual relationships, ultimately
establishing a model that covers this field and meets
modeling quality standards.</p>
        <p>The criteria for modeling quality include less semantic
information loss, simplicity, etc. We will now explore some
of these criteria in detail to help understand how they can be
achieved in modeling the education experience of presidents.</p>
        <!-- Figure 2 panel captions: a) Simplified modeling; b) Zero-shot parser; c) Adding the parsing and reasoning process -->
        <p>The first example concerns less semantic information loss.
Compare "resident_place (PERSON) = PLACE,
resident_period (PERSON) = PERIOD" with the correct
"resident_info (PERSON, PERIOD) = PLACE" for contexts like
"The family lived in Lamar until Harry was ten months old".
The first form loses the dependency between a certain
place and a certain period. In other words, the inference
system will be confused if there are multiple places and periods
of residence.</p>
        <p>For simplicity, too many variables would make the model
difficult to extract from and infer over. For instance, "school_of
(PERSON) = SCHOOL and belong_to (CLASS) = SCHOOL"
is better than "class_of (PERSON, SCHOOL) = CLASS"
because the information of the latter can be derived from
the simpler former. Another example is "birth_date
(PERSON) = DATE and birth_place (PERSON) = PLACE" versus
"birth_info (PERSON, DATE) = PLACE". We prefer the
first form because the two have the same semantics, as a life
is lived only once.</p>
        <p>Achieving all quality criteria simultaneously
is near impossible. We need to balance them well to
achieve the best model. This balance differs across
fields, and it requires experimentation in the modeled field.</p>
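        <p>The residency example can be made concrete: keying residence by the (person, period) pair preserves the dependency that two separate operators lose. The dictionaries and the periods below are illustrative assumptions, not annotated data.</p>

```python
# Sketch: why resident_info (PERSON, PERIOD) = PLACE avoids ambiguity.
# Lossy modeling: two independent operators lose the place/period pairing.
resident_place = {"Harry": ["Lamar", "Independence"]}
resident_period = {"Harry": ["1884-1885", "1885-1890"]}
# Which place goes with which period? The lossy form cannot say.

# Correct modeling: one operator keyed by the (person, period) pair.
# Periods here are illustrative.
resident_info = {
    ("Harry", "1884-1885"): "Lamar",
    ("Harry", "1885-1890"): "Independence",
}

def place_during(person, period):
    """Look up the residence for a specific person and period."""
    return resident_info.get((person, period))
```

        <p>With multiple residences, only the joint key recovers an unambiguous answer.</p>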
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Zero-shot Semantic Parser</title>
        <p>
          Most existing semantic parsing datasets are limited to parsing
short sentences and single facts [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. Although MIVS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
has introduced a semantic parsing dataset for multiple facts, it
is essentially a compilation of single-fact datasets, making it
relatively mechanical and challenging to apply to real-world
scenarios. So we developed a simple zero-shot semantic parser.
        </p>
        <p><bold>3.3.1. Two-stage algorithm</bold></p>
        <p>This paper presents a semantic parsing process that is
controlled by a given CO diagram and designed for an
open-domain task. This parsing process is difficult to accomplish
using traditional algorithms, or even advanced language
models such as ChatGPT or Davinci without fine-tuning.</p>
        <p>
We use a two-stage algorithm. In the first stage, we utilize
an open-domain named entity recognition (hereafter referred
to as OpenNER) model to recognize individuals of certain
concepts; in the second stage, an MRC system is applied
to fill variables for operators related to the
concepts identified in stage one. The MRC process is based on
automatically generated question templates. This two-stage
approach allows us to capture the relationships between
individuals more accurately and efficiently. In
this paper, we use UIE [
          <xref ref-type="bibr" rid="ref9">9</xref>
] and DeBERTa-v3-large-squad2
as the base models.
        </p>
        <p><bold>3.3.2. Templates for the MRC step</bold></p>
        <p>The MRC system starts with a pre-compiled set of templates,
where each template corresponds to a specific operator. The
MRC system can answer questions like "Who is six feet
tall?" by using the template "Who is [HEIGHT] tall?", which
corresponds to questions asking for the person with
a particular height. It is therefore necessary to construct
templates automatically in a zero-shot scenario.</p>
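        <p>A minimal sketch of the two-stage parse, with the OpenNER and MRC calls replaced by toy lookups (in practice, UIE and DeBERTa-v3-large-squad2 fill these roles); the operator, the templates, and the example strings are assumptions for illustration.</p>

```python
# Two-stage zero-shot parsing sketch. Stage 1 tags individuals with
# concepts; stage 2 fills the remaining operator variables via MRC.
def open_ner(context):
    # Stage 1 stub: recognize individuals with their concepts.
    hits = []
    for word, concept in [("Trump", "PERSON"), ("bachelor", "DEGREE"), ("1968", "PERIOD")]:
        if word in context:
            hits.append((word, concept))
    return hits

def mrc_answer(question, context):
    # Stage 2 stub: span-extraction QA; "Who" questions return the person.
    if question.startswith("Who"):
        return "Trump"
    return "1968"

def parse(context, operator="degree_obtained",
          templates=("Who obtained a [DEGREE] degree?",
                     "When was the [DEGREE] degree obtained?")):
    entities = dict((c, w) for w, c in open_ner(context))
    degree = entities["DEGREE"]                       # anchor from stage 1
    person = mrc_answer(templates[0].replace("[DEGREE]", degree), context)
    period = mrc_answer(templates[1].replace("[DEGREE]", degree), context)
    return f"{operator} ({person}, {period}) = {degree}"

assertion = parse("Trump received a bachelor's degree in 1968.")
```

        <p>The stage-one entity anchors the templates, and each MRC answer fills one operator variable at a time.</p>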
        <p>Capitalizing on the advancements in in-context learning,
it has become feasible to generate question-answer templates
for each operator, thus completing the final step towards
constructing a parser for a given CO diagram with almost
complete automation and without annotations.</p>
        <p>The generation process is prompted by the combination of
instruction, chain-of-thought, and standard prompting, which
we have found to achieve an appropriate balance between
quality and variety. We present a brief overview of this
schema in Table 1. We found that this combination is better
than only using instruction or the chain-of-thought prompt
with more examples.</p>
        <p>In fact, incorrect templates outnumber correct ones during
the generation process. Fortunately,
some hard constraints can be employed to detect all
faults when using the prompt shown in Table 1:
• The number of question templates for each operator
should be equal to the number of concepts that need
to be filled.
• Every question template is only permitted to use
concepts with known values because they are queried
one by one.</p>
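        <p>Both hard constraints can be checked mechanically. The validator below is a sketch; the [CONCEPT] slot syntax and the argument layout are our assumptions, not the paper's exact template format.</p>

```python
import re

def validate_templates(templates, queried_concepts, initially_known):
    """Hard-constraint check for generated question templates (a sketch;
    the [CONCEPT] slot syntax is an assumption).
    templates: one question per concept that still needs filling.
    queried_concepts: the concepts those templates ask for, in order.
    initially_known: concepts whose values stage one already provides."""
    # Constraint 1: as many templates as concepts that need filling.
    if len(templates) != len(queried_concepts):
        return False
    known = set(initially_known)
    for template, target in zip(templates, queried_concepts):
        used = set(re.findall(r"\[([A-Z_]+)\]", template))
        # Constraint 2: only concepts with known values may appear,
        # because the concepts are queried one by one.
        if not used.issubset(known):
            return False
        known.add(target)   # the answer becomes known for later templates
    return True
```

        <p>Because each answered concept joins the known set, a later template may legitimately reuse an earlier answer, which is exactly the one-by-one querying order described above.</p>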
        <p>The complete generation process involves the following
steps:
1. Set the temperature to 0.0 and maximal tries to 20.
2. Alternate between the text-davinci-003 and
gpt-3.5-turbo models to generate the templates.
3. Verify the results using the aforementioned hard
constraints. If the templates do not pass the test, the
temperature is increased by 0.1 and the process is
repeated.
4. Repeat steps 2–3 until the correct question templates
are generated or the maximal number of tries is
reached.</p>
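        <p>The four steps can be sketched as a retry loop. Here generate_fn stands in for the real davinci/turbo API call and validate for the hard-constraint check, so the sketch commits only to the loop logic, not to any model interface.</p>

```python
def generate_with_retries(generate_fn, validate, max_tries=20):
    """Sketch of the template-generation loop: start at temperature 0.0,
    alternate between two models, and bump temperature by 0.1 whenever
    the hard constraints reject the output."""
    temperature = 0.0
    models = ["text-davinci-003", "gpt-3.5-turbo"]
    for attempt in range(max_tries):
        model = models[attempt % 2]          # step 2: alternate models
        templates = generate_fn(model, temperature)
        if validate(templates):              # step 3: hard constraints
            return templates
        temperature = round(temperature + 0.1, 1)
    return None                              # step 4: give up after max tries
```

        <p>Raising the temperature only after a failed validation trades determinism for variety exactly when the deterministic output has been rejected.</p>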
        <p>As a result of this schema, correct templates can always be generated once they pass the constraints, with only two operators failing. The absence of templates for a few operators is insignificant in practice.</p>
        <p>Moreover, the davinci model is more reliable than the
turbo model in precise scenarios, which is consistent with
observations when the two are used as baselines for zero-shot
semantic parsing.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Case Study for the Symbolically Enhanced Neural Inference Framework</title>
        <p>Finally, we present a case study to walk through the whole
pipeline of SENIF (Figure 2). Consider the question "What
academic credentials does this president hold?" and the context
"Trump received a bachelor’s degree in 1968.".</p>
        <p>Suppose that we have constructed a CO diagram (Figure 2a);
the zero-shot parser will then extract the structural
information by the two-stage algorithm (Figure 2b):</p>
        <p>1. Identify the degree concept and its individual ’bachelor’, and turn to fill "degree_obtained (PERSON, PERIOD) = DEGREE".</p>
        <p>2. Query the MRC models with automatically generated templates and obtain the symbolic representation "degree_obtained (Trump, 1968) = bachelor".</p>
        <p>Next, the generative models receive the question,
context, and symbolic representations as inputs (Figure 2c). The
inference process is then completed in five steps: identifying
the primary information, selecting the relevant knowledge,
synthesizing the original context with the parsing results,
performing reasoning, and finally, providing the answer.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets and Metrics</title>
        <p>Datasets In order to demonstrate the practical significance of
our framework and preliminarily explore the potential of
integrating symbolic logic reasoning with large language models,
we selected a subset of approximately 200 question-answer
pairs from the ZEST dataset to test within the specific
domain that we manually modeled. With its innovative
scoring mechanism (C@K) and challenging problem design,
ZEST effectively measures the performance of models in
truly understanding the questions, rather than merely
obtaining correct answers by chance due to input pairs that happen
to fit the model well. Meanwhile, because our
methodology depends on parsing quality, we also need a dataset for
analyzing parsing quality.</p>
        <p>Due to the lack of a publicly available benchmark to assess
the performance of semantic parsing for assertional logic, our
study has undertaken the annotation of a dataset of 400
assertions to serve as the test dataset. Notably, our approach to
semantic parsing does not require training datasets.
To improve the reliability of the evaluation, the annotation
differs in some details; see Appendix B.1.</p>
        <p>Furthermore, to quickly verify the effectiveness of our
method in other fields, we selected all questions matching the
prompt words from the training set of the ZEST benchmark,
and used a large model for zero-shot modeling (different from
the previous manual-plus-automatic modeling), covering
questions in various fields such as presidents, national
parks, and dog breeds. We tested about 800 question-answer
pairs under this modeling to verify the versatility of
our method.</p>
        <p>
          Metrics for the NLU task In line with the metrics
employed in the foundational study by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], we utilize Mean F1,
C@75, and C@90 for assessment. In this benchmark, each
question is associated with around 20 ⟨context, answer⟩ pairs.
The Mean F1 denotes the average F1 score, while C@A is
a specialized evaluation metric under which an algorithm
receives a score of 1 only if the average F1 score across the
approximately 20 ⟨context, answer⟩ pairs surpasses A%.
Metrics for parsing task We present our findings by
comparing the precision and recall measures, using the exact match
condition, as employed in the SQuAD 2.0 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] benchmark.
Specifically, we perform a variable-wise matching of all
assertions, assigning a score of 1 when they’re the same and 0
otherwise. The maximal score across all the gold assertions
is then determined as the final score. It should be noted that a
score of 0 is assigned in instances where the operators do not
match, as this implies a lack of consistency in the underlying
semantics.
        </p>
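        <p>The Mean F1 and C@A metrics, and the variable-wise assertion score, follow directly from the description above. The sketch below is not the benchmark's official scorer; in particular, aggregating variable matches as a fraction is our assumption.</p>

```python
def mean_f1(f1_scores):
    """Average F1 over one question's ~20 pairs."""
    return sum(f1_scores) / len(f1_scores)

def c_at(f1_groups, threshold):
    """C@A: a question scores 1 only if its mean F1 surpasses the
    threshold (A% expressed in [0, 1]); return the fraction credited."""
    credited = [1 for group in f1_groups if mean_f1(group) > threshold]
    return sum(credited) / len(f1_groups)

def assertion_score(predicted, golds):
    """Variable-wise exact match: 0 if the operator differs (inconsistent
    semantics), else the fraction of matching variables; take the max
    over all gold assertions. Assertions are (operator, variables)
    pairs - an assumed encoding."""
    best = 0.0
    for gold in golds:
        if predicted[0] != gold[0]:
            continue   # operator mismatch: score 0 for this gold
        matches = sum(1 for p, g in zip(predicted[1], gold[1]) if p == g)
        best = max(best, matches / len(gold[1]))
    return best
```

        <p>Taking the maximum over gold assertions mirrors the rule that a prediction is credited against its best-matching annotation.</p>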
        <p>Due to the zero-resource constraint, we have employed
NER and QA models to extract facts that align with the
semantics of the original context. We do not refine these
facts by considering whether they correspond to the original
sentences or merely possess similar semantics. For instance,
given the context "Alice is the mother of Bob." the facts
"mother_of (Bob) = Alice" and "child_of (Alice) = Bob" are
both correct, although the latter is not an original sentence.
However, this inherent deficiency does not have any practical
implications and can even be regarded as advantageous, as it
alleviates the difficulties associated with reasoning.</p>
        <p>In order to incorporate these accurate facts into the
computation of precision and recall metrics, an inference system
has been developed to augment the given parsing outcomes.
A notable observation is that more extensive language
models yield a greater quantity of supplementary facts. This can
be ascribed to the superior inference capabilities of larger
models, which possess the ability to generate novel facts
when processing contexts.</p>
        <p>The details of the inference system and ablation
experiments are shown in section B.2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baselines</title>
        <p>In this study, we evaluate our proposed algorithm by
comparing it with the state-of-the-art baselines of ZEST (BART
and T5) and the most powerful generative models:
Text-Davinci-003, GPT-3.5-Turbo, and GPT-4, all renowned for
their few-shot and zero-shot learning capabilities. To ensure
a fair comparison and reproducibility, we maintain similar
parameters and prompts across the different GPT-family models,
including temperature (0.0), max_tokens (2048), and a ’\n’
stop marker. The complete prompts used can be found in
Appendix C.2. The training details of BART and T5 can be
found in Appendix B.3.</p>
        <p>Due to the non-determinism of generative models, we
repeated each experiment three times and report the mean
value.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. NLU task performance</title>
        <p>To demonstrate the superiority of our proposed SENIF in
enhancing the language understanding capabilities of large
models, we conducted a comparison with the advanced
generative models in the NLU task. In the experiments, we
employed three types of prompts:
• Using a few-shot prompt, requiring the model to
directly respond to the question;
• Utilizing a CoT prompt, which necessitates that the
model first parse input through formal expressions,
followed by inference and response. We anticipate
that this methodology will enhance both the
reliability and interpretability of reading comprehension
tasks.
• Using almost the same prompt, but replacing the
parsing results with those of our zero-shot parser (SENIF).</p>
        <p>As evidenced in Table 2, our scheme outperforms the
baseline method considerably on the test examples. It is
important to note that our proposed approach does not focus
solely on reading comprehension tasks; it merely views them as
one means of validating its effectiveness. The success
reveals the feasibility of integrating symbolic logic with neural
network-based inference.</p>
        <p>Second, it can be observed that the prompt requiring the
model to first parse input before answering the question
yields weaker results compared to the simple prompt for
davinci and turbo. We believe this can be attributed to two
main factors:
• The second type of prompt does not provide sample
data for the model to learn from the context;
• Insufficiently skilled and reliable parsing results may
interfere with the model’s output.</p>
        <p>However, it is worth noting that by replacing the parsing
step with our algorithm’s parsing results, a significant
improvement can be achieved. We believe this demonstrates the
potential of incorporating symbolic reasoning to enhance
the inference reliability of language models (the ZEST dataset
assesses whether the model genuinely comprehends the
questions). However, this improvement relies on high parsing
accuracy, an observation that parallels
CoT’s success, which depends on the model’s
accuracy in terms of consistency and fact-based output.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation of Semantic Parsing</title>
        <sec id="sec-5-2-1">
          <p>To verify the relationship between our method and parsing
quality, we tested the parsing quality of the GPT family and
of our method. Table 3 presents an overview of the performance
of semantic parsing by the GPT models and ours.</p>
          <p>In our experiments, the proposed model with only about
700M parameters demonstrates a significant performance
improvement, achieving approximately a 40.40% increase in
precision compared to turbo while surpassing the recall
performance of davinci by 15.23%. Notably, the turbo
and davinci models struggle to achieve high precision and
recall scores simultaneously, whereas our model attains
state-of-the-art results in both aspects.</p>
          <p>We attribute this enhancement primarily to the alignment
between the assertional logic and our structure. More
importantly, these results suggest the potential for driving
existing knowledge representation towards greater complexity
and controllability (stemming from the construction of the
modeling process), ultimately aiding in constructing a more
sophisticated knowledge base. This approach holds promise
to address challenges faced in knowledge computation that
arise from inconsistencies between knowledge
representation and knowledge bases, as well as reducing high resource
demands for semantic parsing associated with specific or
complex languages.</p>
          <p>To show the relationship between NLU and parsing
performance, we plot the performance difference on the ZEST
dataset before and after incorporating the parsing step, with
respect to the performance of baseline models on parsing
data. From Figure 4, a positive correlation can be observed:
parsing results with high precision are a key element for the
validity of the extra formal steps, and a comparison of Figure 4a
and Figure 4b shows that precision is more important than recall.
This finding provides further evidence supporting the
claim that our framework relies on the precision of symbolic
representation, in conjunction with the fuzzy-matching
capabilities of large language models, to enable broader reasoning.
This observation is in line with our initial hypothesis.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Generalization Experiment</title>
        <p>To quickly and comprehensively validate the generalizability
of our approach, we automatically model additional domains
from the ZEST benchmark in zero-shot scenarios. Our
approach to achieving rapid domain-specific modeling involves
the following steps:
• Entity Extraction: Identification of all entities
within the text for subsequent concept formation.
• Entity to Concept: Abstraction of entities into
specific real-world concepts. For example, the entity
"red" is abstracted into the concept "COLOR".
• Relation Extraction: Identification and extraction of
relevant relationships between the extracted entities
and their corresponding concepts.</p>
        <sec id="sec-5-3-1">
          <p>To enhance the quality of modeling, we applied filter
conditions to the final results using the prompts detailed in
Appendix C.3. We counted the frequency of all concepts,
removing concepts and corresponding operators that appeared
too infrequently. Additionally, we filtered out operators with
identical meanings based on semantic similarity.</p>
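          <p>The two filters can be sketched as follows; the token-overlap similarity is a simple stand-in for the semantic similarity actually used, and the counts and thresholds are illustrative.</p>

```python
from collections import Counter

def filter_model(extractions, min_count=2, dedupe_threshold=0.6):
    """Sketch of the two post-filters for automatic modeling.
    extractions: list of (operator_name, concepts) pairs, repeats allowed.
    Token-overlap similarity is a stand-in for semantic similarity."""
    # Filter 1: drop concepts (and their operators) seen too infrequently.
    concept_counts = Counter(c for _, concepts in extractions for c in concepts)
    frequent = sorted(set(op for op in extractions
                          if all(concept_counts[c] >= min_count for c in op[1])))

    def similarity(a, b):
        # Jaccard overlap of name tokens, split on underscores.
        ta, tb = set(a.split("_")), set(b.split("_"))
        return len(ta.intersection(tb)) / len(ta.union(tb))

    # Filter 2: drop operators whose names near-duplicate a kept one.
    kept = []
    for op in frequent:
        if not any(similarity(op[0], k[0]) >= dedupe_threshold for k in kept):
            kept.append(op)
    return kept
```

          <p>Counting concept frequency before deduplicating keeps the rare-concept filter independent of the similarity threshold, so the two filters can be tuned separately.</p>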
          <p>As shown in Table 4, our method consistently achieves
optimal results even with rough modeling. This not only
verifies the superior generalization capability of our approach but
also highlights the potential of combining symbolic language
with large language models.</p>
          <p>At the same time, we analyzed the reasons for the
decline in performance when the method was extended to other fields:
the quality of zero-shot modeling is significantly
worse than that of the manually constructed, precise domain
CO diagram, with obvious problems such as semantic loss
and high complexity. For example, for the sentence
"Malamutes were thought to be bred by the Malemiut Inupiaq
people of Alaska’s Norton Sound region.", automatic
modeling tends to focus more on the main part of the sentence,
that is, modeling "(ANIMAL) be_bred_by(PERSON)" from
the sentence, but there is another important semantics in this
sentence: (PERSON) live_in(PLACE). These situations lead
to a drop in performance in other areas, which also verifies
the importance of high-quality domain knowledge in model
reasoning.</p>
          <p>Furthermore, to show that other models can also
combine symbols to improve their language understanding
ability, we fine-tune LLaMA3 with LoRA, using
a zero-shot parser to parse data built from automatically
generated CO diagrams. We use the zero-shot parser to process
a subset of the training set in the ZEST benchmark, a total
of 700 question-answer pairs, and use this as the fine-tuning
dataset. We fine-tune LLaMA3 in two forms: question-answer
pairs (Q/A) and question-answer pairs plus our parsing
results (Q/A/R). As Table 5 shows, our method
continues to achieve superior performance with the fine-tuned
LLaMA3, suggesting that models can benefit from domain
knowledge or structured knowledge.</p>
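The two fine-tuning formats (Q/A and Q/A/R) can be sketched as a record builder; the field names and prompt layout here are illustrative assumptions, not the exact format used in the experiments:

```python
def to_finetune_records(examples, include_parse=False):
    """Format ZEST question-answer pairs for instruction tuning.
    With include_parse=True, the zero-shot parser output is appended
    to the prompt (the Q/A/R setting); otherwise only the question
    and context are used (the Q/A setting)."""
    records = []
    for ex in examples:
        prompt = f"Question: {ex['question']}\nContext: {ex['context']}"
        if include_parse:
            # Append the parsing results produced by the zero-shot parser.
            prompt += f"\nSemantic parsing results: {ex['parse']}"
        records.append({"prompt": prompt, "completion": ex["answer"]})
    return records
```

The resulting records can then be fed to any LoRA fine-tuning setup; the Q/A and Q/A/R runs differ only in the `include_parse` flag.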
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Related work</title>
      <p>
        NLU in symbolic AI AI systems based on logic are skilled
at reasoning and have a deep understanding of concepts.
Past researchers tried to construct elaborate representation
frameworks such as knowledge bases [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] and axiom systems
for highly specialized domains like pouring water [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
However, these systems struggle with the issue of
over-generalization and are difficult to acquire.
      </p>
      <p>
        NLU in LLMs On the other hand, language models have
powerful universal capabilities for many downstream tasks
[
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], but they lack a true understanding of the world and
are weak at reasoning [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ], [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ]. LLMs may merely rely
on surface patterns [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], suitable input pairs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or shortcuts
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] to infer, without truly understanding the background
context.
      </p>
      <p>
        Symbolic-enhanced systems Therefore, researchers have
made numerous efforts to combine traditional AI with
language models. Approaches include neuralizing rule-based
systems [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ], neural module networks [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ], soft or hard
symbolic constraints [
        <xref ref-type="bibr" rid="ref26 ref3">26, 3</xref>
        ], formal reasoning-based systems
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], and so on. Despite these attempts, these methods have
yet to successfully combine the advantages of symbolism and
connectionism, often relying too heavily on one over the other.
We believe that the most beneficial
elements of these two technology pathways are the fuzzy
matching capability of large language models and the high
precision of symbolic systems. Our work focuses on merging
these elements within advanced generative models. We use
symbolic representation to provide precise knowledge and
language models to enable universal inference.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We have explored an innovative approach (SENIF) for
augmenting the comprehension capabilities of large language
models. Our findings suggest that integrating symbolic
representation into LLMs significantly improves NLU ability,
offering promising directions for future advancements in the
field.</p>
      <p>Further, the introduction of a zero-shot parser designed
for the CO diagram is another significant contribution of our
work. The parser’s capacity for quick cross-domain
migration, ease of enhancement, and independence from annotated
data make it a potent tool for translating natural language
into formal representations, a critical step in improving NLU
tasks.</p>
      <p>We conduct empirical validation on the NLU examples and
our own annotated semantic parsing dataset. The results offer
strong evidence of our approach’s efficacy, while our findings
also underscore its potential for cross-domain applicability.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations</title>
      <p>Our approach works well in zero-shot scenarios and naturally
benefits from improvements to NER and MRC models
without additional effort. However, using
information extraction for approximate semantic parsing
suffers from limited reasoning efficiency, redundant
extractions, and the inherent gap between extraction and
parsing, which hinders further scaling in size and accuracy.
Meanwhile, our zero-shot parsing algorithm is sensitive to
scale: when facing large-scale domain-knowledge CO
diagrams, its complexity slows down reasoning.</p>
      <p>Furthermore, the challenge of multi-step reasoning tasks
remains unresolved for large language models. It is therefore
imperative to pursue further investigations based on the
proposed framework in order to integrate the capabilities of
large language models more deeply into the reasoning process.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Modeling results for CO diagram</title>
      <sec id="sec-9-1">
        <title>A.1. Concepts and Operators</title>
        <p>Concepts are shown in Table 6, while operators are shown in Table 7.</p>
        <sec id="sec-9-1-1">
          <title>Example operator semantics</title>
          <p>Major studied by a person during a period;
nominate someone for a profession during a period;
number of children of a person;
number of grandchildren of a person;
whether a person is currently married;
whether a person is currently divorced;
start time of a time period;
terminal time of a time period;
year, month, and day of a date;
award received by a person;
date when a person received an award.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>B. Details of evaluation</title>
      <sec id="sec-10-1">
        <title>B.1. Restriction for operators</title>
        <p>Certain operators may possess ambiguities that are not aligned with the annotation standard. For instance, the alias_of operator
is designed to capture distinct names used by an individual in varying periods or circumstances, such as nicknames, former
names, pseudonyms, etc. However, we notice that the full name and its abbreviation may also be regarded as the alias of a
person, as exemplified by Barack Hussein Obama II, Barack Hussein Obama, Barack Obama, and Obama. Recording such
information may be meaningless and challenging to label without omissions. Consequently, these operators are omitted when
calculating the precision and recall score.</p>
        <p>Meanwhile, two operators encountered failure during the template generation step: "succeeded_by" and
"someone_nominate_someone_for_profession". To make a fair comparison without manual intervention, we refrained from creating
the corresponding question templates. As a result, these two operators were excluded from the evaluation.</p>
      </sec>
      <sec id="sec-10-2">
        <title>B.2. Inference system</title>
        <p>We utilize 29 rules about family relationships and personal information to generate complete semantics; please see Table 8.
Table 9 indicates the relevant ablation experiments.</p>
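As a rough sketch of how such rules can expand parsed facts into complete semantics, the following naive forward-chaining loop applies one illustrative family-relationship rule (the rule shown is hypothetical, not necessarily one of the 29 in Table 8):

```python
def apply_rules(facts, rules):
    """Naive forward chaining: repeatedly apply each rule until no
    new facts are derived. Facts are (operator, arg, value) triples."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for f in rule(facts) - facts:
                facts.add(f)
                changed = True
    return facts

def grandfather_rule(facts):
    # father_of(x) = y and father_of(y) = z entails
    # grandfather_of(x) = z.
    fathers = {(a, b) for op, a, b in facts if op == "father_of"}
    return {("grandfather_of", x, z)
            for x, y in fathers for y2, z in fathers if y == y2}
```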
      </sec>
      <sec id="sec-10-3">
        <title>B.3. Training settings for BART and T5</title>
        <p>
          For BART-large, we use the same setup as in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, for the T5-3B and 11B models, as we did not have access to
TPUs, we replicated the experiments using 4x3090 24G GPUs and 2xA800 80G GPUs. We observed that, when running
under these resource constraints, the setup described in the paper (employing 16x8 TPUs) yielded poor results (even worse
than BART-large). We therefore opted for an alternative configuration that produced the best performance for these two
baselines. Specifically, an initial learning rate of 5e-5 was employed for 3 epochs during training (in fact, the best
performance for T5-11B was reached after two epochs). Moreover, we used an effective batch size of 32, implemented as batch size 1
with gradient_accumulation_steps=32; we found that other configurations could prevent T5 from converging, so training is
significantly limited by memory.
        </p>
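The effective batch size of 32 obtained from batch size 1 with gradient_accumulation_steps=32 relies on the identity that accumulating per-example gradients scaled by 1/steps, then stepping once, equals one large-batch step. A minimal numeric sketch (plain Python, no framework):

```python
def large_batch_grad(grads):
    # Mean gradient over one batch of per-example gradients.
    return sum(grads) / len(grads)

def accumulated_grad(grads, accumulation_steps=32):
    # Micro-batches of size 1: scale each gradient by
    # 1/accumulation_steps and accumulate; the optimizer would then
    # step once per `accumulation_steps` micro-batches.
    total = 0.0
    for g in grads:
        total += g / accumulation_steps
    return total
```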
      </sec>
    </sec>
    <sec id="sec-11">
      <title>C. Complete prompts</title>
      <sec id="sec-11-1">
        <title>C.1. Templates generation prompt</title>
        <p>We generate MRC templates with the prompt provided in Table 10:</p>
      </sec>
      <sec id="sec-11-2">
        <title>C.2. Parsing baselines</title>
        <p>The prompt used for semantic parsing task is given in Table 11.</p>
      </sec>
      <sec id="sec-11-3">
        <title>C.3. Auto modeling</title>
        <p>The templates we use for automatic modeling are provided in Table 12.</p>
      </sec>
      <sec id="sec-11-4">
        <title>C.4. Downstream task baselines</title>
        <p>In this section, we present the prompts utilized for the baseline and our semantic parsing results, with the differences between
these two prompts highlighted in red for easy identification (Table 13 and Table 14). Our objective was to facilitate fair
comparisons; thus, we intentionally introduced only subtle discrepancies in the first set of prompts. These modifications were
primarily focused on incorporating our parsing results into the first prompt.</p>
        <p>Moreover, because appending a lengthy text (i.e., parsing results) at the end of a prompt may confuse the
language model and cause it to lose track of its tasks, we incorporated reminders ("Follow above ... the question") to maintain
consistency and ensure that all steps are successfully executed.</p>
        <p>For the few-shot prompt, please see Table 15 below.</p>
        <p>Your aim is given the question templates for every function and its variables. For example, the input ’age_of’: [’PERSON0’,
’AGE1’] indicate the formula age_of (PERSON0) = AGE1. The semantics of this formula is "The age of PERSON0 is AGE1". The
value [’PERSON0’, ’AGE1’] are the elements related to this function. Concretely, the last one (’AGE1’ in this case) is always
defined as the output while others ([’PERSON0’] in this case) are input (s) for the function. You can always suppose that we’ve
gotten the value of the output (’AGE1’ in this case), so just write question templates for the previous elements ([’PERSON0’] in
this case) one by one and from first to the last but one. Only after an question template is given, we can suppose that value
can be obtained and use it in next template. Otherwise, you only can use the output variable and input variables with given
templates. To avoid the confusion about two same variable names, an unique id from 0 to n-1 is added to the end of variable
names. The question order is 0, 1,... n-2 when we have n elements.</p>
        <p>For instance, you can only use AGE1 in the first step to construct template for PERSON0 (no 0-th variable in your first template).
After that, you are allowed to use 0-th variable and output variable to design the next template (no 1-th variable in your second
template) if there are any variables left over, and so on.</p>
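The ordering constraint described above can be made precise: the template at step i may mention the output variable (always assumed known) plus the inputs whose templates have already been generated. A minimal sketch, with variable names taken from the majored_in example:

```python
def usable_variables(variables, step):
    """Variables a question template at position `step` may mention:
    the output (last element) is always assumed known, and input i
    becomes available only after its own template (written at step i)."""
    inputs, output = variables[:-1], variables[-1]
    return [output] + inputs[:step]
```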
        <p>Q: write question template for formula: ’age_of’: [’PERSON0’, ’AGE1’]
A: [’Whose age is age1?’]
Q: write question template for formula: ’birth_date’: [’PERSON0’, ’DATE1’]
A: [’Who was born in date1?’]
Q: write question template for formula: ’death_date’: [’PERSON0’, ’DATE1’]
A: [’Who died in date1?’]
Q: write question template for formula: ’GetHeight’: [’PERSON0’, ’HEIGHT1’]
A: [’Whose height is height1 tall?’]
Q: write question template for formula: ’degree_obtained’: [’PERSON0’, ’PERIOD_S1’, ’PERIOD_T2’, ’DEGREE3’]
A: [’Who has recieved degree3?’, ’When did person0 start degree3?’, ’When did person0 recieved degree3?’]
Q: write question template for formula: ’majored_in’: [’PERSON0’, ’PERIOD_S1’, ’PERIOD_T2’, ’DEGREE3’, ’MAJOR4’]
A: [’Who majored in major4?’, ’When did person0 start to study major4?’, ’When did person0 graduate in major4?’, ’What
degree was person0 study for major4 in period_s1-period_t2?’]
Q: write question template for formula: ’school_located_in’: [’SCHOOL0’, ’PLACE1’]
A: [’Which school is located in place1?’]
Q: write question template for formula: ’borned_in’: [’PERSON0’, ’PLACE1’]
A: [’Who borned in place1?’]
Q: write question template for formula: ’father_of’: [’PERSON0’, ’PERSON1’]
A: [’Whose father is person1?’]</p>
        <p>Q: write question template for formula: ’mother_of’: [’PERSON0’, ’PERSON1’]</p>
        <p>Please parsing the given context into structured data by preset templates. The information that cannot be covered by templates
should be ignored. I’ll give you the templates and some examples. Then you should parsing the next context.
Note that you are only allowed to use the words or phrases in the context.</p>
        <p>The templates are following:
{’degree_obtained’: [’PERSON0’, ’PERIOD_S’, ’PERIOD_T’, ’DEGREE’]}
...</p>
        <p>Context: Nixon’s visit to China in 1972 eventually led to diplomatic relations between the two nations.
Answer: [’operator’: ’visited_place’, ’PERSON0’: ’Nixon’, ’PERIOD_S1’: ’1972’, ’PERIOD_T2’: ”, ’PLACE3’: ’China’]
Context: The black bear is a common inhabitant of Olympic National Park, and North America, in general.
Answer: [’operator’: ’inhabitant_animal_of’, ’PLACE0’: ’Olympic National Park’, ’ANIMAL1’: ’black bear’,’operator’:
’inhabitant_animal_of’, ’PLACE0’: ’North America’, ’ANIMAL1’: ’black bear’]
Context: Dachshunds have a wide variety of colors and patterns, the most common one being red.</p>
        <p>Answer: [’operator’: ’common_color_of’, ’ANIMAL0’: ’Dachshunds’, ’COLOR1’: ’red’]</p>
        <p>Context: Six Trump campaign advisers and staff were indicted and five pled guilty to criminal charges. Answer: []</p>
        <p>Please imitate the following example to extract operators from the given text, and only the answers are output, taking into
account all meanings of the coverage statement.</p>
        <p>The process of operator extraction is as follows: 1. First identify the named entities in the sentence. For example, for the sentence
’Dachshunds have a wide variety of colors and patterns, the most common one being red’, identify Dachshunds, red. 2. To
extend entities into categories, try to think of extensions to larger and actually existing categories in nature, such as Dalmatians
not extending to dogs but to animals, and Lincoln Park not extending to parks but to places. Don’t use non-existent concepts
like "danger" and "DESCRIPTION". 3. Extract the operator in the form operator(CONCEPT, ...) = CONCEPT, for example
common_color_of(ANIMAL) = COLOR:
Context: Dwight David ’Ike’ Eisenhower ( EYE-zn-how-r; October 14, 1890 – March 28, 1969), GCB, OM was an American
army general and statesman who served as the 34th president of the United States from 1953 to 1961.
Answer: [’birth_date(PERSON) = DATE’, ’death_date(PERSON) = DATE’, ’profession_of(PERSON,PERIOD_START,
PERIOD_TERMINAL) = PROFESSION’, ’which_president_rank_of(PRESIDENT) = RANK’]
Context: The most common animals observed around Rim Drive are golden-mantled ground squirrels, Canada jays and an
assortment of butterflies and bees. Black bear sightings are more common in autumn and late spring.</p>
        <p>Answer: [’common_observed_in(PERIOD,PLACE) = ANIMAL’]
Context: The English White Terrier is the failed show ring name of a pricked-ear version of the white fox-working terriers that
have existed in Great Britain since the late 18th century.</p>
        <p>Answer: [’existed_in(PERIOD,ANIMAL) = PLACE’, ’ring name(ANIMAL) = ANIMAL’]
Context: Black bear \u2013 Ursus americanus. The black bear is a common inhabitant of Olympic National Park, and North
America, in general. They are smaller and darker than the grizzly bear and the brown bear. Females typically weigh between
100 and 400 lbs, while males weigh between 250 and 600 lbs.</p>
        <p>Answer:</p>
        <p>Given the question ’{question}’ and the original context ’{context}’, please:
1. Identify the main concepts and relationships involved in the question. Provide the semantic parsing results of the context
based on this ontologies, in first-order logic form.
2. Select necessary information from both the semantic parsing results and the original context.
3. Compare the information from these two sources. If there is a discrepancy, resolve it by deciding which source is likely to be
more accurate.
4. Combine the verified pieces of information and present your line of formal reasoning in first order logic.
5. Output the answer without any extra details by "Answer:{answer}" format. The answer should be yes, no, n/a or a brief phrase
from the input words based on the question and context. n/a means no answer.</p>
        <p>For a given question ’{question}’, the original context ’{context}’, and corresponding semantic parsing results (at the end),
please:
1. Identify the main concepts and relationships involved in the question.
2. Select necessary information from both the semantic parsing results and the original context.
3. Compare the information from these two sources. If there is a discrepancy, resolve it by deciding which source is likely to be
more accurate.
4. Combine the verified pieces of information and present your line of formal reasoning in logic.
5. Output the answer without any extra details by "Answer:{answer}" format. The answer should be yes, no, n/a or a brief phrase
from the input words based on the question and context. n/a means no answer.</p>
        <p>Semantic parsing results:{parsing_results}</p>
        <p>Follow above five steps exactly to complete the question</p>
        <p>Give an answer from ’yes, no, n/a’, or a brief phrase from the input words based on question and context, n/a means no answer.
Question: After leaving office, where did this president go to retire?
Context: Dwight David ’Ike’ Eisenhower ( EYE-z\u0259n-how-\u0259r; October 14, 1890 \u2013 March 28, 1969), GCB, OM
was an American army general and statesman who served as the 34th president of the United States from 1953 to 1961.
Following the war, he served under various generals and was promoted to the rank of brigadier general in 1941. After the United
States entered World War II, Eisenhower oversaw the invasions of North Africa and Sicily before supervising the invasions of
France and Germany. After the war, he served as Army Chief of Staff (1945\u20131948), as president of Columbia University
(1948\u20131953) and as the first Supreme Commander of NATO (1951\u20131952). While Eisenhower was stationed in
Texas, he met Mamie Doud of Boone, Iowa. Eisenhower was mostly reluctant to discuss his death. Their second son, John
Eisenhower (1922\u20132013), was born in Denver, Colorado. John served in the United States Army, retired as a brigadier
general, became an author and served as U.S.</p>
        <p>Answer: n/a
Question: Are bear sightings common at this national park?
Context: Black bear \u2013 Ursus americanus. The black bear is a common inhabitant of Olympic National Park, and North
America, in general. They are smaller and darker than the grizzly bear and the brown bear. Females typically weigh between
100 and 400 lbs, while males weigh between 250 and 600 lbs.</p>
        <p>Answer: yes
Question: Are bear sightings common at this national park?
Context: This area is thickly forested. Moose and, less commonly, bears can be seen if they are near the road; otherwise,
wildlife sightings are fairly rare. The road rises up to Mile 9, eventually breaking out of spruce forest and into a low alpine zone
of tall bushes and sporadic trees. Moose frequent this stretch during the autumn (mid-August to mid-September). Caribou and
bears can occasionally be seen, especially toward Savage River (Mile 15). Mountain-dwelling critters, like marmots, pika and
Dall sheep are sometimes seen on Healy Ridge and Mount Margaret.</p>
        <p>Answer: no
Question: What public offices did this president run for and win?
Context: Johnson won election to the United States Senate from Texas in 1948 after winning the Democratic Party’s nomination
by an extremely narrow margin with fraudulent votes that were manufactured by friendly political machines. He was appointed
to the position of Senate Majority Whip in 1951. He became the Senate Minority Leader in 1953 and the Senate Majority
Leader in 1955. At the same time as his vice presidential run, Johnson also sought a third term in the U.S. Senate. According
to Robert Caro, ’On November 8, 1960, Lyndon Johnson won election for both the vice presidency of the United States, on the
Kennedy\u2013Johnson ticket, and for a third term as senator (he had Texas law changed to allow him to run for both offices).
Answer: United States Senate|vice presidency</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahowald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Blank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , E. Fedorenko,
          <article-title>Dissociating language and thought in large language models: a cognitive perspective</article-title>
          ,
          <source>arXiv preprint arXiv:2301.06627</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Grounded conversation generation as guided traverses in commonsense knowledge graphs, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2031</fpage>
          -
          <lpage>2043</lpage>
          . URL: https://aclanthology.org/2020.acl-main.184. doi:10.18653/v1/2020.acl-main.184.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pryor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Esc: Exploration with soft commonsense constraints for zero-shot object navigation</article-title>
          ,
          <source>arXiv preprint arXiv:2301.13166</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lourie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <article-title>Learning from task descriptions</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1361</fpage>
          -
          <lpage>1375</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.105. doi:10.18653/v1/2020.emnlp-main.105.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>From first-order logic to assertional logic</article-title>
          , in: T. Everitt, B. Goertzel, A. Potapov (Eds.),
          <source>Artificial General Intelligence</source>
          , Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moradshahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tsai</surname>
          </string-name>
          , G. Campagna,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <article-title>Contextual semantic parsing for multilingual taskoriented dialogues</article-title>
          ,
          <source>arXiv preprint arXiv:2111.02574</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Semantic parsing for task oriented dialog using hierarchical representations</article-title>
          , <source>arXiv preprint arXiv:1810.07942</source> (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Chen,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A birgat model for multi-intent spoken language understanding with hierarchical semantic frames</article-title>
          ,
          <source>arXiv preprint arXiv:2402.18258</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          L. Sun, H. Wu,
          <article-title>Unified structure generation for universal information extraction</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5755</fpage>
          -
          <lpage>5772</lpage>
          . URL: https://aclanthology.org/2022.acl-long.395. doi:10.18653/v1/2022.acl-long.395.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Know what you don't know: Unanswerable questions for squad</article-title>
          ,
          <source>arXiv preprint arXiv:1806.03822</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Lenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shepherd</surname>
          </string-name>
          ,
          <article-title>Cyc: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks</article-title>
          ,
          <source>AI Magazine</source>
          <volume>6</volume>
          (
          <year>1985</year>
          )
          <fpage>65</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <article-title>Conceptnet 5.5: An open multilingual graph of general knowledge</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>31</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Pouring liquids: A study in commonsense physical reasoning</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>172</volume>
          (
          <year>2008</year>
          )
          <fpage>1540</fpage>
          -
          <lpage>1578</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Logical formalizations of commonsense reasoning: a survey</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>59</volume>
          (
          <year>2017</year>
          )
          <fpage>651</fpage>
          -
          <lpage>723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Tulio</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Sparks of Artificial General Intelligence: Early experiments with GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2303.12712</source>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2303.12712.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>How well do Large Language Models perform in Arithmetic tasks?</article-title>
          ,
          <source>arXiv preprint arXiv:2304.02015</source>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2304.02015.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>“going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3363</fpage>
          -
          <lpage>3369</lpage>
          . URL: https://aclanthology.org/D19-1332. doi:10.18653/v1/D19-1332.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Durt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Froese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <article-title>Against ai understanding and sentience: Large language models, meaning, and the patterns of human language use</article-title>
          ,
          <year>2023</year>
          . URL: http://philsci-archive.pitt.edu/21983/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Understanding natural language understanding systems. a critical analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2303.04229</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Why machine reading comprehension models learn shortcuts?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>989</fpage>
          -
          <lpage>1002</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.85. doi:10.18653/v1/2021.findings-acl.85.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Generalize symbolic knowledge with neural rule engine</article-title>
          ,
          <source>arXiv preprint arXiv:1808.10326</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Cold-start and interpretability: Turning regular expressions into trainable recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3193</fpage>
          -
          <lpage>3207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <article-title>Neural module networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic vqa: Disentangling reasoning from vision and language understanding</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <article-title>Reasoning about actions and state changes by injecting commonsense knowledge</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          . URL: https://aclanthology.org/D18-1006. doi:10.18653/v1/D18-1006.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajasekharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Padalkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Reliable natural language understanding with large language models and answer set programming</article-title>
          ,
          <source>arXiv preprint arXiv:2302.03780</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>