<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompt Memoried LLMs with 'What, Why, and How' to Reason Implicit Emotions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Cheng</string-name>
          <email>weicheng@cau.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengyu Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hejin Liu</string-name>
          <email>hejin_liu@cau.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hailin Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenglin Cai</string-name>
          <email>caizh@cau.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Wen</string-name>
          <email>wenjuan@cau.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>China Agricultural University</institution>
          ,
          <addr-line>No. 17 Qinghua East Road, Haidian District, Beijing 100083</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>26</fpage>
      <lpage>38</lpage>
      <abstract>
<p>Implicit Sentiment Analysis (ISA) aims to capture emotions that are subtly expressed, often obscured by ambiguity and literary devices. This requires a shift in thinking based on memorized prior knowledge (past cases) and multi-stage logical reasoning to uncover the hidden emotions of the speaker. Inspired by human memory and reasoning, we propose a What, Why, and How three-question (3Q) framework that incorporates a memory mechanism for past cases. We design a three-step prompt principle for it: based on domain priors, first identify what the most crucial entity is, then infer why the speaker mentioned it, and finally uncover what the hidden emotions are. During the reasoning process, historical queries and responses are stored in memory as past case pairs. These pairs can be used to retrieve and generate improved prompts for any new queries, thereby enhancing the implicit sentiment analysis capabilities of LLMs. In addition, we compile a more practical and complex benchmark for ISA tasks. It spans multiple domains and includes bilingual corpora in both English and Chinese. Our framework is universal and minimalist, and achieves a new state-of-the-art by significantly outperforming previous methods in both zero-shot and fine-tuning settings. It also significantly reduces the tendency to over-interpret emotions.</p>
      </abstract>
      <kwd-group>
        <kwd>implicit sentiment analysis</kwd>
        <kwd>large language models</kwd>
        <kwd>prompt engineering</kwd>
        <kwd>case-based reasoning</kwd>
        <kwd>memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sentiment analysis (SA) aims at detecting the sentiment polarity of a given text. SA can be divided into
explicit SA (ESA) and implicit SA (ISA), with the former being the mainstream task where sentiment
expressions are explicitly present in the text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In contrast to ESA, ISA is more challenging because
sentences in ISA may not contain factual descriptions that directly express clear opinions or sentiments
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There may also be problems with ambiguous referential components. For example, given the text
‘No one can stop him from rampaging like a bull.’ as shown in Figure 1, ‘him’ could refer to a particular
athlete or a friend of the speaker. Some sentences may also contain literary devices such as irony,
metaphor, quotes, rhetorical questions, etc., which serve semantic expression needs but also pose
challenges for sentiment analysis tasks.
      </p>
      <p>
        Traditional research on implicit sentiment analysis has leaned on paradigms such as attention
mechanisms and feature extraction [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ]. However, these methods struggle to effectively process
implicit sentiment data. As we can see in the example, the phrase ‘like a bull’ is a simile used to describe
a person’s behavior. In contrast, humans can infer the sentence sentiment as positive by referring to
prior knowledge (past cases) and engaging in multi-step questioning and reasoning processes, given a
certain possible domain, such as ‘competition’. Based on human experience in performing sentiment
analysis, we argue that there should be two critical elements for tackling ISA tasks: memorized prior
knowledge and multi-step logical reasoning.
      </p>
      <p>
        The advent of large language models (LLMs) such as ChatGPT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has enabled significant progress in
realizing the above two key factors. LLMs are recognized not only for their extensive prior knowledge
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but also for their ability to maintain consistent dialogue. This makes it possible to equip them
with the memory of past cases [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Some studies [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] have also shown the remarkable common
sense understanding and logical reasoning of LLMs. In addition, the Chain-of-Thought (CoT) method
has revealed LLMs’ potential for complex logical reasoning [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], which can be harnessed through
strategic prompting. Inspired by these ideas, THOR [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] designed a least-to-most prompt process to get
LLMs to explore aspects, opinions, and sentiment polarity of sentences.
      </p>
      <p>While THOR has made progress in ISA tasks, there are cases where it is not effective.
This problem motivates us to rethink the ISA task from the perspective of simulating human memory
and reasoning. First and foremost, domain knowledge is essential for judging emotional polarity.
Consider the sentence ‘No one can stop him from rampaging like a bull.’ again, its interpretation and
associated emotions vary significantly between contexts. In a competitive setting, this might praise an
athlete’s impressive performance and convey a positive sentiment. Conversely, in daily life, it might
criticize someone’s careless behavior, reflecting a negative sentiment. Without the relevant domain or
context, humans would also experience cognitive ambiguity in decision making. Furthermore, people
will subconsciously recall past cases and use them to identify the most crucial entity that evokes emotion
in a sentence (e.g., ‘him, which may refer to an athlete’). They then use empathy to understand why the
speaker mentioned that entity by probing for its true intent (e.g., ‘The speaker wants to highlight this
entity’s determined and unstoppable nature in relation to competition.’). Finally, based on the previous
inferences and past cases, we can discern the underlying emotions (e.g., ‘which can be seen as positive
attributes’).</p>
      <p>Based on our findings, we propose a universal What, Why, and How three-question (3Q) framework
with a memory function for past cases. It is structured in three steps: based on domain priors, 1) identify
what the most crucial entity in the sentence is, 2) infer why the speaker wants to mention it, 3) uncover
what the hidden emotion is. During the above process, historical queries and responses are stored in
memory as past case pairs. These pairs allow the retrieval and creation of improved prompts for any new
query. This minimalist approach, which mirrors human memory and reasoning, effectively captures
the essence of sentence emotions and reveals hidden meanings, thereby simplifying the judgment of
sentiment polarity.</p>
      <p>Moreover, to evaluate the effectiveness of the proposed framework, we compile a Chinese-English
bilingual dataset derived from several official datasets. Experimental results show that the 3Q
framework sets a new state-of-the-art (SOTA) benchmark in both supervised fine-tuning and zero-shot
settings. Furthermore, it significantly outperforms THOR and Direct on F1 and neutral F1 metrics, while
mitigating the negative effects of excessive emotion interpretation.</p>
      <p>To sum up, our contributions are:
• We propose a What, Why, and How three-question (3Q) framework with a memory mechanism
for past cases. It mimics human memory and reasoning and puts LLMs in the shoes of speakers
for ISA tasks. It refines ISA tasks into three simple steps, each coupled with a memory function:
based on domain priors, 1) identify what the most crucial entity is in the sentence, 2) infer why
the speaker wants to mention it, 3) uncover what the implied emotions are.
• We compile a more practical and complex benchmark for ISA tasks, which includes both English
and Chinese corpora, covering 28 domains or scenes.
• Extensive evaluations on the benchmark report SOTA results in both zero-shot and fine-tuning
settings. Experimental results show that the 3Q framework can effectively handle various complex
situations, such as over-interpretation of emotions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Implicit Sentiment Analysis (ISA)</title>
        <p>
          Current research on the ISA task predominantly centers on deep-learning-based methods. For example,
the SCAPT model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] utilizes supervised contrastive pretraining to better capture implicit and explicit
sentiment orientations towards specific aspects. The CLEAN model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], using causal intervention,
eliminates confounding causal effects in the corpus, suppresses sentiment polarity judgments influenced
by explicit cues, and extracts pure causal effects between sentences and emotions. However, these
studies primarily focus on extracting features from the plain meaning of the text using a combination of
neural networks and attention mechanisms, overlooking the essential role of human logical reasoning
and extensive external knowledge in addressing the ISA task.
        </p>
        <p>We observe that many implicit sentiment analysis models, including THOR, are based on traditional
fine-grained sentiment analysis, such as Aspect-Based Sentiment Analysis (ABSA) tasks. We find that,
by referencing these fine-grained sentiment analysis tasks, decomposing the sentiment analysis of the
corpus into key emotional elements such as ‘aspect words’ can improve the accuracy of sentiment
analysis. However, ABSA requires manual annotation of aspect words, which is difficult to achieve in
large datasets. Inspired by human memory and reasoning, we propose 3Q, a What, Why, and How
three-question framework that integrates a memory function for past cases. It is minimalist and universal,
facilitating efficient implicit sentiment analysis.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large Language Models (LLMs) with Case-Based Reasoning (CBR)</title>
        <p>
          There are several common methods of case-based reasoning with LLMs. Few-shot prompting involves
incorporating case demonstrations into prompts to guide the model toward better performance [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Retrieval-augmented generation (RAG) is used to address complex and knowledge-intensive tasks,
where an external knowledge system is established to access case examples from external knowledge
sources to aid in inference [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Previous research has combined the above techniques to enable LLMs
to tackle more complex problems such as mathematical applications and ethical reasoning [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]. For
more complex ISA tasks, however, these approaches are difficult to apply. This is because many implicit
sentences involve multiple ambiguous entities and even complex literary techniques like metaphor,
irony, hyperbole, and quotation. To deal with these situations effectively, we need not only classical
paradigms to refer to and learn by transfer, but also multi-step logical reasoning [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to dive into implicit
emotions.
        </p>
        <p>
          The rise of Chain-of-Thought (CoT) prompting has been effective in alleviating these problems. It enhances
the multi-step reasoning ability of LLMs by inducing models to simulate human step-by-step reasoning
processes to reach conclusions [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ]. Research on ISA based on CoT with LLMs is limited. The only such model,
THOR, employing a three-hop reasoning pattern, induces implicit aspects and opinions in
the corpus and elicits sentiment polarity judgments on the language [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. However, THOR requires
excessive prior target information in the corpus. It is idealistic and restrictive because real-world
sentences often lack specific targets and require manual annotation. In addition, since THOR emphasizes
Least-to-Most prompting, each conversation is independent, which in turn affects the reasoning capabilities
of LLMs [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The above issues prompt us to rethink the ISA task in terms of human memory and
reasoning. We propose a What, Why, and How three-question (3Q) framework that incorporates a
memory function for past cases. Not only does it effectively mimic the way humans think when faced
with an ISA task, but it also drastically reduces emotional over-interpretation. In addition, the memory
mechanism allows LLMs to learn classic paradigms from past cases. This helps to uncover hidden
emotions.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. 3Q Framework with a Memory Mechanism</title>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>The task of Implicit Sentiment Analysis (ISA) is to determine the sentiment polarity of a given 
sentence, categorizing it as positive, neutral, or negative. A standard approach without memory is to
use a Direct Prompt template as input for LLMs, typically in the following form:</p>
        <p>Given the sentence s, what is the sentiment polarity towards it?</p>
        <p>LLMs should return the answer via ŷ = argmax p(y | s), where y ranges over the three polarity labels.</p>
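        <p>As an illustrative sketch, the Direct baseline can be written as follows. Note that query_llm is a hypothetical helper standing in for whatever chat API is in use (e.g., a Llama 2-CHAT or gpt-3.5-turbo call); it is not part of the framework itself.</p>
        <p>
```python
# Minimal sketch of the Direct prompt baseline: one question,
# no memory, no multi-step reasoning.

def build_direct_prompt(sentence: str) -> str:
    """Direct prompt template: a single sentiment question about the sentence."""
    return f"Given the sentence '{sentence}', what is the sentiment polarity towards it?"

def direct_predict(sentence: str, query_llm) -> str:
    """Extract a polarity label from the model's free-form response."""
    response = query_llm(build_direct_prompt(sentence)).lower()
    for label in ("positive", "negative", "neutral"):
        if label in response:
            return label
    return "neutral"  # fall back when no explicit label is produced
```
</p>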
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation of Memory</title>
        <p>In the 3Q framework, users sequentially ask three questions q_i (i = 1, 2, 3) to query the model for
sentiment analysis of the input sentence s. At each step, the question q_i and the corresponding
model-generated answer a_i are stored in memory M as a past case pair. If there are past case pairs
in memory, LLMs will automatically retrieve them from memory and integrate them into the current
question. The memory implementation consists of the following components:</p>
        <p>Memory M: Memory M is a dynamically expanding table managed by the LLM itself that contains
past case pairs. The i-th case pair is denoted as c_i(q_i, a_i). The memory M contains all these past case
pairs for future reference. It supports the store operation M.append(c_i) and the fetch operation M.fetch().
These operations can be achieved using simple key-value lookups and prompt concatenation.
Figure 2 omits the mathematical formulas of these two operations for simplicity.</p>
        <p>Past Case Pairs PC: All existing cases stored in memory M can be concatenated to construct
PC, which represents the totality of past information. It can be represented as: PC = c_1(q_1, a_1) +
c_2(q_2, a_2) + ... + c_i(q_i, a_i). PC can also be combined with the next question q_{i+1} as a new query.</p>
        <p>Combiner C(PC, q_{i+1}): It supports combining the past case pairs PC and a question q_{i+1} into a new
query.</p>
        <p>The prompt-based memory can be structured as follows:
&lt;s&gt; [INST] {q_1} [/INST] {a_1} &lt;/s&gt;
&lt;s&gt; [INST] {q_2} [/INST] {a_2} &lt;/s&gt;
&lt;s&gt; [INST] {q_3} [/INST] {a_3} &lt;/s&gt;</p>
        <p>It is worth noting that the structure of the prompt-based memory will vary slightly to accommodate
the different prompt formats required by different LLMs.</p>
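        <p>The store, fetch, and combine operations described above can be sketched as a small class. The class and method names here are our own illustrative choices, and the [INST] rendering assumes a Llama 2-CHAT-style chat template; model-specific special tokens are omitted for brevity.</p>
        <p>
```python
# Sketch of the prompt-based memory: past case pairs c_i = (q_i, a_i),
# a fetch operation that concatenates them into PC, and a combiner that
# appends the next question as a new query.

class Memory:
    def __init__(self):
        self.case_pairs = []  # list of (question, answer) tuples

    def append(self, question: str, answer: str) -> None:
        """Store operation M.append(c_i): add one past case pair."""
        self.case_pairs.append((question, answer))

    def fetch(self) -> str:
        """Fetch operation M.fetch(): concatenate all past case pairs into PC."""
        return "".join(f"[INST] {q} [/INST] {a}\n" for q, a in self.case_pairs)

    def combine(self, next_question: str) -> str:
        """Combiner C(PC, q_{i+1}): past cases plus the next question as a new query."""
        return self.fetch() + f"[INST] {next_question} [/INST]"
```
</p>
        <p>For example, after storing (q_1, a_1) and (q_2, a_2), memory.combine(q_3) yields a single prompt carrying both past cases followed by the new question.</p>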
      </sec>
      <sec id="sec-3-3">
        <title>3.3. 3Q Framework</title>
        <p>In our novel 3Q framework, we aim to have the LLM identify the most crucial entity ê that influences
sentence sentiment, then ask it to put itself in the speaker’s shoes to uncover the reason r̂. Finally, the
LLM should determine the sentiment polarity ŷ based on these prior inferences. We break this process
down into three steps as follows:</p>
        <p>What. We first provide the sentence s along with its corresponding field d, and then we ask the
LLM to identify the most critical entity e within the sentence:</p>
        <p>q_1. Given the sentence s, the field of it is d. What is the most crucial entity described in it?
This step is represented as ê = argmax p(e | q_1(s, d)), where ê is the model’s inference of the entity
e, explicitly stating what the entity is and its possible reasons. Note that the entity needn’t be explicitly
stated in the sentence; instead, it can be generalized and abstracted based on the semantic content of
the sentence. After this step, the query and answer are stored in memory in the form of a case pair:
M = M.append(c_1(q_1, ê)).</p>
        <p>Why. Now, based on PC_1, we position the model from the speaker’s perspective. In this step, we ask
the LLM to uncover the reasons why the speaker mentioned this particular entity ê:
q_2. Why is this entity mentioned in this sentence?</p>
        <p>This step can be formulated as: r̂ = argmax p(r | PC_1, q_2), where r̂ represents the model’s inference
regarding the reason for mentioning the entity. After this step, the query and response are stored in the
form of a case pair in the memory: M = M.append(c_2(q_2, r̂)).</p>
        <p>How. With PC_2 as the premise, we prompt the LLM to infer the sentiment polarity y as the final
outcome:
q_3. Based on the above reasons, how would you describe the sentiment polarity towards the
sentence?</p>
        <p>This step can be formulated as: ŷ = argmax p(y | PC_2, q_3), where ŷ is the sentiment polarity
ultimately predicted by the model. After this step, the query and response are stored in the form of a case
pair in the memory: M = M.append(c_3(q_3, ŷ)).</p>
        <p>
          The 3Q framework combines the memory mechanism with CoT [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], allowing LLMs to use their
conversational consistency capabilities to learn information from past cases. In addition, the prompt
itself is simple and universal. In contrast, THOR emphasizes the Least-to-Most prompt [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. This results
in each dialog being independent, which in turn affects the reasoning capabilities of LLMs [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
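        <p>The three steps above can be sketched as a single loop that carries the accumulated case pairs forward; query_llm is a hypothetical single-turn completion function, and the plain Q/A rendering of past cases stands in for the model-specific prompt format.</p>
        <p>
```python
# Sketch of the What/Why/How loop with an in-process memory of case pairs.

def three_q(sentence: str, field: str, query_llm) -> dict:
    history = []  # past case pairs c_i = (q_i, a_i)

    def ask(question: str) -> str:
        # Combine all past case pairs with the next question as a new query.
        past = "".join(f"Q: {q}\nA: {a}\n" for q, a in history)
        answer = query_llm(past + f"Q: {question}\nA:")
        history.append((question, answer))
        return answer

    # q_1 (What): infer the most crucial entity given the sentence and field.
    entity = ask(f"Given the sentence '{sentence}', the field of it is '{field}'. "
                 "What is the most crucial entity described in it?")
    # q_2 (Why): infer the reason for mentioning that entity.
    reason = ask("Why is this entity mentioned in this sentence?")
    # q_3 (How): infer the sentiment polarity from the prior inferences.
    polarity = ask("Based on the above reasons, how would you describe the "
                   "sentiment polarity towards the sentence?")
    return {"entity": entity, "reason": reason, "polarity": polarity}
```
</p>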
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          We construct a bilingual dataset in both Chinese and English for ISA tasks. It spans multiple domains
and includes high-quality explicit and implicit corpora. We select 6,000 samples from a number of
publicly accessible datasets: SemEval-2014 Task 4 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], SemEval-2015 Task 9 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], SMP-ECISA 2021 1,
CCL2018-Chinese-Metaphor-Analysis 2, and Twitter US Airline Sentiment 3. Among them, 4,000 samples
are used for training and 2,000 for testing. In both the training and testing sets, the ratio of explicit sentences to
implicit sentences is 1:1. The ratio of English sentences to Chinese sentences is also 1:1. Each sample
is classified into one of three sentiment polarities: positive, negative, or neutral. The ratio of these
sentiment polarities in the training set and test set is 1:1:1. The details of the dataset can be found in
Table 1.
        </p>
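        <p>The balance constraints above (1:1 explicit/implicit, 1:1 English/Chinese, 1:1:1 polarity) can be verified with a few lines; the sample-dict keys used here are illustrative assumptions, not the dataset’s actual schema.</p>
        <p>
```python
# Sketch of the balance checks over the dataset splits.
from collections import Counter

def check_balance(samples):
    """Return the label distributions used to verify the split ratios."""
    return {
        "type": Counter(s["type"] for s in samples),          # explicit vs implicit
        "language": Counter(s["language"] for s in samples),  # en vs zh
        "polarity": Counter(s["polarity"] for s in samples),  # pos / neu / neg
    }
```
</p>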
        <p>As we analyzed before, one sentence can convey different meanings or emotions in different fields
or contexts. It is unrealistic to perform sentiment analysis without specifying the related field. None
of the existing official datasets meet this requirement. To address this issue, we annotate each sample
with its corresponding field. Specifically, samples from SemEval-2014 Task 4 are labeled as either
‘laptops’ or ‘restaurants’, while those from Twitter US Airline Sentiment are marked as ‘aviation’.
Chinese datasets represented by SMP-ECISA 2021 have already specified the field or topic in their official
descriptions. Therefore, samples from these datasets are annotated accordingly, including fields such
as ‘daily’, ‘travel’, ‘politics’, etc. Finally, we obtain a bilingual benchmark dataset 4 with 28 different
domains.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>
          We choose the Llama 2-CHAT model from Hugging Face as our backbone LLM [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. It is available in
three sizes: 7B, 13B, and 70B. We also experiment with a leading closed-source model, ChatGPT [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
Note that ChatGPT does not release its model parameters, so we access it by querying the
gpt-3.5-turbo-1106 API. We also compare with the current classic baselines, including BERT-SPC [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and
DistilBERT [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Given the requirement that the model encode both Chinese and English and load a bilingual
tokenizer, we find only these two pre-trained models in the open-source community.
1https://github.com/sxu-nlp/ECISA2021
2https://github.com/DUTIR-Emotion-Group/CCL2018-Chinese-Metaphor-Analysis
3https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
4https://github.com/qinfengsama/3Q-framework
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Baselines</title>
        <p>
          In both zero-shot and fine-tuning settings, we compare our methods with two baseline methods: THOR
and Direct. THOR is the only method that combines LLMs and CoT. The prompt used for it is the
same as in the corresponding research [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Direct itself has no memory mechanism and no multi-step
reasoning. The prompt defined in Section 3.1 is used for comparison. For fairness, we do not provide
system messages for any of the methods.
        </p>
        <p>
          In the zero-shot scenario, we also use Zero-Shot CoT, which has no memory mechanism, as the
baseline method. We adopt the standard prompt from this study [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. The only difference is that we
concatenate the test question with the prompt ‘The sentiment polarity toward the sentence is’ instead of
‘The answer is’ as the LLM’s input.
        </p>
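        <p>As a sketch, the adapted Zero-Shot CoT baseline can be written as a two-stage call: first elicit free-form reasoning, then extract the answer with the modified trigger phrase. query_llm is a hypothetical completion helper.</p>
        <p>
```python
# Sketch of the adapted Zero-Shot CoT baseline: reasoning stage, then
# answer extraction with the modified trigger phrase.

def zero_shot_cot(sentence: str, query_llm) -> str:
    question = f"Given the sentence '{sentence}', what is the sentiment polarity towards it?"
    # Stage 1: elicit step-by-step reasoning.
    reasoning = query_llm(question + "\nLet's think step by step.")
    # Stage 2: extract the answer using the modified trigger phrase.
    answer = query_llm(
        question + "\nLet's think step by step.\n" + reasoning
        + "\nThe sentiment polarity toward the sentence is"
    )
    return answer
```
</p>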
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Implementation Details</title>
        <p>For the Llama 2-CHAT model, we use a temperature of 1 and top-p of 1. This configuration is consistent
across models with 7B, 13B, and 70B parameters. For gpt-3.5-turbo-1106 API, we set temperature to 1,
top-p to 1, and leave other parameters (such as frequency-penalty) unchanged.</p>
        <p>We use cloud A100x1 GPU (40GB) instances, cloud A100x1 GPU (80GB) instances, and cloud A100x4
GPU (40GB) instances to fine-tune the Llama 2 models with sizes of 7B, 13B, and 70B, respectively. The average
tuning time is 1 hour. After fine-tuning, it takes an average of 17 hours to perform inference on the
2000-sample test set. For the gpt-3.5-turbo-1106 model, we use the OpenAI API for fine-tuning, which
takes about 2 hours on average. After tuning, the inference process on the entire test set takes about 5
hours.</p>
        <p>To train the BERT baseline, we use the AdamW optimizer with a learning rate of 5e-5. The batch size
is set to 64 and the dropout probability is set to p=0.1. We use NVIDIA 3090x1 GPU instances in the
cloud.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evaluation Metrics</title>
        <p>We take F1 as one of the evaluation metrics. During the experiments, we observe a significant
difference in the neutral F1 compared to the positive and negative F1 scores. Therefore, we also include
the neutral F1 in our evaluation metrics, recognizing its importance in assessing excessive emotion
interpretation.</p>
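        <p>For concreteness, both metrics can be computed from per-class counts alone; the following is a standard sketch, not the authors’ evaluation script.</p>
        <p>
```python
# Sketch of the two reported metrics: per-class F1 (used for the neutral
# label) and macro F1 over the three polarities.

def f1_per_class(gold, pred, label):
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels=("positive", "neutral", "negative")):
    return sum(f1_per_class(gold, pred, l) for l in labels) / len(labels)
```
</p>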
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <sec id="sec-5-1">
        <title>5.1. Results on Zero-shot Reasoning</title>
        <p>
          Table 2 presents the comparison results under zero-shot settings. It is evident that the four methods
that combine LLMs with prompt engineering significantly outperform current state-of-the-art (SoTA)
baselines. Among these, 3Q stands out for its impressive performance. Specifically, when equipped with
the 7B Llama 2-CHAT model, it leads the best performing baseline (BERT-SPC) by 21.66%. As the model
scale increases, the gap in F1 between the two also widens, peaking at a 28.88% diference with the 70B
parameter model. This is consistent with the study’s conclusion [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] that reasoning-based methods can
achieve significant advances over traditional non-reasoning methods.
        </p>
        <p>On all three scales of the Llama2 model, 3Q outperforms the other three prompt methods and achieves
a new SOTA performance. Specifically, in the ISA setting, its F1 score is approximately 7.86%, 11.86%, and
20.55% higher than Zero-shot CoT, Direct, and THOR, respectively. This suggests that the combination
of both memory mechanisms and multi-step reasoning can provide a greater improvement in implicit
sentiment analysis than either memory mechanisms or multi-step reasoning alone. Interestingly, the
diference in F1 score between 3Q and THOR is the most significant. This is because a sentence usually
contains multiple targets or entities. Sometimes sentence-level sentiment may contradict aspect-level
sentiment. We argue that THOR’s ability is limited by its own prompt architecture, which is more
concerned with aspect-level sentiment. This is likely to lead to significant errors. In contrast, 3Q
does not initially specify entities. Instead, it encourages LLMs to infer the entity that best identifies
sentence-level sentiment based on the domain. In the subsequent ‘why’ section, LLMs are prompted to
understand and contextualize the reasons behind the most critical entity assertions. In addition, 3Q’s
memory mechanism helps to model the relationship between the most crucial entity and other entities
in the context, thereby improving the understanding of sentiment at the sentence level. Similar benefits
are seen in both Zero-Shot CoT and Direct.</p>
        <p>In addition, we note that as the model scale increases, not only does the gap between 3Q and THOR
widen, but Direct has also surpassed THOR. These two phenomena are primarily due to the
ever-widening neutral F1 gap, which is related to the ‘over-interpretation of neutral emotions’ phenomenon
that we discuss later.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results on Supervised Fine-tuning</title>
        <p>
          The comparisons under fine-tuning are shown in Table 3. Overall, high-quality instructional fine-tuning
significantly improves performance. This is consistent with the conclusions of the Flan-T5 study [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ],
which suggests that model performance improves significantly with more fine-tuning tasks. Taking the
7B model as an example, instruction fine-tuning increases the F1 of 3Q, THOR, and Direct by 18.79%,
19.12%, and 37.96%, respectively. Similar trends are observed in other configurations, and 3Q achieves
SOTA performance when equipped with the 70B model. Under this configuration, the F1 score reaches
an impressive 94.01%, surpassing that of BERT-SPC by 24.04%. Furthermore, after comparing three
prompt methods, we find that 3Q consistently outperforms THOR by at least 10% in F1 score, a stable
improvement that is also reflected in the zero-shot settings. In addition, a major benefit of prompt
tuning is the increased gap between 3Q and Direct: 3Q leads Direct by 5.43% F1 even on the smallest 7B
model.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Emotional Over-interpretation</title>
        <p>Table 2 shows that the F1 score for neutral sample sentiment classification remains low regardless of
the configuration. Performance improves slightly when the model parameters reach a large size, as
seen in Llama 2-CHAT 70B and gpt-3.5-turbo-1106. This suggests that accurate identification of neutral
sentiment continues to be a significant obstacle in ISA tasks.</p>
        <p>It is worth noting that THOR maintains an average F1-neu of only about 20% in most cases. Such poor
performance results from its second step, the induction of the implicit opinion. The original prompt is, ‘Based
on common sense, what is the implicit opinion towards the mentioned aspect of t, and why?’. The phrase
‘implicit opinion’ causes LLMs to infer inappropriate implicit opinions from neutral sentences, which
introduces great interference and error into the next step of judging sentiment polarity.</p>
        <p>In contrast, 3Q outperforms THOR by nearly 50% in average neutral F1 and establishes a new SOTA
score of 76.86% on the 70B Llama 2-CHAT model. It is also 9.88% higher than Direct under similar
configurations. This is because in the ‘why’ section, 3Q does not introduce subjective interventions like
‘implicit opinion’. Instead, it only leads LLMs to infer the reasons for mentioning entities. In neutral
sentences, the most critical entity tends to state a specific fact or serve a specific purpose. It is usually
mentioned for objective reasons. Therefore, 3Q can effectively mitigate the problem of over-interpreting
emotions. Figure 3 visualizes the entire analysis. While Direct also has similar benefits with a relatively
high neutral F1 score, it is still 10% lower on average than 3Q. This suggests that the combination of
both memory mechanisms and multi-step reasoning can provide a greater improvement in implicit
sentiment analysis than either memory mechanisms or multi-step reasoning alone.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Influence of Different Model Sizes of LLMs</title>
        <p>
          In Figure 4 we examine the influence of different LLM scales. As the model scale increases, the F1
scores for all three methods generally show a growth trend. This is consistent with the existing findings of
CoT prompting methods [
          <xref ref-type="bibr" rid="ref12 ref13 ref30">12, 13, 30</xref>
          ], suggesting that larger LMs have more extensive prior knowledge
and improved logical reasoning abilities. In addition, high-quality instruction fine-tuning leads to a
significant improvement in F1 scores. This aligns with the conclusions of the Flan-T5 study[
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which
proposes that instructional fine-tuning is a general method for improving the performance and usability
of pre-trained language models. Among these prompt methods, 3Q experiences the largest increase,
with an average improvement of 18%. Most strikingly, its zero-shot performance approaches that of
THOR after instructional fine-tuning, suggesting that 3Q has significant potential and a higher ceiling
for performance.
        </p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Error Analysis</title>
        <p>We examine the error rates and error cases of the 3Q model in both zero-shot and supervised fine-tuning
settings on our test set. Errors are categorized into three types: a Type A error represents over-analyzing
neutral corpora, where the model mistakenly interprets neutral content as positive or negative sentiment;
a Type B error occurs when the model fails to capture sentiment clues in the corpora and misjudges
negative or positive content as neutral; a Type C error means the model confuses the two non-neutral emotions,
labeling positive as negative or vice versa. The distribution of these three types of errors is illustrated
in Figure 5.</p>
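The taxonomy above depends only on the gold and predicted labels, so it can be stated as a small decision rule (a sketch for illustration, not the authors' evaluation code):

```python
# Classify a misprediction into the Type A/B/C taxonomy of Section 5.5.
# Labels are assumed to be 'positive', 'negative', or 'neutral'.

def error_type(gold: str, pred: str):
    """Return 'A', 'B', or 'C' for an error, or None if the prediction is correct."""
    if gold == pred:
        return None  # not an error
    if gold == "neutral":
        return "A"   # over-analysis: neutral read as positive/negative
    if pred == "neutral":
        return "B"   # missed clues: positive/negative read as neutral
    return "C"       # polarity confusion: positive vs. negative swapped
```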
        <p>From the experimental results, it is evident that in the zero-shot setting, gpt-3.5-turbo-1106 is
predominantly associated with Type A errors: of the total error rate of 21.21%, Type A errors
account for 18.72%. Supervised fine-tuning reduces the Type A error rate to a low level (3.64%),
eliminating approximately 80% of Type A errors. In contrast, the proportions of Type B and Type C
errors show no significant changes before and after supervised fine-tuning. Moreover, these two error
types account for a smaller portion of the overall errors, possibly due to the inherent difficulty of the
self-built corpora and the limitations of model capabilities.</p>
        <p>Unlike gpt-3.5-turbo-1106, the Llama 2-CHAT-70B model tends to misclassify sentences as neutral.
In the zero-shot setting, it exhibits a Type B error rate of 7.09%,
which surpasses its Type A error rate (5.79%). Even after supervised fine-tuning, its Type B error rate
(3.09%) exceeds the zero-shot result of the gpt-3.5-turbo-1106 model (1.60%). This suggests that the
Llama 2-CHAT-70B model is prone to overlooking sentiment clues and misclassifying sentiment-bearing samples as neutral.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this research, we propose a What, Why, and How three-question (3Q) framework that incorporates
a memory mechanism for past cases. Specifically, 3Q performs a sequential three-step prompt by
simulating the human memory and reasoning process. It first determines what the most critical entity in
the sentence is, then infers why the speaker mentions it. Finally, 3Q uncovers the hidden emotions
based on past cases. During the reasoning process, historical queries and responses are stored
in memory as past case pairs. These pairs enable the retrieval and creation of enhanced prompts for
any new query, thereby enhancing the implicit sentiment analysis capabilities of LLMs. To verify its
effectiveness, we construct a Chinese-English bilingual dataset based on several official datasets.
3Q achieves a new SoTA under both supervised fine-tuning and zero-shot settings. Experimental results
show that it significantly outperforms THOR, Direct, and Zero-Shot CoT in F1 scores, and effectively
reduces the negative effects of over-interpreting emotions. We release the dataset in an associated
repository and hope that its accessibility will encourage the community to evaluate novel
methods for implicit sentiment analysis. In the future, we will extend this approach to real-world
document understanding. In addition, we plan to further explore the ISA task by simulating the human
nonlinear reasoning process from the perspective of Tree of Thoughts (ToT) and Graph of Thoughts
(GoT).</p>
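The past-case memory described above can be sketched as a store of (query, response) pairs from which the most similar prior case is retrieved to enrich a new prompt. This is a hypothetical minimal implementation; the retrieval metric (token overlap) and prompt template are illustrative assumptions, not the paper's actual design:

```python
# Hypothetical sketch of the past-case memory mechanism: store (query, response)
# pairs and prepend the most similar past case to each new query.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

class CaseMemory:
    def __init__(self):
        self.cases = []  # list of (query, response) pairs

    def add(self, query: str, response: str) -> None:
        """Store a completed query/response pair as a past case."""
        self.cases.append((query, response))

    def build_prompt(self, new_query: str) -> str:
        """Retrieve the most similar past case and fold it into the prompt."""
        if not self.cases:
            return new_query
        q, r = max(self.cases, key=lambda c: similarity(c[0], new_query))
        return f"Past case:\nQ: {q}\nA: {r}\n\nNew query: {new_query}"
```

As reasoning proceeds, each answered query is added back into the memory, so later queries benefit from an ever-growing pool of retrievable cases.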
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, SemEval-2014 task 4: Aspect based sentiment analysis, in: P. Nakov, T. Zesch (Eds.), Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 27-35. URL: https://aclanthology.org/S14-2004. doi:10.3115/v1/S14-2004.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] I. Russo, T. Caselli, C. Strapparava, SemEval-2015 task 9: CLIPEval implicit polarity of events, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015, pp. 443-450.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. He, W. S. Lee, H. T. Ng, D. Dahlmeier, Effective attention modeling for aspect-level sentiment classification, in: E. M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 1121-1131. URL: https://aclanthology.org/C18-1096.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] H. Tang, D. Ji, C. Li, Q. Zhou, Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 6578-6588. URL: https://aclanthology.org/2020.acl-main.588. doi:10.18653/v1/2020.acl-main.588.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Z. Li, Y. Zou, C. Zhang, Q. Zhang, Z. Wei, Learning implicit sentiment in aspect-based sentiment analysis with supervised contrastive pre-training, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 246-256. URL: https://aclanthology.org/2021.emnlp-main.22. doi:10.18653/v1/2021.emnlp-main.22.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Wang, J. Zhou, C. Sun, J. Ye, T. Gui, Q. Zhang, X. Huang, Causal intervention improves implicit sentiment analysis, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 6966-6977. URL: https://aclanthology.org/2022.coling-1.607.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730-27744.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Schramowski, C. Turan, N. Andersen, C. A. Rothkopf, K. Kersting, Large pre-trained language models contain human-like biases of what is right and wrong to do, Nature Machine Intelligence 4 (2022) 258-268.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Madaan, N. Tandon, P. Clark, Y. Yang, Memory-assisted prompt editing to improve GPT-3 after deployment, arXiv preprint arXiv:2201.06009 (2022).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Paranjape, J. Michael, M. Ghazvininejad, L. Zettlemoyer, H. Hajishirzi, Prompting contrastive explanations for commonsense reasoning tasks, arXiv preprint arXiv:2106.06823 (2021).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Liu, A. Liu, X. Lu, S. Welleck, P. West, R. Le Bras, Y. Choi, H. Hajishirzi, Generated knowledge prompting for commonsense reasoning, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3154-3169. URL: https://aclanthology.org/2022.acl-long.225. doi:10.18653/v1/2022.acl-long.225.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824-24837.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Zhang, A. Zhang, M. Li, A. Smola, Automatic chain of thought prompting in large language models, arXiv preprint arXiv:2210.03493 (2022).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Fei, B. Li, Q. Liu, L. Bing, F. Li, T.-S. Chua, Reasoning implicit sentiment with chain-of-thought prompting, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1171-1182. URL: https://aclanthology.org/2023.acl-short.101. doi:10.18653/v1/2023.acl-short.101.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877-1901.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459-9474.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Memory-assisted prompt editing to improve gpt-3 after deployment</article-title>
          ,
          <source>arXiv preprint arXiv:2201.06009</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2112.09737</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>PaLM: Scaling language modeling with pathways</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2206.07682</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.10997</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2201.11903</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schärli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , et al.,
          <article-title>Least-to-most prompting enables complex reasoning in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2205.10625</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ). doi:10.48550/arXiv.1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <source>arXiv preprint arXiv:1910.01108</source>
          (
          <year>2019</year>
          ). doi:10.48550/arXiv.1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <article-title>Complexity-based prompting for multi-step reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2210.00720</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>