<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pei-Fu Guo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ying-Hsuan Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yun-Da Tsai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shou-De Lin</string-name>
          <email>sdlin@csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Information Engineering, National Taiwan University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this study, we evaluate the optimization capabilities of Large Language Models (LLMs) across diverse mathematical and combinatorial optimization tasks, where each task is described in natural language. These tasks require LLMs to iteratively generate and evaluate solutions through interactive prompting: each optimization step generates new solutions based on past results, which are then passed to subsequent iterations. We demonstrate that LLMs can perform various optimization algorithms and act as effective black-box optimizers, capable of intelligently optimizing unknown functions. We also introduce three simple yet informative metrics to evaluate optimization performance, applicable across diverse tasks and less sensitive to test sample variations. Our findings reveal that LLMs excel at optimizing small-scale problems with limited data, and that their performance is significantly affected by the dimension of the problem and its values, highlighting the need for further research in LLM optimization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>KiL’24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference</p>
      <p>∗Both authors contributed equally.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>In various optimization scenarios, the utilization of Large Language Models (LLMs) has become indispensable for the development of optimization algorithms or agent systems capable of handling complex and informative text-based feedback. In this section, we summarize three significant related works that leverage LLMs to tackle optimization and reinforcement learning challenges. These works showcase the adaptability and effectiveness of LLMs in addressing optimization and learning challenges across various domains.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-4">
      <title>1. Introduction</title>
      <p>
        Large Language Models have demonstrated exceptional
capabilities in reasoning across a variety of natural
language-based tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>However, their potential extends beyond multiple-choice questions or single-question answering. This work explores LLMs’ effectiveness in optimization across diverse tasks and problem dimensions. Optimization involves iteratively generating and evaluating solutions to improve a given objective function. Our research assesses LLM performance in interactive optimization, where each step generates new solutions based on previous ones and their values.</p>
      <p>We conduct our study with four different types of optimization tasks.</p>
      <p>Our findings suggest that LLMs show impressive optimization capabilities, yet they also underscore the need for further research within the domain of optimization tasks tailored for LLMs. It’s important to note that our work does not aim to outperform state-of-the-art optimization algorithms for either mathematical optimization or combinatorial optimization problems. Instead, our goal is to showcase the potential of LLMs in these optimization domains and identify their limitations in these settings.</p>
      <p>Our contributions are summarized as follows:
• Exploring the potential of LLMs in mathematical and combinatorial optimization scenarios.
• Introducing three novel metrics for assessing LLM performance in optimization tasks.
• Delving into factors that influence LLM performance using our metrics, with a particular emphasis on the impact of problem dimension and task type.</p>
      <p>The remainder of this paper is structured as follows. In Section 2, we present preliminary works on LLMs for addressing optimization challenges. In Section 3, we define four optimization algorithms in the case studies. In Section 4, we demonstrate that LLMs with an iterative prompting strategy function as optimizers. In Section 5, we present three metrics that we have designed for a comprehensive evaluation of LLM performance. Section 6 details our experimental results. Finally, Section 8 summarizes and concludes the paper.</p>
      <p>
        Optimization by PROmpting (OPRO) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] OPRO harnesses LLMs as versatile optimizers by describing optimization tasks in natural language prompts. It iteratively generates and evaluates solutions from these prompts, demonstrating superior performance on tasks like linear regression and traveling salesman problems. OPRO outperforms human-designed prompts by up to 50% on challenging tasks.
      </p>
      <p>Grid-Search assesses the LLM’s ability to conduct exhaustive searches and locate optimal solutions within a predefined search space. LLMs are tasked with generating all grid points and systematically searching for the point that results in the lowest loss according to the given loss function.</p>
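      <p>As an illustration of the Grid-Search task, the exhaustive search the LLM is asked to emulate can be sketched as follows (our own minimal sketch, not the paper’s code; the integer bounds are an assumption):</p>

```python
# Minimal sketch of the Grid-Search task: enumerate every grid point in a
# predefined integer search space and return the point with the lowest loss.
from itertools import product

def grid_search(loss_fn, dim, lo=0, hi=10):
    # The number of grid points is (hi - lo + 1) ** dim, so the search space
    # grows exponentially with the problem dimension.
    grid = product(range(lo, hi + 1), repeat=dim)
    return min(grid, key=loss_fn)
```

      <p>For example, minimizing (x − 3)² + (y − 7)² over the 11 × 11 grid returns the point (3, 7).</p>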
      <p>
        Reflexion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] Reflexion introduces a novel framework for training language agents that rely on linguistic feedback rather than traditional reinforcement learning. This framework delivers outstanding results, boasting a remarkable 91% pass@1 accuracy on coding tasks, an exceptional 11% improvement over previous state-of-the-art models. Reflexion’s success underscores the potential of linguistic feedback as a powerful training mechanism.
      </p>
      <p>Black-Box Optimization evaluates the LLM’s ability to make informed decisions and optimize in an abstract problem-solving context. We treat the LLMs as black boxes that try to fit an unknown loss function. We provide the LLM with a limited set of solutions, each paired with its respective true loss value. The LLM’s objective is to discover, on its own, new solutions that have lower losses than the existing solutions in each iteration.</p>
      <p>
        EvoPrompt [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] EvoPrompt automates prompt optimization by connecting LLMs with evolutionary algorithms. This automated process surpasses human-designed prompts by up to 25% and outperforms existing automatic prompt generation methods by an impressive 14%. EvoPrompt’s success highlights the relationship between Large Language Models and traditional algorithms, showcasing the potential for enhanced problem-solving capabilities through this synergistic fusion.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. Problem setting</title>
      <p>We design four optimization tasks that require the model
to algorithmically search for the optimal value of
parameters. These tasks encompass Gradient-Descent,
Hill-Climbing, Grid-Search, and Black-Box
Optimization, each representing unique optimization domains:
gradient-based, meta-heuristics, decision-theoretic, and
Bayesian. In terms of parameter types, Grid-Search
and Hill-Climbing involve discrete search spaces, while
Gradient-Descent and Black-Box Optimization tackle
continuous search spaces. Following is detailed
information on each optimization task.</p>
      <p>Gradient-Descent assesses the model’s proficiency
in advanced calculations and its grasp of the principles of
gradient descent. We instruct LLMs to undertake a
conventional gradient descent optimization process based
on the loss function they have defined. LLMs need to
compute the gradient and update the parameters using
the gradient information and the learning rate given.</p>
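      <p>The update rule the LLM is instructed to follow can be sketched as below (a hypothetical illustration on an MSE loss; the function name and toy target values are ours, not the paper’s):</p>

```python
# Hypothetical sketch of the conventional gradient-descent step the LLM is asked
# to perform: compute the gradient of an MSE loss analytically, then update each
# parameter with the given learning rate.
def gradient_step(y_hat, y, lr):
    n = len(y_hat)
    # Gradient of L = mean((y_hat - y)^2) with respect to y_hat[j].
    grad = [2.0 * (y_hat[j] - y[j]) / n for j in range(n)]
    # Standard update: parameter minus learning rate times gradient.
    return [y_hat[j] - lr * grad[j] for j in range(n)]

solution = [0.0, 0.0, 0.0]
for _ in range(10):  # ten search iterations, mirroring the experimental setup
    solution = gradient_step(solution, [3.0, 6.0, 9.0], lr=1.0)
```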
      <p>Hill-Climbing evaluates the LLM’s capability to adhere to custom predefined rules they have not seen before. LLMs start with an initial solution and iteratively explore nearby solutions by making small incremental changes. In our task, neighboring solutions are generated by selecting a specific element within the solution and either increasing or decreasing it by one each time. Subsequently, the neighbor solution with the minimum loss is chosen as the new solution and passed to the next iteration.</p>
      <p>In this section, we show how LLMs, guided by iterative prompting, can effectively function as optimizers, akin to various optimization algorithms. To systematically navigate the search space, we introduce an iterative prompting framework that enables LLMs to incrementally achieve better solutions within the search space through iterative processes.</p>
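      <p>The Hill-Climbing neighbor rule can be sketched as follows (our illustration under an assumed MSE loss, not the paper’s code):</p>

```python
# Sketch of the hill-climbing rule the LLM must follow: perturb one element of
# the current solution by +1 or -1, then keep the lowest-loss neighbor.
def mse(sol, target):
    return sum((s - t) ** 2 for s, t in zip(sol, target)) / len(sol)

def hill_climb_step(sol, target):
    neighbors = []
    for j in range(len(sol)):
        for delta in (1, -1):
            cand = list(sol)
            cand[j] += delta
            neighbors.append(cand)
    # The neighbor with the minimum loss becomes the next solution.
    return min(neighbors, key=lambda c: mse(c, target))
```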
      <p>We applied Chain of Thought and iterative prompting as our prompting method. The LLM accomplishes each step with reasoning thoughts as intermediate outputs. In each of these tasks (optimization algorithms), LLMs are initially required to formulate the loss function based on given samples. Each optimization iteration is then composed of two steps: (1) generate a new solution based on the algorithm instructions and past search results; (2) calculate the loss of the new solution and add the result to the prompt of the next iteration. We keep repeating the two steps until the stop criteria are met. Figure 1 shows an overview of how the LLM performs optimization in interactive settings.</p>
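      <p>The two-step loop can be sketched as follows; here <monospace>propose_solution</monospace> is a hypothetical stand-in for the LLM call, which in our setting receives the accumulated solution-loss history as its prompt:</p>

```python
# Minimal sketch of the iterative prompting loop: (1) generate a candidate from
# past results, (2) score it and append the solution-loss pair for the next round.
def optimize(propose_solution, loss_fn, init, iterations=10):
    history = [(init, loss_fn(init))]          # solution-loss pairs fed back each round
    for _ in range(iterations):
        candidate = propose_solution(history)  # step 1: generate from past results
        history.append((candidate, loss_fn(candidate)))  # step 2: score and append
    best = min(history, key=lambda pair: pair[1])
    return best[0]
```

      <p>Any search strategy, from gradient steps to black-box guessing, fits this loop by changing only how <monospace>propose_solution</monospace> reads the history.</p>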
      <p>To create an interactive environment, we utilize the chat mode of GPTs, where the entire conversation history serves as the prompt. This allows LLMs to retain memory of past search results and reasoning paths. New instructions are appended to ongoing conversation records with each iteration. If the dialogue surpasses the token limit, earlier portions are removed.</p>
      <p>It is also crucial to appraise the LLMs’ capability to operate in a manner consistent with our truth model algorithm. This metric serves as an indicator of the LLM’s adeptness in adhering to task-specific instructions. We define the policy metric of a test sample s as:
policy_s = (1/T) Σ_{i=1}^{T} (L_{s,i} − L_s^{truth})
where L_{s,i} is the LLM output loss of trial i, L_s^{truth} is the ground truth loss of sample s, and T is the number of trials.</p>
      <p>Since the policy metric measures the disparity between the ground truth and the LLM’s output, a lower policy metric value indicates a more effective alignment of the LLM’s actions with the prescribed guidelines. When the value is negative, it means that the LLM’s performance surpasses the ground truth.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Evaluation</title>
      <p>We devised three novel metrics for the comprehensive evaluation of LLM capabilities. In this section, we will explain the design and objective of each metric. These metrics offer versatility in assessing LLM performance across diverse tasks, making concurrent evaluation easier. Their reliance on ratio measures, rather than differences, makes them less sensitive to sample variations.</p>
      <sec id="sec-6-1">
        <title>5.1. Goal Metric</title>
        <p>Goal metric evaluates how effectively LLMs perform optimization. It provides a quantitative measure of the degree to which the LLM contributes to minimizing the loss function values; in other words, it measures whether the ultimate solution loss is lower than the initial solution loss. We define the goal metric of a test sample s as:
goal_s = (1/T) Σ_{i=1}^{T} (L_s^{init} − L_{s,i}) / L_s^{init}
where L_{s,i} is the LLM output loss of trial i, L_s^{init} is the initial solution loss of sample s, and T is the number of trials per sample. The higher the metric value, the greater the progress in optimization. The goal metric plays a crucial role in our evaluation framework, particularly in scenarios where ground truth is absent, such as the Black-Box optimization scenarios.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Policy Metric</title>
        <p>Policy metric assesses the degree of alignment between the final model output and the ground truth. Beyond self-improvement, which is measured by the goal metric, it also verifies that the LLM follows the prescribed algorithm: for problems with a known optimum, the final optimal output should be identical to the ground truth in every trial of the same sample. A lower policy metric value indicates better alignment, and a negative value means that the LLM’s output surpasses the ground truth.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Uncertainty Metric</title>
        <p>Uncertainty metric quantifies the variability in the LLM’s solutions under identical conditions. Stability is a crucial characteristic in optimization tasks. We hope that the LLMs produce identical results in every trial involving the same sample, even under conditions with temperatures greater than zero. We define the uncertainty metric of a test sample s as:
uncertainty_s = (1/T) Σ_{i=1}^{T} (L_{s,i} − L̄_s)²
where L_{s,i} is the LLM output loss of the i-th trial, L̄_s is the mean of the trial outputs, and T is the number of trials. A stable LLM can be more trusted for tasks that demand consistent and reproducible results; if the language model truly understands the context, it should behave consistently across trials of the same sample.</p>
        <p>This section provides details of our experimental configurations and highlights the outcomes of experiments. Subsection 6.1 outlines the process of generating synthetic datasets for all optimization tasks, while subsection 6.2 elucidates the detailed settings of our experiment. Lastly, subsection 6.3 offers a concise summary of the outcomes derived from our experiment.</p>
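        <p>Per test sample, the three metrics can be computed from the final losses of repeated trials as follows (a sketch with our own variable names, not the paper’s code; <monospace>trial_losses</monospace> holds the LLM’s final loss per trial):</p>

```python
# Sketch of the three evaluation metrics for one test sample s over T trials.
def goal_metric(trial_losses, init_loss):
    # Mean loss reduction relative to the initial solution loss; higher is better.
    T = len(trial_losses)
    return sum((init_loss - L) / init_loss for L in trial_losses) / T

def policy_metric(trial_losses, truth_loss):
    # Mean signed gap to the ground-truth loss; lower is better, and a negative
    # value means the LLM surpassed the truth model.
    T = len(trial_losses)
    return sum(L - truth_loss for L in trial_losses) / T

def uncertainty_metric(trial_losses):
    # Variance of the final losses across repeated trials of the same sample.
    T = len(trial_losses)
    mean = sum(trial_losses) / T
    return sum((L - mean) ** 2 for L in trial_losses) / T
```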
      </sec>
      <sec id="sec-6-4">
        <title>6.1. Dataset</title>
        <p>In the experiment, we create five datasets with dimension values chosen from the set {3, 6, 12, 24, 48} and generate instances with parameters in [0, 10] in each dataset to examine sensitivity to the number of parameters, representing the dimension of the optimization problem. For instance, a dimension of 3 indicates that there are 3 variables in the loss function and the dimension of this optimization problem is 3.</p>
        <p>We then apply each instance to a loss function and find
the true solution for each parameter search task. These
authenticated solutions, coupled with their associated
losses, not only serve as the ground truth for the tasks
but also act as a pivotal benchmark against which the
solutions derived by LLMs are systematically evaluated
and compared in the ensuing analysis.</p>
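        <p>The dataset construction above can be sketched as follows (a hypothetical illustration; the instance count per dataset and the random seed are our assumptions, not stated in the paper):</p>

```python
# Hypothetical sketch of synthetic dataset generation: for each dimension n,
# draw instances whose parameters lie in [0, 10].
import random

def make_dataset(dims=(3, 6, 12, 24, 48), instances_per_dim=5, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    return {n: [[rng.uniform(0, 10) for _ in range(n)]
                for _ in range(instances_per_dim)]
            for n in dims}
```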
      </sec>
      <sec id="sec-6-5">
        <title>6.2. Detailed Settings</title>
        <p>In our experiment, we set the LLM temperature to 0.8 and leave the rest as default. We performed 5 repetitions of the test for each instance in the dataset, with the LLM conducting 10 iterations of parameter search in each repetition. We excluded excessively biased results to prevent our metrics from being skewed by a minority of poorly performing test outcomes. All experiments employ the GPT-3.5-turbo ’0613’ version as the Language Model.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.3. Main Results</title>
        <p>We summarize the outcomes of our experiment and subsequently examine the common trends observed across all experiments. In every plot, the x-axis displays the dimension of the optimization problem. In the case of the goal metric and policy metric plots, the y-axis illustrates the average metric value for the respective tasks, while the shaded area in a lighter color delineates the confidence interval of the metric, denoted as [μ − σ, μ + σ]. As for the uncertainty metric plot, the y-axis showcases the uncertainty metric value, which corresponds to the standard deviation of the LLM final solution loss. It is worth noting that the Goal Metric graph excludes the Grid-Search task due to its non-iterative nature, while the Policy Metric graph omits the Black-Box task due to unattainable ground truth.</p>
        <p>LLMs show strong optimization capabilities in small-scale problems. Our experiments test the comprehensive optimization capabilities of LLMs. Observing figure 2, GPT-3.5-turbo showcases considerable optimization capabilities across various scenarios. Impressively, in the Gradient-Descent task, GPT-3.5-turbo even surpasses the ground truth, particularly in the case of the sample dimension equal to six. It is also surprising that the model achieves respectable results in the Grid-Search task, considering it must compute a vast number of grid points, which increase exponentially as the dimension of the problem expands. The model faces challenges in the Hill-Climbing task, evident from a policy metric significantly exceeding zero. This suggests that meta-heuristics may pose greater difficulty for LLMs compared to other tasks.</p>
        <p>Figure 2: Goal Metric and Policy Metric hover from positive to near zero, signifying substantial optimization capability and alignment between LLM’s output and ground truth.</p>
        <p>LLMs show potential as Black-Box Optimizers. Favorable performance in Black-Box experiments suggests the use of an LLM as an optimizer without giving any algorithm instructions. From figure 3, we can see that GPT-3.5-turbo performs notably well when the dimension of the problem is three, whereas GPT-4 excels when the dimensions are three and six. Interestingly, as the dimension increases, the performance of both models gradually diminishes. Overall, GPT-4 edged out GPT-3.5-turbo by a slight margin in optimization and stability.</p>
        <p>LLMs exhibit strong performance in Gradient-Descent. The Gradient-Descent experiment tests the model’s proficiency in advanced calculations and grasp of mathematical principles. Figure 4 underscores this by revealing a policy metric that consistently hovers near zero, signifying a remarkable alignment between the LLM’s output and the ground truth. Despite a decline in the goal metric as the sample size increases, the consistently low and stable value of the policy metric underscores the fact that GPT’s performance in the gradient-descent task is nearly on par with the truth model.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Analysis and Discussion</title>
      <p>In this section, we consolidate several crucial insights
derived from our experimental results and subject them
to analysis.</p>
      <p>
        Pretrained Knowledge dominates the
optimization capability of LLM. Among all optimization tasks
performed by LLMs, Gradient Descent emerges as the
leading performer, while Hill-Climbing poses greater
challenges. The main difference between the two tasks
is that Hill-Climbing is a heuristic algorithm with more
user-specific parameters, whereas gradient descent is an
optimization algorithm that relies more on mathematical
principles. This suggests that LLM optimization
capabilities primarily stem from pretrained knowledge stored
within the model parameters, rather than from context
knowledge provided by users. Our findings align with
previous research [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] showing that language models
often prioritize their prior knowledge over new context.
      </p>
      <p>Achieving balanced attention to both prior and context
knowledge is essential for further research to improve
the optimization capability of language models.</p>
      <p>LLMs are potential hybrid optimizers. The
predominantly positive goal metric values across most tasks
and datasets indicate LLMs’ capability for optimization.</p>
      <p>This highlights their versatile capacity to optimize across different problem spaces, potentially allowing for switching between optimization methods within a single task. Such switching can help LLMs better explore the solution space and escape local optima where they might get stuck. This is a significant advantage of LLMs in optimization, as they can easily change methods through a simple natural language prompt during iterations. Furthermore, LLMs can act as agents (world models) that use different algorithms as tools (actions), switching methods by evaluating the optimization path from past to present (state). This adaptability underscores the potential of LLMs to enhance optimization processes through dynamic method selection and strategic problem-solving.</p>
      <p>LLMs possess richer solution space in small-scale problems. In our experiments, we observed high uncertainty metric values and significant variations in policy and goal metrics when samples had smaller dimensions. Interestingly, LLMs tend to perform more effectively with smaller dimension instances, suggesting a correlation between higher uncertainty and better performance. This consistent pattern across various tasks and models indicates that LLMs have a richer solution space when tackling small-scale problems. The expanded solution space leads to higher uncertainty, providing LLMs with a broader range of solutions to explore. This highlights the importance of dimension reduction in data preprocessing for effective optimization by LLMs. Figures 2 and 5 both highlight the pattern of uncertainty, where the uncertainty initially rises and then gradually decreases.</p>
      <p>
        LLMs are sensitive to numerical values. The results may be influenced by the inherent randomness in the generation of test samples. Previous research has indicated that LLMs may demonstrate preferences for particular numbers, words, and symbols [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which can introduce a level of bias in their responses. Given the high sensitivity of LLMs to the input prompt, the initial starting points and data provided can exert a significant influence on their outputs. In essence, the impact of instruction description and data initialization should be carefully considered when interpreting the results of LLM-based experiments to ensure a more accurate assessment of their performance.
      </p>
      <p>
        Self-consistency prompting improves stability. In the Gradient-Descent task, we employ the self-consistency technique [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where we conduct five repetitions for each iteration and select the solution that emerges most frequently. From Figure 6, we can see that GPT-4 performance increases largely, and the confidence interval for both the policy metric and goal metric narrows, indicating improved stability and reliability. Nonetheless, this approach does not yield favorable outcomes when applied to GPT-3.5-turbo. This suggests the need for further investigation within the realm of variance reduction.
      </p>
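      <p>The self-consistency selection step can be sketched as follows (our illustration, not the paper’s code; candidate solutions are represented as tuples so they can be counted):</p>

```python
# Sketch of self-consistency selection: run several generations per iteration
# and keep the solution that appears most often among the candidates.
from collections import Counter

def self_consistent_choice(samples):
    # `samples` are candidate solutions (as hashable tuples) from repeated generations.
    counts = Counter(samples)
    return counts.most_common(1)[0][0]
```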
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Directions</title>
      <p>In this paper, we present our in-depth examination of
assessing Large Language Models within the realm of
optimization, where LLM progressively generates new
solutions to optimize an objective function. We investigate
LLMs’ performance across four optimization tasks that
necessitate their comprehension of algorithmic
instructions and their ability to generate new solutions based
on previous solutions and their corresponding values.</p>
      <p>Our evaluation shows that LLMs showcase optimization prowess across diverse domains. Among the four tasks we examined, LLMs exhibit their greatest strengths in the Gradient-Descent task, displaying remarkable proficiency in this area. However, they encounter more pronounced difficulties in the meta-heuristics task, where they must adhere to predefined rules that they have not encountered previously. Furthermore, LLMs demonstrate impressive skills in the grid search task, showcasing their ability to conduct exhaustive searches effectively. In the Black-Box task, LLMs excel, particularly when dealing with limited sample sizes, suggesting inherent optimization abilities within them.</p>
      <p>Figure 5: An initial rise followed by a decline in the Uncertainty Metric with instance dimension growth suggests LLMs may have a richer sample space for small-scale problems, consistent across tasks and models.</p>
      <p>We also consolidate several crucial insights derived from our experimental results and subject them to analysis. We find that pretrained knowledge dominates the optimization capability of LLMs, while they also possess a richer solution space in small-scale problems. Furthermore, we elaborate on the potential of LLMs as hybrid optimizers. These insights and analyses unveil a host of unresolved questions that warrant further research.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Prompt Templates</title>
      <p>User Prompt :
Q :
I want to minimize the loss function using hill climbing. Generate neighboring solutions by either adding 1 or minus 1 to a
specific element in the current solution. The current solution is solution. Your answer includes two parts an explanation with
calculation and a list containing all neighbor solutions(eg. [(ŷ1, ŷ2,....), (ŷ1, ŷ2,....), ...]).</p>
      <p>A :
Explanation : Let’s think step by step ...</p>
      <p>List : [write neighbor solutions here]</p>
      <p>User Prompt :
Q :
You want to minimize an unknown MSE loss function by guessing the values of the ŷs. When you guess, you should take
consider of the past guessing result so that your new guess will have smaller loss than the past results. Pass guessing result are
{pass_result}. Base on the previous guesses, what is your next guess?
A :
(ŷ1, ŷ2,....) = [your answer]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain of thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>in: arXiv preprint arXiv:2201.11903</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Large language models as optimizers</article-title>
          ,
          <source>in: arXiv preprint arXiv:2309.03409</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cassano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Labash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gopinath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Reflexion: Language agents with verbal reinforcement learning</article-title>
          ,
          <source>in: arXiv:2303.11366</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Connecting large language models with evolutionary algorithms yields powerful prompt optimizers</article-title>
          ,
          <source>arXiv preprint arXiv:2309.08532</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.-T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2292</fpage>
          -
          <lpage>2307</lpage>
          ,
          Abu Dhabi, United Arab Emirates,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <article-title>Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>4812</fpage>
          -
          <lpage>4829</lpage>
          , Online.
          <source>Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Context-faithful prompting for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11315</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Renda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hopkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carbin</surname>
          </string-name>
          ,
          <article-title>Can LLMs generate random numbers? Evaluating LLM sampling in controlled domains</article-title>
          ,
          <source>in: ICML 2023 Workshop: Sampling and Optimization in Discrete Space</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Self-consistency improves chain of thought reasoning in language models</article-title>
          ,
          <source>arXiv preprint arXiv:2203.11171</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>