                                Towards Optimizing with Large Language Model
                                Pei-Fu Guo1,∗ , Ying-Hsuan Chen1,∗ , Yun-Da Tsai1 and Shou-De Lin1
                                1
                                    Department of Computer Science and Information Engineering, National Taiwan University


                                                Abstract
In this study, we evaluate the optimization capabilities of Large Language Models (LLMs) across diverse mathematical and combinatorial optimization tasks, where each task is described in natural language. These tasks require LLMs to iteratively generate and evaluate solutions through interactive prompting: each optimization step generates new solutions based on past results, which are then passed to subsequent iterations. We demonstrate that LLMs can carry out various optimization algorithms and act as effective black-box optimizers, capable of intelligently optimizing unknown functions. We also introduce three simple yet informative metrics for evaluating optimization performance that are applicable across diverse tasks and less sensitive to test sample variations. Our findings reveal that LLMs excel at optimizing small-scale problems with limited data, and that their performance is significantly affected by problem dimension and value range, highlighting the need for further research in LLM optimization.

                                                Keywords
                                                LLM Reasoning, LLM Optimization



1. Introduction

Large Language Models have demonstrated exceptional capabilities in reasoning across a variety of natural language-based tasks [1]. However, their potential extends beyond multiple-choice or single-question answering. This work explores LLMs' effectiveness in optimization across diverse tasks and problem dimensions. Optimization involves iteratively generating and evaluating solutions to improve a given objective function. Our research assesses LLM performance in interactive optimization, where each step generates new solutions based on previous ones and their values.
   We conduct our study with four different types of optimization algorithms: Gradient Descent, Hill Climbing, Grid Search, and Black-Box Optimization. To provide a comprehensive evaluation of LLM performance, we introduce three distinct metrics. These metrics offer a multifaceted view of task performance, are applicable across a broad spectrum of optimization tasks, and are less sensitive to sample variations.
   Our findings suggest that LLMs show impressive optimization capabilities, especially on small-scale problems. However, their performance is notably affected by factors such as sample size and value range. These observations underscore the need for further research on optimization tasks tailored to LLMs. It is important to note that our work does not aim to outperform state-of-the-art optimization algorithms for either mathematical optimization or combinatorial optimization problems. Instead, our goal is to showcase the potential of LLMs in these optimization domains and to identify their limitations in these settings.
   Our contributions are summarized as follows:

   • Exploring the potential of LLMs in mathematical and combinatorial optimization scenarios.
   • Introducing three novel metrics for assessing LLM performance in optimization tasks.
   • Delving into factors that influence LLM performance using our metrics, with a particular emphasis on the impact of problem dimension and task type.

   The remainder of this paper is structured as follows. In Section 2, we review prior work on LLMs for addressing optimization challenges. In Section 3, we define the four optimization algorithms used in our case studies. In Section 4, we demonstrate that LLMs with an iterative prompting strategy can function as optimizers. In Section 5, we present three metrics designed to assess the overall performance of LLMs on optimization tasks. Section 6 details our experimental results, showcasing the effectiveness of using LLMs as optimizers. In Section 7, we consolidate noteworthy observations and points of discussion from the experiments. Finally, Section 8 summarizes and concludes the paper.

KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Both authors contributed equally.
r12922217@csie.ntu.edu.tw (P. Guo); r12922044@csie.ntu.edu.tw (Y. Chen); f08946007@csie.ntu.edu.tw (Y. Tsai); sdlin@csie.ntu.edu.tw (S. Lin)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Works

In various optimization scenarios, the utilization of Large Language Models (LLMs) has become indispensable for the development of optimization algorithms or agent systems capable of handling complex and informative text-based feedback. In this section, we summarize three
significant related works that leverage LLMs to tackle optimization and reinforcement learning challenges. These works showcase the adaptability and effectiveness of LLMs in addressing optimization and learning challenges across various domains.
   Optimization by PROmpting (OPRO) [2] OPRO harnesses LLMs as versatile optimizers by describing optimization tasks in natural language prompts. It iteratively generates and evaluates solutions from these prompts, demonstrating superior performance on tasks such as linear regression and the traveling salesman problem. OPRO outperforms human-designed prompts by up to 50% on challenging tasks.
   Reflexion [3] Reflexion introduces a novel framework for training language agents that relies on linguistic feedback rather than traditional reinforcement learning. The framework delivers outstanding results, reaching 91% pass@1 accuracy on coding tasks, an 11% improvement over previous state-of-the-art models. Reflexion's success underscores the potential of linguistic feedback as a powerful training mechanism.
   EvoPrompt [4] EvoPrompt automates prompt optimization by connecting LLMs with evolutionary algorithms. This automated process surpasses human-designed prompts by up to 25% and outperforms existing automatic prompt generation methods by 14%. EvoPrompt's success highlights the relationship between Large Language Models and traditional algorithms, showcasing the potential for enhanced problem-solving capabilities through this synergistic fusion.

3. Problem Setting

We design four optimization tasks that require the model to algorithmically search for the optimal values of parameters. These tasks encompass Gradient-Descent, Hill-Climbing, Grid-Search, and Black-Box Optimization, each representing a distinct optimization domain: gradient-based, meta-heuristic, decision-theoretic, and Bayesian, respectively. In terms of parameter types, Grid-Search and Hill-Climbing involve discrete search spaces, while Gradient-Descent and Black-Box Optimization tackle continuous search spaces. Detailed information on each optimization task follows.
   Gradient-Descent assesses the model's proficiency in advanced calculations and its grasp of the principles of gradient descent. We instruct LLMs to undertake a conventional gradient descent optimization process based on the loss function they have defined. LLMs need to compute the gradient and update the parameters using the gradient information and the given learning rate.
   Hill-Climbing evaluates the LLM's capability to adhere to custom predefined rules it has not seen before. LLMs start with an initial solution and iteratively explore nearby solutions by making small incremental changes. In our task, neighboring solutions are generated by selecting a specific element within the solution and either increasing or decreasing it by one each time. The neighboring solution with the minimum loss is then chosen as the new solution and passed to the next iteration.
   Grid-Search assesses the LLM's ability to conduct exhaustive searches and locate optimal solutions within a predefined search space. LLMs are tasked with generating all grid points and systematically searching for the point that yields the lowest loss according to the given loss function.
   Black-Box Optimization evaluates the LLM's ability to make informed decisions and optimize in an abstract problem-solving context. We treat the LLM as a black box that tries to fit an unknown loss function. We provide the LLM with a limited set of solutions, each paired with its true loss value. The LLM's objective is to discover, on its own, new solutions in each iteration whose losses are lower than those of the existing solutions.

4. Methodologies

In this section, we show how LLMs, guided by iterative prompting, can effectively function as optimizers, akin to various optimization algorithms. To systematically navigate the search space, we introduce an iterative prompting framework that enables LLMs to incrementally reach better solutions through repeated iterations.
   We apply Chain-of-Thought and iterative prompting as our prompting method, so the LLM completes each step with reasoning as intermediate output. In each task (optimization algorithm), the LLM is first required to formulate the loss function from the given samples. Each optimization iteration then consists of two steps: (1) generate a new solution based on the algorithm instructions and past search results; (2) calculate the loss of the new solution and add the result to the prompt of the next iteration. We repeat these two steps until the stop criteria are met. Figure 1 shows an overview of how the LLM performs optimization in this interactive setting.
   To create an interactive environment, we utilize the chat mode of GPTs, where the entire conversation history serves as the prompt. This allows LLMs to retain memory of past search results and reasoning paths. New instructions are appended to the ongoing conversation record with each iteration. If the dialogue surpasses the token limit, earlier portions are removed.
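The two-step loop described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `query_llm` is a hypothetical helper standing in for a chat-mode API call (here faked by perturbing the best solution so far), and the quadratic `loss` is an assumed placeholder for the task-defined loss function.

```python
import random

def loss(solution):
    # Assumed placeholder loss: squared distance to a fixed optimum.
    target = [3.0, 7.0, 5.0]
    return sum((x - t) ** 2 for x, t in zip(solution, target))

def query_llm(history):
    # Hypothetical stand-in for the chat-mode LLM call: given the
    # conversation history of (solution, loss) pairs, propose a new
    # solution. Here we fake it by perturbing the best solution so far.
    best, _ = min(history, key=lambda pair: pair[1])
    return [x + random.uniform(-0.5, 0.5) for x in best]

def optimize(initial, iterations=10):
    # The history plays the role of the chat transcript: every
    # solution-loss pair is appended and visible at the next step.
    history = [(initial, loss(initial))]
    for _ in range(iterations):
        candidate = query_llm(history)                # step (1): generate
        history.append((candidate, loss(candidate)))  # step (2): evaluate, append
    return min(history, key=lambda pair: pair[1])

best, best_loss = optimize([0.0, 0.0, 0.0], iterations=20)
```

Keeping the full history in the prompt, as in the chat mode described above, is what lets each new proposal condition on all past solution-score pairs rather than only the most recent one.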
Figure 1: Overview of our prompting strategy. (1) LLMs formulate the loss function based on given samples. (2) Given algorithm instructions and past results, the LLM generates a new solution. (3) Calculate the loss of the new solution and add the solution-score pairs to the prompt of the next iteration. (4) Repeat the second and third steps until stop criteria are met.

5. Evaluation

We devised three novel metrics for the comprehensive evaluation of LLM capabilities. In this section, we explain the design and objective of each metric. These metrics offer versatility in assessing LLM performance across diverse tasks, making concurrent evaluation easier. Their reliance on ratios, rather than differences, makes them less sensitive to sample variations.

5.1. Goal Metric

The goal metric evaluates how effectively LLMs perform optimization. It provides a quantitative measure of the degree to which the LLM contributes to minimizing the loss function value; in other words, of whether the final solution loss is lower than the initial solution loss. We define the goal metric of a test sample j as:

    G_j = \frac{1}{N} \sum_{i=1}^{N} \frac{loss_{LLM,init} - loss_{LLM,i}}{loss_{LLM,init}}    (1)

where loss_{LLM,init} is the initial solution loss of sample j, loss_{LLM,i} is the LLM output loss of trial i, and N is the number of trials per sample. The higher the metric value, the greater the progress made in optimization. The goal metric plays a crucial role in our evaluation framework, particularly in scenarios where ground truth is absent, such as Black-Box optimization.

5.2. Policy Metric

The policy metric assesses the degree of alignment between the final model output and the ground truth. Beyond self-improvement, which is measured by the goal metric, it is also crucial to appraise the LLM's capability to operate in a manner consistent with our ground-truth algorithm. This metric serves as an indicator of the LLM's adeptness in adhering to task-specific instructions. We define the policy metric of a test sample j as:

    P_j = \frac{1}{N} \sum_{i=1}^{N} \frac{loss_{LLM,i} - loss_{truth}}{loss_{truth}}    (2)

where loss_{LLM,i} is the LLM output loss of trial i, loss_{truth} is the ground-truth loss of sample j, and N is the number of trials. Since the policy metric measures the disparity between the ground truth and the LLM's output, a lower policy metric value indicates a more effective alignment of the LLM's actions with the prescribed guidelines. When the value is negative, the LLM's performance surpasses the ground truth.

5.3. Uncertainty Metric

The uncertainty metric quantifies the variability of the LLM's solutions under identical conditions. Stability is a crucial characteristic in optimization tasks: we hope that the LLM produces identical results in every trial involving the same sample, even at temperatures greater than zero. We define the uncertainty metric of a test sample j as:

    U_j = \frac{1}{N} \sum_{i=1}^{N} (loss_{LLM,i} - \overline{loss}_{LLM})^2    (3)

where loss_{LLM,i} is the LLM output loss of the i-th trial, \overline{loss}_{LLM} is the mean of the trial outputs, and N is the number of trials. A stable LLM can be more readily trusted for tasks that demand consistent and reproducible results. In our case, if the language model truly understands the problem context, the final optimal output should be identical in every trial of the same sample.

6. Experiments

This section provides details of our experimental configurations and highlights the outcomes of the experiments. Subsection 6.1 outlines the process of generating synthetic datasets for all optimization tasks, subsection 6.2 describes the detailed settings of our experiments, and subsection 6.3 offers a concise summary of the outcomes.

6.1. Dataset

In the experiment, we create five datasets with d values chosen from the set {3, 6, 12, 24, 48} and generate instances belonging to [0, 10]^d in each dataset to examine
sensitivity to the number of parameters, i.e., the dimension of the optimization problem. For instance, d = 3 indicates that there are 3 variables in the loss function and that the dimension of the optimization problem is 3. We then apply each instance to a loss function and find the true solution for each parameter search task. These verified solutions, coupled with their associated losses, not only serve as the ground truth for the tasks but also act as a pivotal benchmark against which the solutions derived by LLMs are systematically evaluated and compared in the ensuing analysis.

6.2. Detailed Settings

In our experiments, we set the LLM temperature to 0.8 and leave the rest at default values. We performed 5 repetitions of the test for each instance in the dataset, with the LLM conducting 10 iterations of parameter search in each repetition. We excluded excessively biased results to prevent our metrics from being skewed by a minority of poorly performing test outcomes. All experiments employ the GPT-3.5-turbo '0613' version as the Language Model.

6.3. Main Results

We summarize the outcomes of our experiments and examine the common trends observed across all of them. In every plot, the x-axis displays the dimension of the optimization problem. In the goal metric and policy metric plots, the y-axis shows the average metric value for the respective task, while the lighter shaded area delineates the confidence interval of the metric, denoted as [value − std, value + std]. In the uncertainty metric plot, the y-axis shows the uncertainty metric value, which corresponds to the standard deviation of the LLM's final solution loss. It is worth noting that the Goal Metric graph excludes the Grid-Search task due to its non-iterative nature, while the Policy Metric graph omits the Black-Box task due to its unattainable ground truth.
   LLMs show strong optimization capabilities in small-scale problems. Our experiments test the comprehensive optimization capabilities of LLMs. As Figure 2 shows, GPT-3.5-turbo exhibits considerable optimization capability across various scenarios. Impressively, in the Gradient-Descent task, GPT-3.5-turbo even surpasses the ground truth, particularly when the sample dimension equals six. It is also surprising that the model achieves respectable results in the Grid-Search task, considering it must compute a vast number of grid points, a number that increases exponentially with the dimension of the problem. The model faces challenges in the Hill-Climbing task, evident from a policy metric significantly exceeding zero. This suggests that meta-heuristics may pose greater difficulty for LLMs compared to other tasks.

Figure 2: Goal Metric and Policy Metric hover from positive to near zero, signifying substantial optimization capability and alignment between the LLM's output and the ground truth.

   LLMs show potential as Black-Box optimizers. Favorable performance in the Black-Box experiments suggests that LLMs can serve as optimizers without being given any algorithm instructions. From Figure 3, we can see that GPT-3.5-turbo performs notably well when the dimension of the problem is three, whereas GPT-4 excels when the dimensions are three and six. Interestingly, as the dimension increases, the performance of both models gradually diminishes. Overall, GPT-4 edges out GPT-3.5-turbo by a slight margin in optimization and stability.
   LLMs exhibit strong performance in Gradient-Descent. The Gradient-Descent experiment tests the model's proficiency in advanced calculations and its grasp of mathematical principles. Figure 4 underscores this by revealing a policy metric that consistently hovers near zero, signifying a remarkable alignment between the LLM's output and the ground truth. Despite a decline in the goal metric as the sample size increases, the consistently low and
stable value of the policy metric underscores the fact that GPT's performance in the gradient-descent task is nearly on par with the ground-truth model.

Figure 3: Goal Metric reflects the performance of LLMs as Black-Box optimizers, showing strong performance on instances of smaller dimensions.

Figure 4: Low values in the Policy Metric and high positive values in the Goal Metric indicate the robust performance of the LLM in the gradient descent task.

7. Analysis and Discussion

In this section, we consolidate several crucial insights derived from our experimental results and subject them to analysis.
   Pretrained knowledge dominates the optimization capability of LLMs. Among all optimization tasks performed by LLMs, Gradient Descent emerges as the leading performer, while Hill-Climbing poses greater challenges. The main difference between the two tasks is that Hill-Climbing is a heuristic algorithm with more user-specific parameters, whereas gradient descent is an optimization algorithm that relies more on mathematical principles. This suggests that LLM optimization capabilities primarily stem from pretrained knowledge stored within the model parameters, rather than from context knowledge provided by users. Our findings align with previous research [5, 6, 7] showing that language models often prioritize their prior knowledge over new context. Achieving balanced attention to both prior and context knowledge is essential for further research aiming to improve the optimization capability of language models.
   LLMs are potential hybrid optimizers. The predominantly positive goal metric values across most tasks and datasets indicate LLMs' capability for optimization. This highlights their versatile capacity to optimize across different problem spaces, potentially allowing for switching between optimization methods within a single task. Such switching can help LLMs better explore the solution space and escape local optima where they might get stuck. This is a significant advantage of LLMs in optimization, as they can easily change methods through a simple natural language prompt during iterations. Furthermore, LLMs can act as agents (world models) that use different algorithms as tools (actions), switching methods by evaluating the optimization path from past to present (state). This adaptability underscores the potential of LLMs to enhance optimization processes through
dynamic method selection and strategic problem-solving.
   LLMs possess a richer solution space in small-scale problems. In our experiments, we observed high uncertainty metric values and significant variations in the policy and goal metrics when samples had smaller dimensions. Interestingly, LLMs tend to perform more effectively on smaller-dimension instances, suggesting a correlation between higher uncertainty and better performance. This consistent pattern across various tasks and models indicates that LLMs have a richer solution space when tackling small-scale problems. The expanded solution space leads to higher uncertainty, providing LLMs with a broader range of solutions to explore. This highlights the importance of dimension reduction in data preprocessing for effective optimization by LLMs. Figures 2 and 5 both highlight the pattern of uncertainty, where the uncertainty initially rises and then gradually decreases.

Figure 5: An initial rise followed by a decline in the Uncertainty Metric with instance dimension growth suggests LLMs may have a richer sample space for small-scale problems, consistent across tasks and models.

   LLMs are sensitive to numerical values. It is worth considering that the aforementioned results may be influenced by the inherent randomness in the generation of test samples. Previous research has indicated that LLMs may demonstrate preferences for particular numbers, words, and symbols [8], which can introduce a level of bias in their responses. Given the high sensitivity of LLMs to the input prompt, the initial starting points and data provided can exert a significant influence on their outputs. In essence, the impact of instruction description and data initialization should be carefully considered when interpreting the results of LLM-based experiments to ensure a more accurate assessment of their performance.
   Self-consistency prompting improves stability. In the Gradient-Descent task, we employ the self-consistency technique [9], where we conduct five repetitions of each iteration and select the solution that emerges most frequently. From Figure 6, we can see that GPT-4's performance increases substantially, and the confidence intervals for both the policy metric and the goal metric narrow, indicating improved stability and reliability. Nonetheless, this approach does not yield favorable outcomes when applied to GPT-3.5-turbo. This suggests the need for further investigation into variance reduction.

8. Conclusion and Future Directions

In this paper, we present an in-depth examination of Large Language Models within the realm of optimization, where the LLM progressively generates new solutions to optimize an objective function. We investigate LLMs' performance across four optimization tasks that require comprehension of algorithmic instructions and the ability to generate new solutions based on previous solutions and their corresponding values.
   Our evaluation shows that LLMs exhibit optimization prowess across diverse domains. Among the four tasks we examined, LLMs show their greatest strength in the Gradient-Descent task, displaying remarkable proficiency in this area. However, they encounter more pronounced difficulties in the meta-heuristics task, where they must adhere to predefined rules that they have not encountered previously. Furthermore, LLMs demonstrate impressive skills in the grid search task, showcasing their ability to conduct exhaustive searches effectively. In the Black-Box task, LLMs excel, particularly when dealing with limited sample sizes, suggesting inherent optimization abilities within them.
   We also consolidate several crucial insights derived from our experimental results and subject them to analysis. We find that pretrained knowledge dominates the optimization capability of LLMs, and that they possess a richer solution space in small-scale problems. Furthermore, we elaborate on the potential of LLMs as hybrid optimizers. These insights and analyses unveil a host of
                                                               [5] H.-T. Chen, M. Zhang, E. Choi, Rich knowledge
                                                                   sources bring complex knowledge conflicts: Recal-
                                                                   ibrating models to reflect conflicting evidence, in:
                                                                   Proceedings of the 2022 Conference on Empirical
                                                                   Methods in Natural Language Processing, pages
                                                                   2292–2307, Abu Dhabi, United Arab Emirates, Asso-
                                                                   ciation for Computational Linguistics., 2022.
                                                               [6] A. Pagnoni, V. Balachandran, Y. Tsvetkov, Under-
                                                                   standing factuality in abstractive summarization
                                                                   with frank: A benchmark for factuality metrics, in:
                                                                   Proceedings of the 2021 Conference of the North
                                                                   American Chapter of the Association for Computa-
                                                                   tional Linguistics: Human Language Technologies,
                                                                   pages 4812–4829, Online. Association for Computa-
                                                                   tional Linguistics, 2021.
                                                               [7] W. Zhou, S. Zhang, H. Poon, M. Chen, Context-
                                                                   faithful prompting for large language models, in:
                                                                   ArXiv, abs/2303.11315, 2023.
                                                               [8] A. Renda, A. Hopkins, M. Carbin, Can llms generate
                                                                   random numbers? evaluatingllm sampling in con-
                                                                   trolled domains, in: ICML 2023 Workshop: Sampling
                                                                   and Optimization in Discrete Space, 2023.
                                                               [9] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi,
                                                                   D. Zhou, Self-consistency improves chain of thought
                                                                   reasoning in language models, in: arXiv preprint
                                                                   arXiv:2203.11171, 2022.



Figure 6: The confidence intervals for both the policy and
goal metrics of GPT-4 narrow, indicating improved stability.
A negative policy metric with a high goal metric signifies
significant outperformance of the ground truth model with
six-dimensional instances.



unresolved questions that warrant further research.


References
[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi,
    Q. Le, D. Zhou, Chain of thought prompting elic-
    its reasoning in large language models, in: arXiv
    preprint arXiv:2201.11903, 2022.
[2] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou,
    X. Chen, Large language models as optimizers, in:
    arXiv preprint arXiv:2309.03409, 2023.
[3] N. Shinn, F. Cassano, B. Labash, A. Gopinath,
    K. Narasimhan, S. Yao, Reflexion: Language
    agents with verbal reinforcement learning, in:
    arXiv:2303.11366, 2023.
[4] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu,
    J. Bian, Y. Yang, Connecting large language mod-
    els with evolutionary algorithms yields powerful
    prompt optimizers, in: arXiv:2309.08532, 2023.
A. Prompt Templates

 User Prompt:
 Q:
 Given the data points (y1, y2, ...) = {data}, what is the MSE loss function with respect to the ŷs for a hypothetical set of
 predicted ŷs values?
 A:
 The MSE loss function for the given data points (y1, y2, ...) = {data} with respect to ŷs is:...


Figure 7: Example prompt for getting objective function.
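As a reference for the templates that follow, the MSE objective the LLM is asked to write out can be sketched in Python (the helper name `mse` is ours, not part of the paper's prompts):

```python
def mse(y_hat, y):
    """Mean squared error between predicted values y_hat and data points y."""
    assert len(y_hat) == len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

# Example with the data points used later in the appendix, (y1, y2, y3) = (2, 6, 0):
print(mse([2, 3, 4], [2, 6, 0]))  # (0 + 9 + 16) / 3 = 25/3
```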



 User Prompt:
 Q:
 Please minimize the loss function using gradient descent with a learning rate of 0.1 at the point (ŷ1, ŷ2, ŷ3, .....) = {point}. What is
 the point we eventually end up at after one update? Your answer should include two parts: an explanation with calculations and a
 short answer with the result.
 A:
 Explanation : Let's think step by step ...
 Short Answer: After calculation, the next update point is (ŷ1𝑛𝑒𝑤 , ŷ2𝑛𝑒𝑤 , ŷ3𝑛𝑒𝑤 , .....) = ...


Figure 8: Example prompt for Gradient-Descent.
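The ground-truth update this template asks the LLM to reproduce is a single gradient-descent step on the MSE loss, whose gradient is (2/n)(ŷi − yi). A minimal sketch (the function name `gd_step` is ours):

```python
def gd_step(y, y_hat, lr=0.1):
    # One update on L = (1/n) * sum((yhat_i - y_i)^2),
    # using the gradient dL/dyhat_i = (2/n) * (yhat_i - y_i).
    n = len(y)
    return [p - lr * (2 / n) * (p - t) for p, t in zip(y_hat, y)]

# Data points (2, 6, 0), starting point (2, 3, 4), learning rate 0.1:
# y1 stays at 2.0; y2 -> 3.2; y3 -> 4 - 0.8/3 ≈ 3.733
print(gd_step([2, 6, 0], [2, 3, 4]))
```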



 User Prompt :
 Q:
 I want to do a grid search on the ŷs, where each ranges over the integers from {low_bound} to {high_bound}. Generate all
 possible combinations of ŷs values from the specified range.
 What are the combinations? Your answer should include two parts: an explanation with calculations and a list containing all the
 combinations.
 A:
 Explanation : Let's think step by step ...
 List : [write all the combinations here]


Figure 9: Example prompt for Grid-Search (Create Grid Points)
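The grid this template asks the LLM to enumerate is the Cartesian product of the integer range over each coordinate; a sketch using `itertools.product` (the helper `make_grid` is our naming):

```python
import itertools

def make_grid(low, high, dim):
    # All integer combinations of `dim` coordinates, each in [low, high]
    return list(itertools.product(range(low, high + 1), repeat=dim))

grid = make_grid(2, 3, 3)
print(len(grid))  # 2 values per coordinate, 3 coordinates -> 8 combinations
```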



 User Prompt :
 Q:
 For every combination of ŷs, calculate its MSE loss. Which combination has the smallest MSE loss? Your answer should include
 two parts: an explanation with calculations and a list containing the combination with the smallest MSE loss.
 A:
 Explanation : Let's think step by step ...
 List : [write the combination with the smallest MSE loss]


Figure 10: Example prompt for Grid-Search (Select)
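The selection step amounts to scoring every grid point with the MSE and taking the argmin; a reference sketch, assuming the loss is computed exactly (helper names are ours):

```python
import itertools

def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def select_best(candidates, y):
    # Exhaustively score every candidate and keep the smallest-loss one
    return min(candidates, key=lambda c: mse(c, y))

grid = itertools.product(range(2, 4), repeat=3)  # integers 2..3 in each coordinate
print(select_best(grid, (2, 6, 0)))  # -> (2, 3, 2), loss (0 + 9 + 4) / 3
```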
 User Prompt :
 Q:
 I want to minimize the loss function using hill climbing. Generate neighboring solutions by either adding 1 to or subtracting 1
 from a specific element of the current solution. The current solution is {solution}. Your answer should include two parts: an
 explanation with calculations and a list containing all neighbor solutions (e.g. [(ŷ1, ŷ2,....), (ŷ1, ŷ2,....), ...]).
 A:
 Explanation : Let's think step by step ...
 List : [write neighbor solutions here]


Figure 11: Example prompt for Hill-Climbing Prompt (Generate Neighbors)
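The neighborhood this template describes, a ±1 change to a single element, can be sketched as (the helper name `neighbors` is ours):

```python
def neighbors(solution):
    # All solutions reachable by adding 1 to or subtracting 1 from one element
    out = []
    for i in range(len(solution)):
        for delta in (1, -1):
            nb = list(solution)
            nb[i] += delta
            out.append(tuple(nb))
    return out

print(neighbors((10, 10, 10)))  # 6 neighbors of a 3-dimensional solution
```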


 User Prompt :
 Q:
 For every neighbor solution, calculate its MSE loss. Which neighbor has the smallest MSE loss? Your answer should include two
 parts: an explanation with calculations and a list containing the neighbor solution with the smallest MSE loss (e.g. [(ŷ1, ŷ2,....)]).
 A:
 Explanation : Let's think step by step ...
 List : [write the best neighbor solution here]


Figure 12: Example prompt for Hill-Climbing Prompt (Select)
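Putting the two hill-climbing sub-steps together, the full loop the LLM is asked to emulate repeats generate-then-select until no neighbor improves the loss; a self-contained sketch under our naming:

```python
def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def neighbors(solution):
    # ±1 perturbation of one element at a time
    out = []
    for i in range(len(solution)):
        for delta in (1, -1):
            nb = list(solution)
            nb[i] += delta
            out.append(tuple(nb))
    return out

def hill_climb(start, y):
    # Repeat generate-neighbors / select-best until no neighbor improves the loss
    current = tuple(start)
    while True:
        best = min(neighbors(current), key=lambda s: mse(s, y))
        if mse(best, y) >= mse(current, y):
            return current
        current = best

print(hill_climb((10, 10, 10), (2, 6, 0)))  # converges to the data points (2, 6, 0)
```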


 User Prompt :
 Q:
 You want to minimize an unknown MSE loss function by guessing the values of the ŷs. When you guess, you should take the
 past guessing results into consideration so that your new guess will have a smaller loss than the past results. Past guessing
 results are {pass_result}. Based on the previous guesses, what is your next guess?
 A:
 (ŷ1, ŷ2,....) = [your answer]


Figure 13: Example prompt for Black-Box
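The `{pass_result}` placeholder in this template is a comma-separated history of past guesses and their losses; a hypothetical helper (our naming, not from the paper) that renders it in the format shown in Figure 17:

```python
def format_history(history):
    # history: list of (guess, loss) pairs -> "f(2, 4, 6) = 10, f(1, 5, 0) = 2"
    return ", ".join(
        f"f({', '.join(str(v) for v in guess)}) = {loss}" for guess, loss in history
    )

print(format_history([((2, 4, 6), 10), ((1, 5, 0), 2)]))
```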


 User Prompt :
 *Define Loss function
 User :
 Given the data points (y1, y2, y3) = (2,6,0), what is the MSE loss function with respect to the ŷs for a hypothetical set of
 predicted ŷs values?
 LLM :
 The MSE loss function for the given data points (y1, y2, y3) = (2,6,0) with respect to ŷs is: (1/3)[(ŷ1 − 2)² + (ŷ2 − 6)² + (ŷ3 − 0)²]
 *First Iteration
 User :
 Please minimize the loss function using gradient descent with a learning rate of 0.1 at the point (ŷ1, ŷ2, ŷ3) = (2,3,4). What is
 the point we eventually end up at after one update? Your answer should include two parts: an explanation with calculations and
 a short answer with the result.
 LLM :
 Explanation : To minimize the Mean Squared Error (MSE) loss function using gradient descent ...
 Short Answer: After calculation, the next update point is (ŷ1𝑛𝑒𝑤 , ŷ2𝑛𝑒𝑤 , ŷ3𝑛𝑒𝑤 ) = (1.92, 3.05, 3.94)
 *Second Iteration
 User :
 Please minimize the loss function using gradient descent with a learning rate of 0.1 at the point (ŷ1, ŷ2, ŷ3) = (1.92, 3.05, 3.94).
 What is the point we eventually end up at after one update? Your answer should include two parts: an explanation with
 calculations and a short answer with the result.
 LLM :
 Explanation : To minimize the Mean Squared Error (MSE) loss function using gradient descent ...
 Short Answer: After calculation, the next update point is (ŷ1𝑛𝑒𝑤 , ŷ2𝑛𝑒𝑤 , ŷ3𝑛𝑒𝑤 ) = (1.81, 3.12, 3.41)...


Figure 14: Iterative prompt for Gradient-Descent
 User Prompt :
 *Define Loss function
 User :
 Given the data points (y1, y2, y3) = (2,6,0), what is the MSE loss function with respect to the ŷs for a hypothetical set of
 predicted ŷs values?
 LLM :
 The MSE loss function for the given data points (y1, y2, y3) = (2,6,0) with respect to ŷs is: (1/3)[(ŷ1 − 2)² + (ŷ2 − 6)² + (ŷ3 − 0)²]
 *Create Grid Points
 User :
 I want to do a grid search on the ŷs, where each ranges over the integers from 2 to 3. Generate all possible combinations of ŷs
 values from the specified range. What are the combinations? Your answer should include two parts: an explanation with
 calculations and a list containing all the combinations.
 LLM :
 Explanation : To generate all the grid points, you should ...
 List : [(2, 2, 2), (2, 2, 3), (2, 3, 2), (2, 3, 3), (3, 2, 2), (3, 2, 3), (3, 3, 2), (3, 3, 3)]
 *Select Best
 User :
 For every combination of ŷs, calculate its MSE loss. Which combination has the smallest MSE loss? Your answer should include
 two parts: an explanation with calculations and a list containing the combination with the smallest MSE loss.
 LLM :
 Explanation : To select the grid points with smallest MSE value ...
 List : [(2, 3, 3)]...


Figure 15: Iterative prompt for Grid-Search




 User Prompt :
 *Define Loss function
 User :
 Given the data points (y1, y2, y3) = (2,6,0), what is the MSE loss function with respect to the ŷs for a hypothetical set of
 predicted ŷs values?
 LLM :
 The MSE loss function for the given data points (y1, y2, y3) = (2,6,0) with respect to ŷs is: (1/3)[(ŷ1 − 2)² + (ŷ2 − 6)² + (ŷ3 − 0)²]
 *Generate Neighbors
 User :
 I want to minimize the loss function using hill climbing. Generate neighboring solutions by either adding 1 to or subtracting 1
 from a specific element of the current solution. The current solution is (10,10,10). Your answer should include two parts: an
 explanation with calculations and a list containing all neighbor solutions (e.g. [(ŷ1, ŷ2,....), (ŷ1, ŷ2,....), ...]).
 LLM :
 Explanation : To generate all the neighbors ...
 List : [(11, 10, 10), (10, 11, 10), (10, 10, 11), (9, 10, 10), (10, 9, 10), (10, 10, 9)].
 *Select Best
 User :
 For every neighbor solution, calculate its MSE loss. Which neighbor has the smallest MSE loss? Your answer should include two
 parts: an explanation with calculations and a list containing the neighbor solution with the smallest MSE loss (e.g. [(ŷ1, ŷ2,....)]).
 LLM :
 Explanation : To select the neighbor solution with smallest MSE value ...
 List : [(9, 10, 10)]


Figure 16: Iterative prompt for Hill-Climbing
 User Prompt :
 *Define Loss function
 User :
 Given the data points (y1, y2, y3) = (2,6,0), what is the MSE loss function with respect to the ŷs for a hypothetical set of
 predicted ŷs values?
 LLM :
 The MSE loss function for the given data points (y1, y2, y3) = (2,6,0) with respect to ŷs is: (1/3)[(ŷ1 − 2)² + (ŷ2 − 6)² + (ŷ3 − 0)²]
 *First Iteration
 User :
 You want to minimize an unknown MSE loss function by guessing the values of the ŷs. When you guess, you should take the
 past guessing results into consideration so that your new guess will have a smaller loss than the past results. Past guessing
 results are f(2, 4, 6) = 10, f(1, 5, 0) = 2. Based on the previous guesses, what is your next guess?
 LLM :
 (ŷ1, ŷ2, ŷ3) = [1, 3, 5]
 *Second Iteration
 User :
 You want to minimize an unknown MSE loss function by guessing the values of the ŷs. When you guess, you should take the
 past guessing results into consideration so that your new guess will have a smaller loss than the past results. Past guessing
 results are f(2, 4, 6) = 10, f(1, 5, 0) = 2, f(1, 3, 5) = 5. Based on the previous guesses, what is your next guess?
 LLM :
 (ŷ1, ŷ2, ŷ3) = [1, 1, 1]


Figure 17: Iterative prompt for Black-Box Optimization