<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rosina O Weber</string-name>
          <email>rosina@drexel.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher B Rauch</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Savar Amin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering, University of Maryland</institution>
          ,
          <addr-line>MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science, Drexel University</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Science, Drexel University</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>30</volume>
      <issue>2025</issue>
      <abstract>
        <p>This paper describes a study aimed at determining whether large language models (LLMs) demonstrate that they know how problems and solutions connect. This question is part of formulating the more general question of whether current LLMs are capable of constructing new solutions to previously unseen problems. Our motivation comes from a 2019 award-winning challenge that artificial intelligence (AI) algorithms should be capable of examining problems to creatively imagine and evaluate solutions to those problems. For studying this general question, we adopt a model of decision making which shows that imagining solutions to problems is part of decision making, thus rephrasing our question to ask whether LLMs can execute decision-making. We conclude that although LLMs can generate content in answer to potentially any question, their responses lack precision and can hence benefit from a case-based reasoning (CBR) module. On the other hand, CBR can benefit from LLMs' natural language generation to learn cases and associations between problems and solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>decision-making</kwd>
        <kwd>problem-solving</kwd>
        <kwd>case-based reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction and Background</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org), France</p>
      <p>[Figure 1: The decision-making process (Intelligence, Design, Choice) embedded within the problem-solving model, which adds Implementation and Monitoring.]</p>
      <p>Current CBR algorithms do not explicitly represent the connection between problems and solutions
in cases because what determines whether a previous case can be reused is captured in the similarity
and the adaptation containers [3]. The goal of the similarity container is to retain functions that use
knowledge to assess whether two cases are similar to the extent that their solutions can be exchanged
with or without adaptation. When adaptation knowledge is available, then it also entails elements of
the connection between problems and solutions. One direction of this work would be to investigate
whether the CBR paradigm could be extended to more explicitly include those three steps. The other
direction would be to fine-tune language models or incorporate a CBR module to explicitly carry out
these steps. An extension of the CBR paradigm would be yet another opportunity to explore synergies
between CBR and LLMs [4].</p>
      <p>As we approach the year 2026 referenced in the 2026 Idea Machine Grand Prize, we ask whether 2025’s
AI algorithms can examine problems and creatively imagine and evaluate solutions to those problems.
Given the recent radical change in performance of LLMs, it is the purpose of this paper to examine
whether LLMs have achieved this ambitious goal. This is the first step of exploring the combination
of CBR and LLMs for this goal. To do this, we need to carefully formulate the problem based on a
theoretical model of problem solving. We propose Simon’s [5] and Huber’s [6] decision-making and
problem-solving model. We depict this model in Figure 1.</p>
      <p>This model has two parts. The first is the decision-making process; the second, which entails
decision-making, is the problem-solving model. The idea is that to solve a problem, it is necessary to first make
a decision about what strategy to adopt to solve it. Then, the solution to a problem is carried out by
implementing the strategy selected in the decision. The decision-making process reduces
decision-making to three steps, namely, Intelligence, Design, and Choice. Decision makers gather information
about the problem (i.e., Intelligence), generate potential strategies on how to solve said problem (i.e.,
Design), and select what they consider to be the optimal strategy (i.e., Choice). This process is believed
to have been proposed by Simon in the 1950s (e.g., [7]), although it was only published later [5]. The two final
steps of problem solving were proposed by Huber [6], completing the model. These are the steps
Implementation and Monitoring, which are required to move from the decision to delivering the solution
to the problem.</p>
      <p>In the remainder of this paper, we will consider the decision-making process as our theoretical
model guiding our analysis of whether LLMs can examine problems and creatively imagine and evaluate
solutions to those problems. In other words, whether LLMs can perform Intelligence, Design, and
Choice.</p>
      <p>It is important to note that the purpose of this work is to assess the stage at which LLMs are at this
current moment in time from the surface level. We distinguish performance of tasks at the surface
level from true reasoning abilities, which would require specialized psychological tests (e.g., [8, 9]). The
execution of tasks at the surface level may simply be the result of memorized weights and learned
patterns [10]. It is well known that LLMs are next-token predictors based on transformers and do
not include any principled reasoning [11]. The mere presentation of plausible rationales cannot be
considered evidence of reasoning [10], such as when a case solution is provided based on retrieval and
reuse but the system is not able to decompose that task to define what aspects of the problem are solved
by the proposed solution. In CBR, retrieval and reuse propose a new solution by analogy, but they do
not explore problem facets and imagine strategies to account for each facet. The main study in this
paper is to assess the ability of LLMs to describe the connection between problem and solution. For
example, consider a user asking a model for help in solving a problem they detected with their washing
machine, which is leaking. Some LLMs we tested (e.g., Claude, Gemini) currently respond with a list of
steps that the user can implement to fix the problem. Then, when the user follows up with the question,
"How will these steps solve my problem?", the LLMs respond with text that mostly repeats those same
steps. What we expect at a minimum is that the connection between problem and solution is described
at a level different from the listed steps. For the leaking machine, this would require making reference
to the water leak stemming from the machine not being fully insulated to allow water to be expelled
from it, which would indicate examination of the problem.</p>
      <sec id="sec-2-1">
        <title>1.1. Can LLMs perform Intelligence, Design, and Choice?</title>
        <p>Intelligence LLMs' strongest skill seems to be Intelligence. LLMs are typically trained on data from
the web and from both fiction and non-fiction publications, allowing them to connect topics based on
the representations they learn. They seem to perform better on more frequent topics (e.g., [12]), but their
competence in gathering information about a problem does not seem to require further examination.</p>
        <p>Design Can LLMs design alternative solutions? Apparently, yes, because whenever we pose a
problem, LLMs often respond by listing a series of alternative steps. However, designing alternative
solutions, as per the motivation from the prize winner referred to above, requires knowing how a solution
connects to a problem at a model level, where the problem is contextualized within a system, making
clear what needs to change in order to solve it. In the example of the washing machine, it requires the
comprehension that the faulty subsystem is the hydraulic one, not the electrical one, and that water
moving from inside to outside characterizes the problem, so that a solution means the water would
not move from inside to outside. Understanding this relationship is crucial for identifying
novel solutions. With previously unseen problems, it is necessary for an algorithm to comprehend what
the problem entails, what systems it subsumes, how they function, and how a given strategy could
directly address the specific malfunction to solve it, such as insulating the point of origin of the leak
in the washing machine example. Section 2.1 studies the Design step.</p>
        <p>Choice The last question determining whether LLMs can execute decision making refers to the Choice
step. We ask, "Can LLMs select the most rational (i.e., optimal) strategy among a set of alternative
solutions to a problem?" In this aspect, we want to determine whether, when selecting the optimal solution,
they consider the value of each alternative by examining the implications of applying the
solutions to the problem and determining their expected outcomes.</p>
        <p>In the next section, we assess whether current LLMs can describe how a solution connects to a
problem in a way that at least differs from their ability to retrieve solutions to problems. We then
discuss anecdotal examples of LLMs executing the Choice step. With this information, in future work
we will be able to advance toward the ultimate goals mentioned above.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Studies</title>
      <sec id="sec-3-1">
        <title>2.1. Can LLMs perform Design as per Simon’s decision-making process?</title>
        <p>As introduced in the previous section, this study investigates whether LLMs can describe how a solution
connects to a problem. To decrease vagueness, we propose to study this by posing two consecutive
prompts to LLMs. The first is of the form, "My [part] [is/are] [faulty, expression of a problem]. What
should I do?". This generic prompt template accommodates the first prompt in four different problem
topics: problems with cars (e.g., "My 1986 Toyota Corolla transmission is faulty. What should I do?"), a
person's pain (e.g., "My arm hurts. What should I do?"), defective computers (e.g., "My Apple MacBook
Pro won't start. What should I do?"), and hiring needs (e.g., "I need to hire a contracts administrator.
What should I do?"). The second prompt is a follow-up, identical for all initial prompts, that asks,
"How will these steps solve my problem?". These two prompts together provide enough input to examine
whether LLMs know how problem and solution are connected: 1) the response to the second prompt should
be sufficiently different from the response to the first, and 2) novel content produced in the response to
the second prompt but not in the first should demonstrate an analysis of the problem at a more specific
level than in the response to the first prompt, with the aspects of the problem analyzed connected to
aspects of the solution.</p>
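        <p>The two-prompt protocol above can be sketched as a small script. The prompt texts are taken verbatim from the paper; the function and dictionary names are our own illustration.</p>

```python
# Sketch of the two-prompt study protocol. Prompt texts come from the paper;
# names (FIRST_PROMPTS, prompt_pairs) are hypothetical.

FOLLOW_UP = "How will these steps solve my problem?"

# First prompts instantiating the generic template
# "My [part] [is/are] [faulty, expression of a problem]. What should I do?"
FIRST_PROMPTS = {
    "car":  "My 1986 Toyota Corolla transmission is faulty. What should I do?",
    "body": "My arm hurts. What should I do?",
    "comp": "My Apple MacBook Pro won't start. What should I do?",
    "job":  "I need to hire a contracts administrator. What should I do?",
}

def prompt_pairs():
    """Yield (domain, first_prompt, follow_up) triples for the study."""
    for domain, first in FIRST_PROMPTS.items():
        yield domain, first, FOLLOW_UP
```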
        <p>Metrics Given our interest in assessing how distinguishable the responses to the second prompt are
from the responses to the first, our focus is on repeated words and their variations. Among the many
metrics currently available, particularly those based on embeddings, we choose the edit distance over
3-gram representations because a sequence of three letters often matches stems of tokens, providing
a good analysis of which words repeat. This study is mostly interested in repeated words at this
quantitative stage, making this metric ideal. Embedding-based distance metrics do not examine the
exact words but those semantically related, and they need to be normalized to the length of the passages,
which can dilute the value of small differences, making them seem smaller than the differences of interest
in this study.</p>
        <p>We compute the edit distance over the 3-gram (e.g., [3]) representation of the text. A 3-gram
representation of the passage "these steps" becomes: "the-hes-ese-se_-e_s-_st-ste-tep-eps" (with underscores
denoting spaces). The edit
distance [13] is a distance metric between two sequences that computes the number of insertions,
deletions, and substitutions required to turn one sequence into the other. The result is that a small
distance indicates that the number of common 3-grams between the passages is high. The results are
presented in a chart and two tables with average, standard deviation, minimum, and maximum edit
distance over 3-gram representations for 300 instances in each problem type for each tested model.
Hypothesis. Our pilot tests revealed the distances between the responses of the second and first
prompts to be around 0.19 or 0.2. Hence, our hypothesis is as follows: Hypothesis H1: The average edit
distance between the responses of the two prompts described in the previous paragraph given by an
LLM is equal to or lower than 0.2. This corresponds to the null hypothesis H0: The average edit distance
between the responses of the two prompts given by an LLM is greater than 0.2. Consequently, we
expect to demonstrate that the edit distance is not, on average, more than 0.2. This hypothesis is
complemented by an analysis of the text of the responses to the second prompt of selected samples to
determine whether they indicate language that describes how problem and solution are connected.</p>
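        <p>A minimal sketch of this metric: Levenshtein distance computed over the sequences of character 3-grams of the two responses. The paper does not specify its normalization, so dividing by the longer 3-gram sequence (yielding values in [0, 1]) is our assumption.</p>

```python
def ngrams(text, n=3):
    """Character n-grams of the text (a sliding window of length n)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def edit_distance(a, b):
    """Levenshtein distance between two sequences, where insertions,
    deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]

def trigram_distance(resp1, resp2):
    """Edit distance over 3-gram representations, normalized to [0, 1]
    by the longer sequence (the normalization is our assumption)."""
    g1, g2 = ngrams(resp1), ngrams(resp2)
    return edit_distance(g1, g2) / max(len(g1), len(g2), 1)
```

Identical responses yield a distance of 0.0, and fully disjoint responses approach 1.0, matching the interpretation that a small distance indicates many shared 3-grams.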
        <p>LLMs tested. We test our hypotheses with six models available via APIs that represent a variety
of models from 2025. Earlier models would not provide for a valuable analysis. We did not use any
experimental model given their potentially reduced reproducibility. Table 1 lists the main specifications
of the models used. Table 2 lists the models’ reasoning capabilities according to their manufacturers.
Of all models, OpenAI’s o3 and Anthropic’s Claude 3.7 Sonnet are considered large reasoning models
(LRMs)–a term often used to describe LLMs that adopt strategies that break down tasks in multiple
subtasks (e.g., [14, 15, 16]). However, Claude 3.7 Sonnet can be used with and without reasoning
and the experiments used Claude 3.7 Sonnet without reasoning. We used 0.0 for temperature setting
for all models, except for GPT-o3, which does not allow setting it. Scripts and results are available at
the GitHub link.</p>
        <p><bold>2.1.1. Results and Discussion</bold></p>
        <p>As we examine the results from the models, GPT-4.1 (Figure 2) shows an average edit distance that
disproves the hypothesis in three problem domains, and Gemini 1.5 Pro reaches an average edit distance
above 0.2 in one problem domain. The majority of the models do not competently answer the second
prompt and rather repeat most of the response to the first prompt. We recall that the second prompt asks
whether the model can reveal the connection between problem and solution. However, when we
examine the texts of the results, we see that even at low levels of edit distance, 2025's models seem
to indicate at least some connection between problem and solution. When we examine the maximum
distance between the responses of the first and second prompts, the highest values come from Gemini
1.5 Pro, 0.25 (Table 3); GPT-4.1, 0.27 (Table 4); and GPT-o3, 0.28. Gemini 1.5 Pro is superior to Gemini
2.0 Flash (Table 3), as the latter is optimized to be faster. GPT-4.1 is a newer generation of GPT-4o, and
GPT-o3 is an LRM.
Domain   car   body  comp  job   car   body  comp  job   car   body  comp  job
Average  0.22  0.23  0.21  0.20  0.17  0.17  0.17  0.19  0.19  0.18  0.18  0.17
St dev   0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.02  0.02  0.02  0.02
Min      0.18  0.18  0.17  0.16  0.14  0.15  0.14  0.16  0.15  0.14  0.13  0.13
Max      0.26  0.27  0.25  0.23  0.21  0.21  0.20  0.23  0.28  0.26  0.27  0.26</p>
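        <p>The row statistics reported in the tables (average, standard deviation, minimum, maximum) can be reproduced per domain with a short helper. The distances below are hypothetical placeholders, not the study's data, and the H1 check is a naive threshold comparison rather than a significance test.</p>

```python
import statistics

def summarize(distances, threshold=0.2):
    """Aggregate normalized 3-gram edit distances for one domain/model
    and check Hypothesis H1 (average <= 0.2) naively."""
    avg = statistics.mean(distances)
    return {
        "Average": round(avg, 2),
        "St dev": round(statistics.stdev(distances), 2),
        "Min": round(min(distances), 2),
        "Max": round(max(distances), 2),
        "H1 holds": avg <= threshold,  # no significance test here
    }

# Hypothetical distances for illustration only
sample = [0.18, 0.21, 0.19, 0.17, 0.20]
stats = summarize(sample)
```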
        <p>To illustrate these results, we show excerpts from one sample from model GPT-4o where the distance
is very small, which motivated our hypothesis: the response to the second prompt mostly repeats
the words from the response to the first. Figure 3 shows the sample with the smallest distance. It is
from the body-problem responses, where the edit distance over 3-grams is 0.14 (Table 4).</p>
        <p>Figure 4 shows parts of the sample "My sister has excruciating stiffness in the neck. What should
I do?" The distance for this sample is 0.27 (Table 4). Although our hypothesis was debunked, the 0.2
threshold and the metric seem like a good indication of the distinction between model performances.
While in Figure 3 the response to the second prompt vastly repeats the words from the response to
the first, Figure 4 reveals a response to the second prompt that does a much better job of showing the
association between problem and solution. We may also notice that the first prompts we used in the
experiment are generic and do not indicate a specific problem. We chose generic problems because
those typically receive multiple solutions in response, whereas detailed problems receive fewer solutions.
With fewer solutions, we would have fewer opportunities to ask about the connection between problems
and solutions. The problems we chose were adequate to show that, in terms of declarative knowledge,
2025's models can reveal connections between problems and solutions.</p>
        <p>Choice would require another study to determine generalizable results. We conducted some preliminary
tests, and hence the discussions in this section are to be considered anecdotal. When we asked 2025's
models to make a choice between multiple strategies to solve a problem, we observed some variations
in the patterns that seem consistent with the type of problem domain.</p>
        <p>When the problem is medical, we observe that models insist that only a medical professional can diagnose
and treat the problem, but persistent prompting sometimes leads to a diagnosis or home tests.1 These
models may not make a choice in medical domain problems due to alignment procedures, making
this domain inadequate for this analysis.</p>
        <p>In the car faults domain, we provided a specific problem of an old car with a cracked head gasket.
In the second prompt, we stated we wanted to install a refurbished head gasket, which is typically not
recommended. We then indicated that there had been an atomic explosion and we could not get a new part.
We conducted this exchange with Gemini 2.0 Flash (LLM) and GPT-o3 (LRM). The exchanges were quite
distinct, but both suggested the ideal strategies plus a series of recommendations. However, GPT-o3
provided many more recommendations, including how to drive the car, whereas Gemini 2.0 Flash
did not explore as many recommendations. Gemini 2.0 Flash acknowledged the hypothetical scenario
while GPT-o3 did not. The conclusion from these anecdotal exchanges is that both models, one LLM
and one LRM, can select solutions, can provide solutions to previously unseen problem contexts, and
seem to adapt with the information they have. They still add too much content, which represents low
precision. These transcripts are also available at the GitHub link.</p>
        <p>It was in the third example that the models produced a more convincing illustration that they can
indeed find solutions in their data for the most unlikely problems. We asked them to find materials that could
be used to isolate and insulate electric wires because, due to an alien invasion, we could no longer use
any petroleum-based products. Both web-based versions of Claude 3.7 Sonnet with extended thinking
and GPT-o3 brought up materials used in the 1800s. Again, it was not an unseen problem; there were
records of those materials in their data, but the models adapted those recipes to materials that can be
currently sourced. This exchange showed that both models converged to very similar choices and that these
models can adapt to unusual circumstances.</p>
        <p>The ability to perform Choice seems to be hindered by alignment concerns, as the examples in
the medical domain suggest. We also observe the very low precision of the responses, where even if the
model has a solution that is a great fit for the problem, it still adds alternative solutions. This might be
due to the goal of providing guidance to humans or simply to reaching a minimum number of tokens to
guarantee profitability. Further studies are needed to determine the ability of 2025's models to help
society imagine novel solutions to its problems.</p>
        <p>The main challenge in evaluating LLMs is that they are designed for lengthy generations with low
precision, allowing for high recall of contents. The broad data used for training enables them to return
correct answers to most questions, even if among wrong ones. Using any existing AI algorithm as a
baseline is challenging, as AI algorithms typically execute reasoning tasks such as design, classification,
prescription, and configuration, not generation for guidance to humans, which LLMs are designed and
calibrated (for alignment) to produce. Humans are likely poor baselines because when humans know something,
they will provide answers with high precision, making their responses not suitable comparisons.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. CBR-LLM Synergies</title>
      <p>As discussed in the literature [4], there are many opportunities to explore synergies between CBR and
LLMs. Based on this work, it seems that any hypothesis stating that 2025's models cannot examine
problems and creatively imagine and evaluate solutions to those problems could be easily debunked. We
did not demonstrate that they can, or that they engage in any reasoning, but we observed that 2025's models
can answer any question competently, albeit with low precision. Consequently, if society faces a novel
problem, current models can help brainstorm alternatives, but they will likely not indicate one optimal
solution. The problem we have now is how LLMs could help CBR to examine problems and creatively
imagine and evaluate solutions to those problems, and how CBR could help LLMs improve their precision.</p>
      <p>1We asked about a hypothetical chest infection looking for a diagnosis in web-based Claude 3.7 Sonnet with extended
thinking enabled and GPT-o3. Complete transcriptions are at the GitHub link provided in Section 2.</p>
      <p>Given their ability to execute Intelligence, Design, and Choice, as per Simon's model [5], LLMs
seem to be a rich source of data for building CBR systems. Based on the analysis in this paper, we observed
that LLMs have the capacity to produce generations that include the connections between problems and
solutions. The question now is whether this data source could provide any potential benefits if used as
a new knowledge container. Could this help CBR systems examine problems and creatively imagine and
evaluate solutions to those problems? Or should CBR systems be simply used as an external module to
LLMs to guarantee high precision? Or should CBR be used in this capacity of promoting high precision
for selecting data to fine-tune new generations of LLMs?</p>
    </sec>
    <sec id="sec-5">
      <title>4. Related Work</title>
      <p>Classical decision theories, such as Simon’s intelligence-design-choice framework [5] and Huber’s
extended problem-solving model [6], describe decision-making as a structured process of perceiving
information, generating alternatives, and selecting strategies based on constraints. These foundational
models offer a lens for analyzing whether LLMs exhibit comparable internal coherence in solving both
goal-oriented tasks and generative responses that aim at general guidance.</p>
      <p>Huang and Chang [17] review the literature on reasoning in LLMs and warn that fluent or coherent
responses often obscure the absence of genuine inference. They argue for distinguishing between
coherence and grounded reasoning, motivating evaluations that test not just outputs, but the intermediate
structure of responses. Plaat et al. [18] synthesize recent work on reasoning and planning in agentic
LLMs. They note a proliferation of agent tasks framed as reasoning, yet find little evidence of internal
structure linking problem analysis to response generation. Their synthesis underlines the need for
metrics that evaluate whether models internally coordinate their outputs with problem elements.</p>
      <p>Huang et al. [19] assess LLM behavior in multi-agent games using the GAMA(γ)-Bench framework.
Although they report strategic variation between models, their analysis interprets variation in
behavioral output as signs of adaptive intelligence, without identifying whether models exhibit consistent
procedures to integrate rules, goals, and context. This raises questions about what strategic performance
reflects: reasoning, overfitting, or stylistic mimicry.</p>
      <p>Schaefer et al. [20] challenge the notion that recent performance gains indicate emergent reasoning
abilities. They argue that what appears to be cognitive sophistication is better explained by quirks of
benchmark design and scaling laws, casting doubt on the idea that high few-shot performance equates
to structured inference.</p>
      <p>Chen et al. [21] empirically examine the fidelity of chain-of-thought reasoning and find that models
often produce explanations that do not reflect the process used to derive the final answer. Their results
suggest that explanations may serve a rhetorical function, dissociated from problem-solving behavior.
For this reason, we do not explore the concept of explanatory contents in this paper, and limit our
analysis to what LLMs can reveal about connection between problems and solutions.</p>
      <p>Gubelmann [22] argues from a Wittgensteinian standpoint that language models can produce
context-appropriate responses without understanding in the propositional sense. Although such models exhibit
linguistic competence, their outputs are not governed by belief-like internal states or processes,
reinforcing skepticism about claims of reasoning. Min et al. [12] introduce FActScore to evaluate factual
consistency in long-form outputs. Although focused on truthfulness, the framework highlights the
discrepancies between fluent text and verifiable content, illustrating the challenges of attributing structured
reasoning to surface-level explanations.</p>
      <p>Mugleston et al. [23] explore whether LLMs can be said to possess knowledge. They propose that
models encode a form of compressed, generative knowledge via statistical patterns, but do not exhibit
reflective access or structured deliberation akin to human cognition. They discuss three forms of
reasoning. A priori reasoning, as exemplified in Kantian philosophy, derives conclusions from premises
independently of experience, relying on logical or conceptual necessity [23]. In contrast,
transformer-based language models generate outputs by predicting the next token based on statistical regularities in
training data. These models produce reasoning-like behavior through large-scale pattern recognition,
without encoding explicit inference procedures [24, 17]. A third category includes hybrid approaches
that support more structured forms of reasoning by incorporating additional mechanisms alongside the
core transformer architecture. These approaches include chain-of-thought prompting, which introduces
intermediate steps [24], as well as methods involving planning components, task decomposition,
memory systems, and symbolic logic integration [21, 25, 26]. Together, these configurations form a
class of systems referred to here as Large Reasoning Models, which are developed to produce more
coherent, interpretable, and context related responses.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and Future Work</title>
      <p>This paper examines the question of whether a set of current LLMs (i.e., Gemini 2.0 Flash, Gemini 1.5 Pro,
Claude 3.7 Sonnet, GPT-4o, GPT-4.1, GPT-o3) can execute the second step in Simon's decision-making
model [5]. This question serves to determine whether such models can examine problems and
imagine solutions for them. The results show that 2025's LLMs are able to brainstorm solutions to problems,
and some can provide indications of knowing how problems and solutions connect. Preliminary
examination of whether these models can competently perform the Choice step of the said
decision-making model suggests that it may depend on the domain, as alignment efforts may preclude models
from taking a stand. It is our conclusion that 2025's models can brainstorm solutions to problems, but
their responses are low in precision because they include multiple solutions.</p>
      <p>This paper discusses whether this topic reveals novel opportunities to explore synergies between CBR
and LLMs. Aware of the limitations of CBR systems to creatively explore novel solutions to previously
unseen problems, we ask whether data from LLMs might become a source to add novel capabilities to
CBR systems. The goal for any new developments would be to provide solutions with high precision,
meeting this need left unmet by LLMs. Another direction is to have CBR modules help LLMs increase
their precision. Ultimately, we would like to have intelligent systems that can help us solve previously
unseen problems, those with which humans struggle the most.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors thank the reviewers for their excellent suggestions that improved this work. The first and
second authors were funded by the Defense Advanced Research Projects Agency (DARPA), contract
number FA8650-23-C-7317. The views, opinions and/or findings expressed are those of the authors and
should not be interpreted as representing the official views or policies of the Department of Defense or
the U.S. Government.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools to generate ideas for this work.</p>
      <p>[3] M. M. Richter, R. O. Weber, Case-based reasoning: a textbook, Springer-Verlag, Berlin Heidelberg,
2013.
[4] K. Bach, R. Bergmann, F. Brand, M. Caro-Martínez, V. Eisenstadt, M. W. Floyd, L. Jayawardena,
D. Leake, M. Lenz, L. Malburg, et al., Case-based reasoning meets large language models: A
research manifesto for open challenges and research directions (2025).
[5] H. Simon, Administrative Behavior: A Study of Decision-Making Processes in Administrative
Organizations, 4th ed., The Free Press, 1997.
[6] G. P. Huber, Managerial Decision Making, Scott, Foresman and Co., Glenview, IL, 1980.
[7] A. Asemi, A. Safari, A. A. Zavareh, The role of management information system (mis) and decision
support system (dss) for manager’s decision making process, International Journal of business
and management 6 (2011) 164–173.
[8] J.-t. Huang, W. Wang, E. J. Li, M. H. Lam, S. Ren, Y. Yuan, W. Jiao, Z. Tu, M. R. Lyu, Who is chatgpt?
benchmarking llms’ psychological portrayal using psychobench, CoRR (2023).
[9] N. Milano, M. Ponticorvo, D. Marocco, Comparing human expertise and large language models
embeddings in content validity assessment of personality tests, arXiv preprint arXiv:2503.12080
(2025).
[10] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, in: A. Rogers,
J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL
2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1049–1065.
[11] S. Kambhampati, Can large language models reason and plan?, Annals of the New York Academy
of Sciences 1534 (2024) 15–18.
[12] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi,
Factscore: Fine-grained atomic evaluation of factual precision in long form text generation, in:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023,
pp. 12076–12100.
[13] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet</p>
      <p>Physics Doklady 10 (1966) 707–710.
[14] OpenAI, Learning to reason with large language models, https://openai.com/index/
learning-to-reason-with-llms/, 2024. Accessed: 2024-05-01.
[15] G. DeepMind, Large language models self-discover reasoning structures, https://deepmind.google/
research/publications/64816/, 2024. Accessed: 2024-05-01.
[16] Anthropic, Tracing the thoughts of a large language model, https://www.anthropic.com/research/
tracing-thoughts-language-model, 2024. Accessed: 2024-05-01.
[17] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, arXiv preprint
arXiv:2212.10545 (2023).
[18] A. Plaat, et al., Reasoning, planning, and acting in large language model agents: A survey, arXiv
preprint (2025).
[19] J.-t. Huang, E. J. Li, M. H. Lam, et al., How far are we on the decision-making of llms? evaluating
llms’ gaming ability in multi-agent environments, in: International Conference on Learning
Representations (ICLR), 2025.
[20] J. Schaefer, et al., Are emergent abilities of large language models a mirage?, Transactions on</p>
      <p>Machine Learning Research (TMLR) (2023).
[21] Y. Chen, J. Benton, et al., Reasoning models don’t always say what they think, Anthropic (2024).
[22] S. Gubelmann, A loosely wittgensteinian conception of the linguistic understanding of large
language models like bert and gpt, Philosophy and Technology (2023).
[23] J. Mugleston, V. H. Truong, et al., Epistemology in the age of large language models, Knowledge 5
(2025) 3.
[24] J. Wei, Y. Tay, R. Bommasani, C. Rafel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou,
D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent abilities of
large language models, Transactions on Machine Learning Research (2022).
[25] M. Pink, Q. Wu, V. A. Vo, J. S. Turek, J. Mu, A. Huth, M. Toneva, Position: Episodic memory is the
missing piece for long-term llm agents, arXiv preprint arXiv:2502.06975 (2025).
[26] Z. Yang, A. Ishay, J. Lee, Coupling large language models with logic programming for robust and
general reasoning from text, in: Findings of the Association for Computational Linguistics: ACL
2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 5186–5219.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Scheutz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sarathy</surname>
          </string-name>
          , From thinking to inventing,
          <source>NSF 2026 Idea Machine Grand Prize Winner</source>
          ,
          <year>2019</year>
          . URL: https://www.nsf.gov/about/history/big-ideas.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Schank</surname>
          </string-name>
          ,
          <article-title>Dynamic memory revisited</article-title>
          , Cambridge University Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>