<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intermediate Languages Matter: Formal Languages and LLMs afect Neurosymbolic Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Beiser</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Penz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nysret Musliu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, the contributing factors to the success of Neurosymbolic LLM reasoning remain unclear. This paper demonstrates that one previously overlooked factor is the choice of the formal language. We introduce the intermediate language challenge: selecting a suitable formal language for neurosymbolic reasoning. By comparing four formal languages across three datasets and seven LLMs, we show that the choice of formal language afects both syntactic and semantic reasoning capabilities. We also discuss the varying efects across diferent LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;logical reasoning</kwd>
        <kwd>neurosymbolic approaches</kwd>
        <kwd>LLM/AI agents</kwd>
        <kwd>prompting</kwd>
        <kwd>few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Instruction:
"[Your] task is to parse the problem description [..]
into a [formal language]. [..] A correctly parsed
example is given below. [..]"
Correctly Parsed Example:
"Fae is a cat, all cats are mammals, [..]
Facts: cat(fae).</p>
      <p>Rules: mammal(X) :- cat(X). [..]"
Natural Language
Problem Formulation</p>
      <p>Output</p>
      <p>In-context-learning</p>
      <sec id="sec-1-1">
        <title>Large Language Model</title>
        <p>Solution Human Readable
Formal (Intermediate) Language
Translated Problem</p>
        <p>Solution</p>
      </sec>
      <sec id="sec-1-2">
        <title>Symbolic Reasoner</title>
        <p>
          Our experiments show that the choice of formal language matters: first-order logic outperforms logic
programming languages. This paper is the short version of Intermediate Languages Matter: Formal
Choice Drives Neurosymbolic LLM Reasoning [
          <xref ref-type="bibr" rid="ref8">9</xref>
          ]. Here, we focus on the main results of the comparison
of diferent formal languages, and (previously not shown) comparison of LLMs.
        </p>
        <p>Structure. After this introduction we will present the necessary preliminaries and discuss related work
(Section 2). We continue to introduce our main hypothesis - the intermediate language problem - that
the choice of formal language afects reasoning performance (Section 3), which we follow with our
experimental setup (Section 4), and our experimental results (Section 5). We close our paper with our
conclusions in Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>We briefly present the necessary background material and definitions for understanding the paper.
Recall that the main objective of this study is to compare the reasoning performance of diferent formal
languages on modern LLMs. By taking the perspective of end-users, we treat LLMs as immutable
blackbox next-token predictor machines. Therefore, we are primarily interested in what efects diferent
prompting strategies have on the reasoning performance and consider the efects of other techniques,
such as fine-tuning, as out of scope. Throughout this paper, the terms syntax and semantics are used in
their formal language sense.</p>
      <sec id="sec-2-1">
        <title>2.1. Chain-of-Thought (CoT) prompting</title>
        <p>
          Chain-of-Thought (CoT) prompting is an in-context-learning (ICL) technique which improves the
reasoning capabilities of LLMs by adding additional information to a prompt [
          <xref ref-type="bibr" rid="ref4">5</xref>
          ]. CoT nudges the LLM
to mimic a reasoning chain, where we show an example in the next listing.
1 The following example showcases the line of reasoning you have to follow :
2 ---- Question
---3 Each cat is a carnivore . Fae is a cat .
4 True or false : Fae is a carnivore
5 ---- Reasoning
---6 Fae is a cat . Each cat is a carnivore . So Fae is a carnivore .
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Neurosymbolic LLM Reasoning</title>
        <p>Figure 1 depicts the high-level schematics of neurosymbolic LLM reasoning. A natural language-posed
problem is translated into its formal language representation by using ICL. ICL comprises of an
ICLinstruction and an ICL-example. The instruction describes the general task, while the example showcases
how to translate the natural language-posed problem into a formal language. We refer to the formal
language of the ICL-example, as the chosen formal language.</p>
        <p>
          In a second step, the symbolic reasoner solves the problem by obtaining a solution from the formal
representation, which can be either re-translated into natural language or directly used as output. We
do not employ backup strategies and use a as close as possible deterministic prompting (temperature
0), as we are interested in the unfiltered afect of the formal language on reasoning performance. We
thereby difer from other related approaches like Logic-LM [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ], Logic-LM++ [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ], and LINC [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Related Work</title>
        <p>
          Improving LLM’s reasoning capability was approached by diferent angles. Model improvements
include fine-tuning or pre-training to improve numerical capabilities [
          <xref ref-type="bibr" rid="ref10">11</xref>
          ] or syntax recognition of
ASP with LLASP [
          <xref ref-type="bibr" rid="ref11">12</xref>
          ]. Prompting techniques are widely used, such as CoT prompting, part of the
emergent ICL or few-shot-learning capability [
          <xref ref-type="bibr" rid="ref12">13</xref>
          ]. CoT improves LLMs’ performance on reasoning
tasks [
          <xref ref-type="bibr" rid="ref4">5</xref>
          ]. Recent reasoning-focused LLMs, like DeepSeek-R1 utilize internal CoT [
          <xref ref-type="bibr" rid="ref13">14</xref>
          ]. In contrast to
these approaches we utilize CoT prompting and neurosymbolic AI. Neurosymbolic AI [
          <xref ref-type="bibr" rid="ref14">15</xref>
          ] is a broad
ifeld which ranges from diferentiable logic [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ] over visual question answering [
          <xref ref-type="bibr" rid="ref16">17</xref>
          ], to LLMs [
          <xref ref-type="bibr" rid="ref6 ref7">7, 8</xref>
          ].
For logical reasoning tasks in particular, Logic-LM [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ] and LINC [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ] are two proposed neurosymbolic
approaches that combine LLMs with symbolic solvers. They translate a natural languages into a formal
language - called autoformalization [
          <xref ref-type="bibr" rid="ref17 ref18">18, 19</xref>
          ]. Logic-of-Thought, which tackles logic-puzzles with a
neurosymbolic approach [
          <xref ref-type="bibr" rid="ref19">20</xref>
          ], is also related. Although prior work employs an intermediate language,
it seldom justifies the choice. We show — empirically — that the selected language decisively shapes
reasoning performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Intermediate Language Challenge</title>
      <p>We proceed to define our intermediate language challenge for neurosymbolic LLM reasoning. We
assume to have given a natural language-posed reasoning problem  and a set of possible formal
languages ℒ.</p>
      <p>Definition 1. The intermediate language challenge is the task of choosing a formal language  ∈ ℒ for
solving  with a high reasoning accuracy.</p>
      <p>
        Inherent to the intermediate language challenge is autoformalization [
        <xref ref-type="bibr" rid="ref17">18</xref>
        ].
      </p>
      <p>Definition 2. Let  ∈ ℒ be a fixed formal language. Then, autoformalization aims for automatic and
correct translation of  into .</p>
      <p>While autoformalization is concerned with the correct translation from natural language into a fixed
formal language , the intermediate language challenge is about choosing a suitable formal language
′ ∈ ℒ s.t. autoformalization can be done efectively. We identify two root causes of the intermediate
language problem: (i) Syntax afects LLMs’ reasoning performance, and (ii) one logical problem can be
translated into multiple formal languages.</p>
      <p>
        Syntax afects LLMs’ reasoning performance. Consider the following two logical reasoning
problems: (1) “Tommi is a tumpus. Each tumpus is a wumpus. Is Tommi a wumpus?” (2) “Tommi is a cat.
Each cat is an animal. Is Tommi an animal?” Recent work suggests that, on average, LLMs perform
better for scenarios of type (2) than type (1) [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ]. From a semantic perspective, both scenarios require
the application of modus ponens. Thus, as the only diference lies in the syntax, we can conclude that
the syntax afects LLMs’ reasoning capabilities. Going back to formal languages, observe that the syntax
of formal languages difers. Therefore, we conclude that the choice of formal language afects LLMs’
reasoning capabilities.
      </p>
      <p>Logical problems can be encoded in diferent formal languages. Take the logical reasoning
problem (2) from the paragraph above. This problem can be encoded in diferent formal languages, such
as logic programming or first-order logic (FOL), while maintaining semantic correctness.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Setup</title>
      <p>
        To show the impact of the intermediate language challenge, we investigate a set of formal languages
ℒ = {Pyke, ASP, NLTK, FOL}. We conduct experiments on three diferent datasets, ProntoQA [ 1],
Proof Writer [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ], and FOLIO [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ]. Let  be a given dataset, then each data instance  ∈  can be
considered a reasoning problem. Each  is translated into a formal language by the LLM according to a
specification (prompting style). We prompt the LLM with a prompting style that adheres to Figure 1
i.e., we provide an ICL instruction and an ICL example. We use a set of prompting styles, where they
difer in the syntax of the ICL example, such as wrapping the example in markdown syntax. Importantly,
we enable comparability between formal languages by using the same set of prompting styles for each
formal language.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Formal Languages</title>
        <p>We will provide a brief overview of the formal languages ℒ used for our experiments.
Pyke: The logic programming derivative Pyke [23] expresses rules similar to if -then statements. Pyke
derives conclusions by forward, or backward chaining algorithms.</p>
        <p>ASP: In the non-monotonic logic programming paradigm Answer Set Programming (ASP) [24, 25] a
program is written as a set of rules, which is first grounded [26] and then solved [27].
NLTK: The natural language toolkit [28] is a Python library that enables an integration of FOL with
Prover9 [29]. We assume familiarity with the semantics of FOL.</p>
        <p>FOL: We assume familiarity with the syntax and semantics of FOL. For our experiments, we implemented
a parser that translates FOL to NLTK formulas, which are then solved by Prover9.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets</title>
        <p>We perform experiments on three datasets. We used one partly hand-crafted ICL-example (training
data) per dataset/formal language, which is not part of the test set. Each test set configuration resembles
the configuration of Logic-LM.</p>
        <p>ProntoQA [1]. The ProntoQA dataset is a generated dataset. We use the fictional character version
with a reasoning depth of 5. A random answer has a probability of 50% for getting a correct answer
(closed-world assumption - CWA), and a test set with 500 samples is used.</p>
        <p>
          Proof Writer [
          <xref ref-type="bibr" rid="ref20">21</xref>
          ]. Proof Writer is a generated dataset. We chose a reasoning depth of 5. A random
answer has a probability of about 33% to get a correct answer (open-world assumption - OWA). The
test set has 600 samples.
        </p>
        <p>
          FOLIO [
          <xref ref-type="bibr" rid="ref21">22</xref>
          ]. FOLIO is a (partly) expert-written dataset. A random answer is correct with about
33% (OWA). The FOLIO test set has 204 samples. We do not use ASP and Pyke on FOLIO, as FOLIO
instances require classical logic concepts which are efectively impossible to encode in standard logic
programming.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Large Language Models</title>
        <p>We compare the formal languages on seven LLMs, ranging from 8B to 671B parameters. For all
experiments we set the temperature to 0, to obtain a near-deterministic behavior. We restricted the
maximum number of new tokens to be 2048 and did not perform any additional modifications to the
LLMs. We are primarily interested in how the intermediate language afects small language models
(SLMs) with ≈ 8 B parameters, due to their lower resource consumption. Further, we focus on chat
models, as reasoning models build upon them. We used the following LLMs of approximately 8B
parameters: GPT-4o-mini1, Ministral-8B2, Llama-8B3. and DeepSeek-8B4. To study the efects when
using bigger models, we additionally perform experiments on DeepSeek-32B5 (≈ 32 B parameters) and
DeepSeek-V3 (≈ 671 B parameters) models. To verify the results on state-of-the-art reasoning models
we performed benchmarks on DeepSeek-R16 as well. We prompted DeepSeek-R1 with both 2048 and
20480 max-output-tokens, due to increased output token generation of the reasoning model.
1https://platform.openai.com/docs/models/gpt-4o-mini
2https://mistral.ai/news/ministraux
3https://openrouter.ai/meta-llama/llama-3.1-8b-instruct
4https://openrouter.ai/deepseek/deepseek-r1-distill-llama-8b
5https://openrouter.ai/deepseek/deepseek-r1-distill-qwen-32b
6https://api-docs.deepseek.com/news/news1226
80%
Formal L7anguages
0%
Pyke
ASP
NLTK
FOL
1
0
%</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Baselines</title>
        <p>The chance is the probability of getting a correct answer by a random draw. Chance is 50% for ProntoQA,
as it has a CWA, and 33% for Proof Writer and FOLIO, as they have an OWA. Additionally, we use the
following baselines7.</p>
        <p>Std. - refers to standard prompting. The LLM is given a short instruction on the task, the natural
language-posed problem, and a short example of how the LLM shall answer the question.
CoT - refers to CoT prompting. It extends standard prompting by nudging the LLM to reason
step-bystep by employing CoT.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Experimental Evaluation</title>
        <p>We conduct our experiments on an adapted Logic-LM implementation. Our adaptation includes an
ASP symbolic solver based on Clingo [27], a new Pyke implementation, and an adapted NLTK/FOL
solver implementation. We conduct experiments for 4 formal languages and 8 prompting styles, leading
to 32 total experiments for ProntoQA and Proof Writer. Including the 4 baseline experiments, we
report 36 experiments, respectively. For FOLIO, we conduct 20 experiments in total (Pyke and ASP
cannot be measured). This leads to a total of 92 experiments per LLM and 644 experiments in total.
The overall number of queries is 43680 per LLM and overall 305760. Let # be the dataset size,
#EXEC the number of correctly parsed instances, and #TRUE the number of correctly solved instances.
Syntactically correct refers to a translation that adheres to the defined formal language, whereas correctly
solved refers to a correct syntactical translation and the correct output of the solver. The execution-rate
is the fraction of correct syntactical outputs (Exec-Rate, ##EXEC ), execution-accuracy, is the fraction
of correctly solved instances of all syntactically correct ones (Exec-Acc, ##ETXRUECE ), and overall-accuracy
is the fraction of correctly solved instances over the entire dataset (Overall-Acc, ##TR#DUE ). Observe:
Overall-Acc = Exec-Acc · Exec-Rate . Baselines which do not use neurosymbolic reasoning are considered
to have an execution-rate of 100%, while their execution-accuracy resembles their overall-accuracy, as
they are not required to adhere to a formal language.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We show the experimental results in Figure 2 and Table 1. In this paper we focus on the main results of
the formal languages and provide further information on reasoning model performance and averaged
LLM performance. Further results are shown in the main paper, such as an ablation study on eight
diferent prompting styles and how well formal languages work on each dataset. Overall we report that
FOL achieves the best results, followed by NLTK, ASP, and lastly Pyke. We report these findings in
7We additionally show the results of Neurosymbolic baselines in the main paper.</p>
      <p>Avg.</p>
      <p>Overall Results</p>
      <p>SEM</p>
      <p>GPT-4o-mini</p>
      <p>SEM</p>
      <p>Avg.</p>
      <p>Ministral-8B</p>
      <p>SEM</p>
      <p>Std.</p>
      <p>CoT
Pyke
ASP
NLTK
FOL
Lang.</p>
      <p>Std.</p>
      <p>CoT
Pyke
ASP
NLTK
FOL
Lang.</p>
      <p>Std.</p>
      <p>CoT
Pyke
ASP
NLTK
FOL
/
/
/
/
/
/</p>
      <p>Figure 2 (left) and Table 1 (left top), where we show averaged results with the standard error of the mean
(SEM). For averaging the formal languages, we compute the average across all LLMs, prompting styles,
and the datasets ProntoQA and Proof Writer, leading to  = 112. To account for a fair comparison and
incorporation of the results of the reasoning model, we only used the 20480 token results for
DeepSeekR1 for these averages. For the problems in the datasets, we do not encounter dificulties when solving in
terms of intractability - a combinatorial explosion in the solver - we never exceed 60s computation time.
Therefore, we are not required to use special strategies for tackling intractability, such as symmetry
breaking [30] or tackling the ASP bottleneck [31]. In Figure 2 (right), we show averaged scatter plots of
the execution-rate (x-axis) vs. execution-accuracy (y-axis). Each dot represents a formal language with
a specific prompting style, averaged across all LLMs and ProntoQA, and Proof Writer datasets (  = 14).
The overall-accuracy is obtained by multiplying a point’s x-position with its respective y-position. We
report that Pyke performs approximately equally well on execution-rate and execution-accuracy, while
ASP’s execution-rate tends to stay relatively high. Further, NLTK’s execution-accuracy is relatively
high, as it stays above 60%. FOL’s behavior is similar to NLTK’s. Still FOL has a higher execution-rate,
resulting in a higher overall-accuracy.</p>
      <p>In Table 1 (all except left top) we show the individual results of the formal languages on the LLMs.
We report that the results difer widely between LLMs. Considering the average case, FOL achieves
the highest results on three (GPT-4o-mini, Ministral-8B, and DeepSeek-8B), NLTK on two (Llama-8B
and DeepSeek-V3) and ASP on three (DeepSeek-32B, DeepSeek-R1, and DeepSeek-R1 (20480)) LLMs.
However, these values are often in the range of the SEM, therefore, inconclusive. For example, on
GPT-4o-mini FOL has 72.85% and an SEM of 4.75%, whereas NLTK has 72.74% and an SEM of
5.42%. Regarding the best results, FOL achieves on four (GPT-4o-mini, Ministral-8B, DeepSeek-V3, and
DeepSeek-R1 (20480)), NLTK on two (Llama-8B and DeepSeek-V3), and the baselines on three
(DeepSeek8B, DeepSeeek-32B, and DeepSeek-R1) LLMs. We report the largest performance improvements on SLMs,
such as GPT-4o-mini, Ministral-8B, and Llama-8B. However, on SLMs also the performance fluctuates
the most. For example, ASP does not perform well on Llama-8B, whereas Pyke does not perform well
on DeepSeek-8B. The reasoning model DeepSeek-R1 needs more output tokens to generate suitable
responses, when compared to the chat models. Although not a surprise, this has consequences. While for
all chat LLMs a max-output-token size of 2048 is suficient, for DeepSeek-R1 only a max-output-token
size of 20480 yields reliably non-truncated results.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Discussion</title>
      <p>
        Logical reasoning tasks pose a problem to LLMs, as they remain limited in their ability to perform
probabilistic retrieval [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]. Neurosymbolic approaches help, by constraining the probabilistic nature to
the translation step of a natural language-posed problem into a formal language [
        <xref ref-type="bibr" rid="ref6 ref7">7, 8</xref>
        ]. Therefore, the
reasoning step itself is not afected by the probabilistic nature of LLMs.
      </p>
      <p>In this paper, we discuss the efect of the chosen formal language on a model’s reasoning performance.
We introduce the intermediate language challenge, which refers to the problem of choosing a suitable
formal language for neurosymbolic reasoning. In our experiments, we compare Pyke, ASP, NLTK, and
FOL as formal languages. The results show that FOL performs best, followed by NLTK and ASP, with
Pyke coming last. When analyzing the behavior of diferent formal languages we observed translation
errors which were unique to the formal language, and errors common across diferent formal languages.
For Pyke in particular, we notice that LLMs format the output incorrectly, by missing line breaks. When
translating to ASP, LLMs struggle to distinguish the two notions of negation: strong, written as − , and
default, written as . This results in program statements such as -not p1(wren), which are syntactically
incorrect. The syntactic errors between NLTK and FOL are similar. Examples include incorrectly setting
parentheses or using a predicate with multiple arities - e.g., 14() and 14(,  ).</p>
      <p>Our results between diferent LLMs vary widely and suggest that for neurosymbolic LLM reasoning
the usage of huge LLMs is not always justified, as for our tasks we already achieved 100% max-accuracy
on GPT-4o-mini and Ministral-8B (8B models). Further, while for all chat models max-output-tokens of
2048 was suficient, this was not the case for DeepSeek-R1, where we needed more (we increased the
parameter to 20480). Interestingly, our results indicate that ASP uses less output tokens, as it achieved
decent results on DeepSeek-R1 (2048). Further, we observed the largest improvements w.r.t. to the
baselines and the largest variations in performance on small language models (SLMs). We hypothesize
that the performance diferences can be explained with a lack or abundance of the formal languages in
the training data. This opens the door for crafting custom intermediate languages and fine-tuning the
SLMs on these custom languages, which we plan to explore in a future study.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI O3 and Grammarly in order to: Grammar
and spelling check. The seven LLMs (GPT-4o-mini, Ministral-8B, Llama-8B, DeepSeek-8B, DeepSeek-32B,
DeepSeek-V3 and DeepSeek-R1) were used, as discussed in Section 4.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements References</title>
      <p>This research was supported by Frequentis, FWF grant 10.55776/COE12, and the Dieberger-Peter
Skalicky Stipend.</p>
      <p>[1] A. Saparov, H. He, Language Models Are Greedy Reasoners: A Systematic Formal Analysis of</p>
      <p>Chain-of-Thought, ICLR23 (2023).
[23] B. Frederiksen, Applying Expert System Technology to Code Reuse with Pyke, 2008. URL: https:
//pyke.sourceforge.net/PyCon2008-paper.html.
[24] M. Gelfond, N. Leone, Logic programming and knowledge representation—The A-Prolog
perspective, AI 138 (2002) 3–38. doi:10.1016/S0004-3702(02)00207-2.
[25] T. Schaub, S. Woltran, Special Issue on Answer Set Programming, Künstliche Intell. 32 (2018)
101–103. doi:10.1007/s13218-018-0554-8.
[26] R. Kaminski, T. Schaub, On the Foundations of Grounding in Answer Set Programming, TPLP23
23 (2023) 1138–1197. doi:10.1017/S1471068422000308.
[27] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, T. Schaub, P. Wanko, Theory Solving Made</p>
      <p>Easy with Clingo 5, ICLP16 52 (2016) 1–15. doi:10.4230/OASICS.ICLP.2016.2.
[28] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly Media Inc., 2009.
[29] W. McCune, Prover9 and Mace4, 2010. URL: http://www.cs.unm.edu/~mccune/Prover9.
[30] T. Fahle, S. Schamberger, M. Sellmann, Symmetry breaking, in: T. Walsh (Ed.), CP01, 2001, pp.</p>
      <p>93–107.
[31] A. Beiser, M. Hecher, K. Unalan, S. Woltran, Bypassing the ASP Bottleneck: Hybrid Grounding by
Splitting and Rewriting, in: IJCAI24, 2024, pp. 3250–3258. doi:10.24963/ijcai.2024/360.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Lampinen</surname>
          </string-name>
          , I. Dasgupta,
          <string-name>
            <given-names>S. C. Y.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Sheahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Creswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <article-title>Language models, like humans, show content efects on reasoning tasks, PNAS Nexus 3 (</article-title>
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .1093/pnasnexus/pgae233.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Panas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Belle</surname>
          </string-name>
          ,
          <article-title>Can Large Language Models Put 2 and 2 Together? Probing for Entailed Arithmetical Relationships</article-title>
          ,
          <source>in: NeSy24</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>276</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -71170-1_
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models</article-title>
          ,
          <source>in: EMNLP20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6862</fpage>
          -
          <lpage>6868</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .emnlp-main.
          <volume>557</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chainof-Thought Prompting Elicits Reasoning in Large Language Models</article-title>
          ,
          <source>in: NeurIPS22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Havaldar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apidianaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Faithful Chain-of-Thought Reasoning</article-title>
          , in: IJCNLP23,
          <year>2023</year>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>329</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          . ijcnlp-main.
          <volume>20</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Albalak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Logic-LM:
          <article-title>Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning</article-title>
          , in: EMNLP23,
          <year>2023</year>
          , pp.
          <fpage>3806</fpage>
          -
          <lpage>3824</lpage>
          . doi:
          <volume>10</volume>
          .18653/ v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>248</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Olausson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lipkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. Levy</surname>
          </string-name>
          ,
          <article-title>LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers</article-title>
          ,
          <source>in: EMNLP23</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>5153</fpage>
          -
          <lpage>5176</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>313</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Beiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Penz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Musliu</surname>
          </string-name>
          ,
          <source>Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.17216.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirtania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radhakrishna</surname>
          </string-name>
          , LOGIC-LM+
          <article-title>+: Multi-Step Refinement for Symbolic Formulations</article-title>
          ,
          <source>in: ACL24</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .nlrse-
          <volume>1</volume>
          .6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Injecting Numerical Reasoning Skills into Language Models</article-title>
          ,
          <source>in: ACL20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>946</fpage>
          -
          <lpage>958</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .acl-main.
          <volume>89</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Coppolillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Calimeri</surname>
          </string-name>
          , G. Manco,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricca</surname>
          </string-name>
          , LLASP:
          <article-title>Fine-tuning Large Language Models for Answer Set Programming</article-title>
          ,
          <source>in: KR24</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>844</lpage>
          . doi:
          <volume>10</volume>
          .24963/kr.2024/78.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanahan</surname>
          </string-name>
          ,
          <source>Talking about Large Language Models, Com. ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>68</fpage>
          -
          <lpage>79</lpage>
          . doi:
          <volume>10</volume>
          .1145/ 3624724.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Tawaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Algorithm of thoughts: enhancing exploration of ideas in large language models</article-title>
          ,
          <source>in: ICML24</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. d.</given-names>
            <surname>Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <surname>Neurosymbolic</surname>
            <given-names>AI</given-names>
          </string-name>
          :
          <article-title>the 3rd wave</article-title>
          ,
          <source>Artif Intell Rev</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>12387</fpage>
          -
          <lpage>12406</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10462-023-10448-w.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Badreddine</surname>
          </string-name>
          , A.
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Serafini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Spranger</surname>
          </string-name>
          , Logic Tensor Networks,
          <source>AI</source>
          <volume>303</volume>
          (
          <year>2022</year>
          )
          <article-title>103649</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.artint.
          <year>2021</year>
          .
          <volume>103649</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Eiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Geibinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Higuera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oetsch</surname>
          </string-name>
          ,
          <article-title>A logic-based approach to contrastive explainability for neurosymbolic visual question answering</article-title>
          ,
          <source>in: IJCAI23</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3668</fpage>
          -
          <lpage>3676</lpage>
          . doi:
          <volume>10</volume>
          .24963/ ijcai.
          <year>2023</year>
          /408.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Rabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Staats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <source>Autoformalization with Large Language Models</source>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2205.12615.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Hou,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>How Proficient Are Large Language Models in Formal Languages? An In-Depth Insight for Knowledge Base Question Answering</article-title>
          ,
          <source>in: ACL24</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , S.-T. Xia,
          <article-title>Logic-of-thought: Empowering large language models with logic programs for solving puzzles in natural language</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505. 16114.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Proof Writer: Generating Implications, Proofs, and
          <article-title>Abductive Statements over Natural Language</article-title>
          , in: IJCNLP21,
          <year>2021</year>
          , pp.
          <fpage>3621</fpage>
          -
          <lpage>3634</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . findings-acl.
          <volume>317</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riddell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , L. Benson, e. al.,
          <article-title>FOLIO: Natural Language Reasoning with First-Order Logic</article-title>
          ,
          <source>in: EMNLP24</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>22017</fpage>
          -
          <lpage>22031</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .emnlp-main.
          <volume>1229</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>